String Comparisons in SQL: The Longest Common Substring.

I’ve always wanted a SQL function that tells me the longest substring shared between two strings. As a present to myself, I’ve written one. I hope someone else finds it useful.

SQL isn’t particularly good at searching for strings within text. If you prepare things properly by creating inversion tables (inverted indexes), suffix trees or tries so as to allow it to do exact comparisons it is very quick, but this isn’t usually possible because data changes so quickly. You can use the brute-force search through text using wildcards, but this isn’t usually appropriate when dealing with large results. Of course, you can use add-in components such as full-text search to do the job for you in SQL Server

What have we got that is useful for comparing text?

Is the first string contained in the second string?

Select case when Patindex ('%first%','second') >0 then 'true' else 'False' end

-- useful with wildcard characters

-- or

Select case when Charindex ('first','Second')>0 then 'true' else 'False' end

Are the strings identical?

1

Select case when 'first'='Second' then 'true' else 'False' end
Is the first string at the start of the second string? Charindex (‘first’,’Second’)=1

1

Select case when Charindex ('first','Second')=1 then 'true' else 'False' end
Is the first string at the end of the second string?

1

Select case when Patindex ('%first','second') >0 then 'true' else 'False' end

What have we got that is just-about useless?

Soundex: Difference() and Soundex()

What is missing? Here are the main ones.

Built-in Regex Searches – CLR routines are fine, but it would be nice to have standardised built-in functions for doing this just as long as nobody expects them to be SARGable.
Longest common subsequence – What is the longest sequence of characters in common between the two strings
Longest Common Substring -what is the longest string in common between the two strings: The equivalent of the inner join
Levenshtein distance -how similar are the two strings? (how many edits are needed to get from one string to the other).

There are several others, such as ‘Hamming Difference’, ‘MostFreqKSDF‘ or the Jaccard Index, for judging the similarity of text. Many of them have esoteric uses in genetics or pattern recognition, whereas we’re much more interested in chores such as data cleansing and input validation.

Plenty of people would argue that we have no real reason to need anything complicated in terms of string comparison in SQL, particularly as it would be hard to scale it up to the size of relations we deal string comparison. There is some truth in this. Even the built-in string functions can be death to a query if used to filter results.

In this blog, I’ll tackle the Longest Common Substring, which is a common, and useful, function that tells you the longest common substring between two strings.

If you, for example, were to compare ‘And the Dish ran away with the Spoon’ with ‘away’, you’d get ‘away’ as being the string in common. Likewise, comparing ‘465932859472109683472’ with ‘697834859472135348’ would give you ‘8594721’

With any SQL function, you have slow performance, but this does not entirely preclude their use. String comparisons tend to be slow by the nature of sequence being so important. In the case of string comparison, you can very quickly get bogged down unless you ‘think relationally’. My first attempt at this routine, copying the standard ‘dynamic’ algorithm, comparing the first 1120 character paragraph of ‘Moby Dick’ with itself, took 28 seconds. A quick change to a more SQL-based way of doing it shortened the comparison to two seconds. Even this is extravagant by C# standards but string comparisons never come cheap.

The routine I ended up with isn’t portable beyond SQL Server or Sybase, since it uses ‘quirky update’. However, if anyone can come up with a more portable method that is as fast, I’d be delighted. I ended up deciding that not only would you want the string that matched, but where it was in both strings and how long it was. Come to think of it, if you are doing a DIFF between strings, then you’d want to know all the substrings, not just the longest one, and so I devised a table function that would give you that if required

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

--This will need a NUMBERS table, stocked with numbers. If you haven't got one

--this will create it automatically

IF NOT EXISTS (SELECT 1 FROM information_Schema.Tables

WHERE table_name='Numbers')

BEGIN

CREATE TABLE [dbo].[Numbers]

(

[number] [int],

CONSTRAINT [Index_Numbers] PRIMARY KEY CLUSTERED ([number] ASC)

ON [PRIMARY]

)

ON [PRIMARY]

END

IF NOT EXISTS(SELECT 1 FROM numbers WHERE number=99999)

BEGIN

TRUNCATE TABLE numbers

;WITH Digits(i) AS

(SELECT i

FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (0)) AS X(i))

INSERT INTO numbers(number)

SELECT (D6.i*1000000 +D5.i*100000 + D4.i*10000 + D3.i * 1000 + D2.i * 100

+ D1.i * 10 + D0.i + 1) AS seq

FROM Digits AS D0, Digits AS D1, Digits AS D2, Digits AS D3,

Digits AS D4, Digits AS D5, Digits AS D6

END

IF OBJECT_ID (N'LongestCommonSubstring') IS NOT NULL

DROP FUNCTION LongestCommonSubstring

CREATE FUNCTION LongestCommonSubstring

/**

summary: >

The longest common subSubstring (LCS) tells you the longest common substring between two strings.

If you, for example, were to compare 'And the Dish ran away with the Spoon' with 'away', you'd

get 'away' as being the string in common. Likewise, comparing '465932859472109683472' with

'697834859472135348' would give you '8594721'. This returns a one-row table that gives you the

length and location of the string as well as the string itself. It can easily be modified to give

you all the substrings (whatever your criteria for the smallest substring. E.g. two characters?

Author: Phil Factor

Revision: 1.0

date: 05 Dec 2014

example:

code: |

Select * from dbo.LongestCommonSubstring ('1234', '1224533324')

Select * from dbo.LongestCommonSubstring ('thisisatest', 'testing123testing')

Select * from dbo.LongestCommonSubstring ( 'findthishere', 'where is this?')

Select * from dbo.LongestCommonSubstring ( null, 'xab')

Select * from dbo.LongestCommonSubstring ( 'not beginning-middle-ending',

'beginning-diddle-dum-ending')

returns: >

the longest common subString as a string

**/

(

@firstString VARCHAR(MAX),

@SecondString VARCHAR(MAX)

)

RETURNS @hit TABLE --returns a single row table

--(it is easy to change to return a string but I wanted the location of the match)

(

MatchLength INT,--the length of the match. Not necessarily the length of input

FirstCharInMatch INT,--first character of match in first string

FirstCharInString INT,--first character of match in second string

CommonString VARCHAR(8000) --the part of the FirstString successfully matched

)

AS BEGIN

DECLARE @Order INT, @TheGroup INT, @Sequential INT

--this table is used to do a quirky update to enable a grouping only on sequential characters

DECLARE @Scratch TABLE (TheRightOrder INT IDENTITY PRIMARY KEY,TheGroup smallint, Sequential INT,

FirstOrder smallint, SecondOrder smallint, ch CHAR(1))

--first we reduce the amount of data to those characters in the first string that have a match

--in the second, and where they were.

INSERT INTO @Scratch ( TheGroup , firstorder, secondorder, ch)

SELECT Thefirst.number-TheSecond.number AS TheGroup,Thefirst.number, TheSecond.number, TheSecond.ch

FROM --divide up the first string into a table of characters/sequence

(SELECT number, SUBSTRING(@FirstString,number,1) AS ch

FROM numbers WHERE number <= LEN(@FirstString)) TheFirst

INNER JOIN --divide up the second string into a table of characters/sequence

(SELECT number, SUBSTRING(@SecondString,number,1) AS ch

FROM numbers WHERE number <= LEN(@SecondString)) TheSecond

ON Thefirst.ch= Thesecond.ch --do all valid matches

ORDER BY Thefirst.number-TheSecond.number, TheSecond.number

--now @scratch has all matches in the correct order for checking unbroken sequence

SELECT @Order=-1, @TheGroup=-1, @Sequential=0 --initialise everything

UPDATE @Scratch --now check by incrementing a value every time a sequence is broken

SET @Sequential=Sequential =

CASE --if it is not a sequence from the one before increment the variable

WHEN secondorder=@order+1 AND TheGroup=@TheGroup

THEN @Sequential ELSE @Sequential+1 END,

@Order=secondorder,

@TheGroup=TheGroup

--now we just aggregate it, and choose the first longest match. Easy

INSERT INTO @hit (MatchLength,FirstCharInMatch, FirstCharInString,CommonString)

SELECT TOP 1 ---just the first. You may want more so feel free to change

COUNT(*) AS MatchLength,

MIN(firstorder) FirstCharInMatch,

MIN(secondorder) AS FirstCharInString,

SUBSTRING(@SecondString,

MIN(secondorder),

COUNT(*)) AS CommonString

FROM @scratch GROUP BY TheGroup,Sequential

ORDER BY COUNT(*) DESC, MIN(firstOrder) ASC, MIN(SecondOrder) ASC

RETURN

END--and we do a test run

--do an outer apply to check the obvious flaws and raise an error

--if any erros appear.

IF EXISTS (

SELECT firstString, secondString,correct, LCS.*

FROM (VALUES

('Call me Ishmael. Some years ago...','Something','Some' ),

('unrestfulness','having little or no money in my purse, and nothing particular to interest me on shore','rest' ),

('1234563457','3456','3456' ),

('','',NULL ),

(NULL,'',NULL ),

('I find myself involuntarily pausing before coffin warehouses','Jailhouse rock','house'),

(',.-=dfgd%','-=','-='),

('protest is useless','I need to test this routine. Tests are valuable','test' )

)

AS X(FirstString,secondString, Correct)

OUTER APPLY dbo.LongestCommonSubstring(firstString, secondString) AS LCS

WHERE COALESCE(correct,'null')<>COALESCE(LCS.CommonString,'null')

)

RAISERROR ('the LongestCommonSubstring routine has broken',16,1)

Register for Simple Talk

String Comparisons in SQL: The Longest Common Substring.

What have we got that is useful for comparing text?

What have we got that is just-about useless?

What is missing? Here are the main ones.

About the author

Phil Factor

Phil Factor's contributions

Articles

Books

Top topics

Phil Factor's latest contributions:

Accessibility, and the Need for Ingenious Simplicity in Computer Scripts, Languages and Programs

Data Documents and Common Sense

AI and Databases

What have we got that is useful for comparing text?

What have we got that is just-about useless?

What is missing? Here are the main ones.

Recommended

About the author

Phil Factor

Phil Factor's contributions

Articles

Books

Top topics

Phil Factor's latest contributions:

Accessibility, and the Need for Ingenious Simplicity in Computer Scripts, Languages and Programs

Data Documents and Common Sense

AI and Databases