I’ve always wanted a SQL function that tells me the longest substring shared between two strings. As a present to myself, I’ve written one. I hope someone else finds it useful.
SQL isn’t particularly good at searching for strings within text. If you prepare things properly by creating inversion tables (inverted indexes), suffix trees or tries so as to allow it to do exact comparisons it is very quick, but this isn’t usually possible because data changes so quickly. You can use the brute-force search through text using wildcards, but this isn’t usually appropriate when dealing with large results. Of course, you can use add-in components such as full-text search to do the job for you in SQL Server
What have we got that is useful for comparing text?
- Is the first string contained in the second string?
1234Select case when Patindex ('%first%','second') >0 then 'true' else 'False' end-- useful with wildcard characters-- orSelect case when Charindex ('first','Second')>0 then 'true' else 'False' end
- Are the strings identical?
1Select case when 'first'='Second' then 'true' else 'False' end
- Is the first string at the start of the second string? Charindex (‘first’,’Second’)=1
1Select case when Charindex ('first','Second')=1 then 'true' else 'False' end
- Is the first string at the end of the second string?
1Select case when Patindex ('%first','second') >0 then 'true' else 'False' end
What have we got that is just-about useless?
- Soundex: Difference() and Soundex()
What is missing? Here are the main ones.
- Built-in Regex Searches – CLR routines are fine, but it would be nice to have standardised built-in functions for doing this just as long as nobody expects them to be SARGable.
- Longest common subsequence – What is the longest sequence of characters in common between the two strings
- Longest Common Substring -what is the longest string in common between the two strings: The equivalent of the inner join
- Levenshtein distance -how similar are the two strings? (how many edits are needed to get from one string to the other).
There are several others, such as ‘Hamming Difference’, ‘MostFreqKSDF‘ or the Jaccard Index, for judging the similarity of text. Many of them have esoteric uses in genetics or pattern recognition, whereas we’re much more interested in chores such as data cleansing and input validation.
Plenty of people would argue that we have no real reason to need anything complicated in terms of string comparison in SQL, particularly as it would be hard to scale it up to the size of relations we deal string comparison. There is some truth in this. Even the built-in string functions can be death to a query if used to filter results.
In this blog, I’ll tackle the Longest Common Substring, which is a common, and useful, function that tells you the longest common substring between two strings.
If you, for example, were to compare ‘And the Dish ran away with the Spoon’ with ‘away’, you’d get ‘away’ as being the string in common. Likewise, comparing ‘465932859472109683472’ with ‘697834859472135348’ would give you ‘8594721’
With any SQL function, you have slow performance, but this does not entirely preclude their use. String comparisons tend to be slow by the nature of sequence being so important. In the case of string comparison, you can very quickly get bogged down unless you ‘think relationally’. My first attempt at this routine, copying the standard ‘dynamic’ algorithm, comparing the first 1120 character paragraph of ‘Moby Dick’ with itself, took 28 seconds. A quick change to a more SQL-based way of doing it shortened the comparison to two seconds. Even this is extravagant by C# standards but string comparisons never come cheap.
The routine I ended up with isn’t portable beyond SQL Server or Sybase, since it uses ‘quirky update’. However, if anyone can come up with a more portable method that is as fast, I’d be delighted. I ended up deciding that not only would you want the string that matched, but where it was in both strings and how long it was. Come to think of it, if you are doing a DIFF between strings, then you’d want to know all the substrings, not just the longest one, and so I devised a table function that would give you that if required
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 |
--This will need a NUMBERS table, stocked with numbers. If you haven't got one --this will create it automatically IF NOT EXISTS (SELECT 1 FROM information_Schema.Tables WHERE table_name='Numbers') BEGIN CREATE TABLE [dbo].[Numbers] ( [number] [int], CONSTRAINT [Index_Numbers] PRIMARY KEY CLUSTERED ([number] ASC) ON [PRIMARY] ) ON [PRIMARY] END IF NOT EXISTS(SELECT 1 FROM numbers WHERE number=99999) BEGIN TRUNCATE TABLE numbers ;WITH Digits(i) AS (SELECT i FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (0)) AS X(i)) INSERT INTO numbers(number) SELECT (D6.i*1000000 +D5.i*100000 + D4.i*10000 + D3.i * 1000 + D2.i * 100 + D1.i * 10 + D0.i + 1) AS seq FROM Digits AS D0, Digits AS D1, Digits AS D2, Digits AS D3, Digits AS D4, Digits AS D5, Digits AS D6 END IF OBJECT_ID (N'LongestCommonSubstring') IS NOT NULL DROP FUNCTION LongestCommonSubstring GO CREATE FUNCTION LongestCommonSubstring /** summary: > The longest common subSubstring (LCS) tells you the longest common substring between two strings. If you, for example, were to compare 'And the Dish ran away with the Spoon' with 'away', you'd get 'away' as being the string in common. Likewise, comparing '465932859472109683472' with '697834859472135348' would give you '8594721'. This returns a one-row table that gives you the length and location of the string as well as the string itself. It can easily be modified to give you all the substrings (whatever your criteria for the smallest substring. E.g. two characters? Author: Phil Factor Revision: 1.0 date: 05 Dec 2014 example: code: | Select * from dbo.LongestCommonSubstring ('1234', '1224533324') Select * from dbo.LongestCommonSubstring ('thisisatest', 'testing123testing') Select * from dbo.LongestCommonSubstring ( 'findthishere', 'where is this?') Select * from dbo.LongestCommonSubstring ( null, 'xab') Select * from dbo.LongestCommonSubstring ( 'not beginning-middle-ending', 'beginning-diddle-dum-ending') returns: > the longest common subString as a string **/ ( @firstString VARCHAR(MAX), @SecondString VARCHAR(MAX) ) RETURNS @hit TABLE --returns a single row table --(it is easy to change to return a string but I wanted the location of the match) ( MatchLength INT,--the length of the match. Not necessarily the length of input FirstCharInMatch INT,--first character of match in first string FirstCharInString INT,--first character of match in second string CommonString VARCHAR(8000) --the part of the FirstString successfully matched ) AS BEGIN DECLARE @Order INT, @TheGroup INT, @Sequential INT --this table is used to do a quirky update to enable a grouping only on sequential characters DECLARE @Scratch TABLE (TheRightOrder INT IDENTITY PRIMARY KEY,TheGroup smallint, Sequential INT, FirstOrder smallint, SecondOrder smallint, ch CHAR(1)) --first we reduce the amount of data to those characters in the first string that have a match --in the second, and where they were. INSERT INTO @Scratch ( TheGroup , firstorder, secondorder, ch) SELECT Thefirst.number-TheSecond.number AS TheGroup,Thefirst.number, TheSecond.number, TheSecond.ch FROM --divide up the first string into a table of characters/sequence (SELECT number, SUBSTRING(@FirstString,number,1) AS ch FROM numbers WHERE number <= LEN(@FirstString)) TheFirst INNER JOIN --divide up the second string into a table of characters/sequence (SELECT number, SUBSTRING(@SecondString,number,1) AS ch FROM numbers WHERE number <= LEN(@SecondString)) TheSecond ON Thefirst.ch= Thesecond.ch --do all valid matches ORDER BY Thefirst.number-TheSecond.number, TheSecond.number --now @scratch has all matches in the correct order for checking unbroken sequence SELECT @Order=-1, @TheGroup=-1, @Sequential=0 --initialise everything UPDATE @Scratch --now check by incrementing a value every time a sequence is broken SET @Sequential=Sequential = CASE --if it is not a sequence from the one before increment the variable WHEN secondorder=@order+1 AND TheGroup=@TheGroup THEN @Sequential ELSE @Sequential+1 END, @Order=secondorder, @TheGroup=TheGroup --now we just aggregate it, and choose the first longest match. Easy INSERT INTO @hit (MatchLength,FirstCharInMatch, FirstCharInString,CommonString) SELECT TOP 1 ---just the first. You may want more so feel free to change COUNT(*) AS MatchLength, MIN(firstorder) FirstCharInMatch, MIN(secondorder) AS FirstCharInString, SUBSTRING(@SecondString, MIN(secondorder), COUNT(*)) AS CommonString FROM @scratch GROUP BY TheGroup,Sequential ORDER BY COUNT(*) DESC, MIN(firstOrder) ASC, MIN(SecondOrder) ASC RETURN END--and we do a test run go --do an outer apply to check the obvious flaws and raise an error --if any erros appear. IF EXISTS ( SELECT firstString, secondString,correct, LCS.* FROM (VALUES ('Call me Ishmael. Some years ago...','Something','Some' ), ('unrestfulness','having little or no money in my purse, and nothing particular to interest me on shore','rest' ), ('1234563457','3456','3456' ), ('','',NULL ), (NULL,'',NULL ), ('I find myself involuntarily pausing before coffin warehouses','Jailhouse rock','house'), (',.-=dfgd%','-=','-='), ('protest is useless','I need to test this routine. Tests are valuable','test' ) ) AS X(FirstString,secondString, Correct) OUTER APPLY dbo.LongestCommonSubstring(firstString, secondString) AS LCS WHERE COALESCE(correct,'null')<>COALESCE(LCS.CommonString,'null') ) RAISERROR ('the LongestCommonSubstring routine has broken',16,1) |
Load comments