For most purposes, you will get adequately powerful searches from a well-normalized database that has all possible constraints and indexes. Just occasionally, usually when you’re searching through text, or when you’ve inherited the work of someone whose major talents lie in other areas of life, you need other magic. This article is about some of the SQL magic that might be required.
The first type of search I’ll refer to is the ‘Brute-force’ search, where performance isn’t the primary concern. Even with the best of intentions, it is possible to find oneself searching whole blocks of text in a database, or searching across columns. Even as an experienced database developer, I find myself occasionally having to do ad-hoc searches for strings in databases, especially in large chunks of text. Anyone like me, who has to mop up after a database goes wrong, has a number of techniques in their metaphorical toolbox for getting to grips with the data.
The second type of search is the ‘search-engine’ type. When you need to implement clever text-searching facilities in a database, the Full-Text indexing features of SQL Server (FTS) are great, and should always be the first port of call. By ‘clever’ we mean finding two words near each other, matching any word derived from a particular root (for example ‘search’, ‘searching’, ‘searched’), finding several words with distinct weightings, or doing a fuzzy search for a word or phrase. There are a few rare, but interesting, exceptions when you actually need something more special-purpose and ‘hand-crafted’.
We’ll tackle both types of searches in this article, but not Full-Text Search (FTS). For more on FTS, see Full-Text Searches on Documents in FileTables, FTS features, and Rob Sheldon’s Full-Text Indexing Workbench.
Ad-hoc Brute-force searches of databases.
With no great sweat, it is possible to search for a phrase in a table or result using LIKE or PATINDEX. This is often called the ‘brute-force’ search. In many cases, it is surprisingly effective. I’ve gone into detail on how to use the wildcard features of PATINDEX in the PATINDEX Workbench, and given an example of its use in a migration script for a database refactoring that involved creating two new columns in an address table.
The blunt instrument – Search anywhere in a query result
For the crudest sort of search, we need to represent all the data from the result of a SQL query in string format. The easiest way of accomplishing this is to get the result of a SQL expression in XML form and then just keep the contents of every element in one big string. To do this, we’ll use this function:
IF OBJECT_ID(N'dbo.StripXMLTags') IS NOT NULL
  DROP FUNCTION dbo.StripXMLTags
GO
CREATE FUNCTION dbo.StripXMLTags (@xml XML) RETURNS VARCHAR(MAX)
AS
BEGIN
  DECLARE @TextWithoutTags VARCHAR(MAX);
  WITH TheXMLDocument (contents) AS
    (SELECT TheXML.node.query('.') FROM @xml.nodes('/') AS TheXML(node))
  SELECT @TextWithoutTags = contents.value('.', 'varchar(max)')
    FROM TheXMLDocument
  RETURN @TextWithoutTags
END
To show how quick and effective this is, we’ll search all columns of 20,000 rows in AdventureWorks’ person.contact table in under two seconds.
SELECT PATINDEX('%P.O Box 5%',
       dbo.StripXMLTags((SELECT * FROM person.contact FOR XML PATH)))
Note that we’re using the old 2008 version of AdventureWorks; please amend to taste if using a later version. Of course, if you were looking for a particular string, you could display the context of the first string found, if any (otherwise NULL):
WITH TheText (TheResult) AS
  (SELECT dbo.StripXMLTags((SELECT * FROM person.contact FOR XML PATH)))
SELECT CASE WHEN HitLocation > 0
            THEN SUBSTRING(TheResult, HitLocation - 20, 40)
            ELSE NULL END
  FROM (SELECT PATINDEX('%P.O Box 5%', TheResult) AS HitLocation, TheResult
          FROM TheText) f
If we wanted to know the locations of all the matches, we could put them into a table:
DECLARE @string VARCHAR(MAX), @len INT, @hit INT, @ii INT
DECLARE @FoundStrings TABLE (context VARCHAR(40), location INT)
SELECT @string = dbo.StripXMLTags((SELECT * FROM person.contact FOR XML PATH))
SELECT @len = LEN(@string), @ii = @len
WHILE @ii > 0
  BEGIN
    SELECT @hit = PATINDEX('%hank%', RIGHT(@string, @ii))
    IF @hit = 0 BREAK
    INSERT INTO @FoundStrings (context, location)
      SELECT SUBSTRING(@string, @len - @ii + @hit, 40), @len - @ii + @hit
    SELECT @ii = @ii - @hit
  END
SELECT * FROM @FoundStrings
As you can see, it is possible to search the results of any SQL expression you need. It is perfectly possible to automatically search every table in the database if you’re really keen on finding all occurrences of a string.
You’d want to turn this into a function for regular use:
IF OBJECT_ID(N'dbo.SearchWithinAQuery') IS NOT NULL
  DROP FUNCTION dbo.SearchWithinAQuery
GO
CREATE FUNCTION dbo.SearchWithinAQuery
  (@xml XML, @SearchPattern VARCHAR(80))
/*
searches the inner text within an XML document, returning
the context of every match. The searches are in PATINDEX format
Author: Phil Factor
Revision: 1.3
date: 8 Apr 2014
example:
 - code: SELECT * FROM dbo.SearchWithinAQuery(
           (SELECT * FROM person.contact FOR XML PATH), '%hank%')
*/
RETURNS @FoundStrings TABLE
  (
  context VARCHAR(40),
  location INT -- the substring location in which the string is found
  )
AS
BEGIN
  DECLARE @string VARCHAR(MAX), @len INT, @hit INT, @ii INT
  SELECT @string = dbo.StripXMLTags(@xml), @len = LEN(@string), @ii = @len
  WHILE @ii > 0 -- loop around until all instances are found
    BEGIN
      SELECT @hit = PATINDEX(@SearchPattern, RIGHT(@string, @ii))
      IF @hit = 0 BREAK -- if no more are to be found
      INSERT INTO @FoundStrings (context, location) -- insert each found item
        SELECT SUBSTRING(@string, @len - @ii + @hit, 40), @len - @ii + @hit
      SELECT @ii = @ii - @hit -- decrement the index from the right of the string
    END
  RETURN
END
You might want to use the same PATINDEX loop to search through free-form notes or text documents. Here we look through Bram Stoker’s Dracula for instances of the word ‘terror’. (I’ve included the text as a zip file at the bottom of the article – remember to change the path!)
DECLARE @string VARCHAR(MAX), @len INT, @hit INT, @ii INT
SELECT @string = BulkColumn -- read the text of a file
  FROM OPENROWSET(BULK 'D:\files\dracula.txt', SINGLE_BLOB) AS x -- read in the text of Dracula
DECLARE @FoundStrings TABLE (context VARCHAR(40), location INT) --- our results table variable
SELECT @len = LEN(@string), @ii = @len
WHILE @ii > 0 -- while there is some of the string to do
  BEGIN
    SELECT @hit = PATINDEX('%terror%', RIGHT(@string, @ii))
    IF @hit = 0 BREAK
    INSERT INTO @FoundStrings (context, location)
      SELECT SUBSTRING(@string, @len - @ii + @hit - 20, 40), @len - @ii + @hit
    SELECT @ii = @ii - @hit
  END
SELECT * FROM @FoundStrings
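The decrementing RIGHT()/PATINDEX loop is really a general ‘find every occurrence, with context’ scan. For readers who prefer to see the algorithm outside T-SQL, here is a minimal Python sketch of the same idea (the function name and sample text are mine, purely for illustration):

```python
# A sketch of the 'find every match with context' loop used above:
# scan the text, recording each hit with a window of surrounding context.
def find_all(text, needle, context=40):
    """Return (context, location) pairs for every occurrence of needle."""
    hits = []
    start = 0
    while True:
        pos = text.find(needle, start)
        if pos == -1:               # no more occurrences
            break
        # 1-based location, like PATINDEX; slice gives the context window
        hits.append((text[pos:pos + context], pos + 1))
        start = pos + 1             # resume just after this hit
    return hits

sample = "a terror by night, and the terror of kings"
print(find_all(sample, "terror", context=12))
```

The same shape appears in every variant in this article: find the next hit, record it, and resume the scan just past it.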
OK. That’s fine if you are just trying to find a particular string, but you might want something a bit more refined.
A sharper stick for free-form searches.
Sometimes, for example, you want a way of selecting customers from your database in a Google-style way, without having to bother with whether it is a surname, town, postcode or customer_ID that you are typing in.
Imagine that you, working for AdventureWorks, know only that your customer has the word ‘hacienda’ in the address, and you reckon that his postcode starts with ‘94’. With this code, you’ve homed in on Mr Wright from your 18,508 customers in half a second:
SELECT ContactString FROM
  (SELECT COALESCE(title + ' ', '') + firstname + ' ' + COALESCE(middleName + ' ', '')
          + lastname + COALESCE(' ' + suffix, '') + ' Email:' + emailAddress
          + ' Phone:' + Phone + ' Address: ' + AddressLine1
          + COALESCE(' ' + AddressLine2, '') + ', ' + City + ' ' + postalcode
          + ' State/Province:' + SP.Name + ' Country:' + SP.CountryRegionCode
            AS ContactString
     FROM person.contact c
       INNER JOIN Sales.Individual I ON C.ContactID = I.ContactID
       INNER JOIN Sales.CustomerAddress AS CA ON CA.CustomerID = I.CustomerID
       INNER JOIN Person.Address AS A ON A.AddressID = CA.AddressID
       INNER JOIN person.StateProvince SP ON A.StateProvinceID = SP.StateProvinceID
  ) f
WHERE ContactString LIKE '%Hacienda%94%'
Where databases are relatively small, you can do well with routines that search several columns of views, or derived tables. As soon as the amount of data grows, this becomes a bad idea. Note also that if you didn’t know the order of occurrence within the line, you’d have to do …
... WHERE ContactString LIKE '%Hacienda%94%'
       OR ContactString LIKE '%94%Hacienda%'
… which doubles the time it takes. We’ll revisit this problem later on after I’ve introduced you to some ‘search-engine’ techniques.
If you have a small database, though, the brute-force approach is quick and easy, but mostly just for ad-hoc work, since these queries tend to have a high CPU load.
The Inversion Technique for Searching databases
Here is a very simple technique that we’ve used for a long time with a succession of applications. It is usually called the ‘inverted’ or ‘inversion’ index technique (see Search engine indexing for a full discussion). Both Full-Text indexing and Apache Lucene use it, and MapReduce is an elaboration of the same simple technique. The technique scales very well: we’ve tested it indexing up to eight million documents, and even at that scale it gets results within three seconds if the indexes are right, though you’d expect most searches to be far quicker. In the example code I’ll use to demonstrate, it found strings in the King James Bible in 20 ms that took just under two seconds with a brute-force search. Because each search takes preparation, and any permanent inversion index takes considerable maintenance to cope with insertions, updates and deletes, it isn’t much good for ad-hoc work.
Firstly, we will need to create our inversion index. I generally use two tables: one has every word that exists within the data you want to search, and the other stores the location of every instance of every word. The term ‘location’ will mean different things in practice, depending on the problem you want to solve, but in these examples, I will use the publication (or document), the character offset of the word, and its sequence number (1..n).
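Before building it in SQL, the shape of those two tables can be sketched in a few lines of Python (an illustrative toy of my own, not the implementation used below): one structure holds each distinct word with its frequency, the other holds a (publication, location, sequence) triple for every instance of every word.

```python
# A sketch of the two structures described above: a 'word' table holding
# each distinct word and its frequency, and an 'occurrence' table holding
# (publication, location, sequence) for every instance of every word.
import re
from collections import defaultdict

def build_inversion_index(publications):
    """publications: {publication_id: text}. Returns (word_freq, occurrences)."""
    word_freq = defaultdict(int)     # the 'Word' table
    occurrences = defaultdict(list)  # the 'WordOccurence' table, keyed by word
    for pub_id, text in publications.items():
        for seq, match in enumerate(re.finditer(r"[A-Za-z0-9']+", text), start=1):
            word = match.group().lower()
            word_freq[word] += 1
            # 1-based character location, like the SQL version
            occurrences[word].append((pub_id, match.start() + 1, seq))
    return dict(word_freq), dict(occurrences)

freq, occ = build_inversion_index({1: "In the beginning was the Word"})
print(freq["the"])   # 2
print(occ["word"])   # [(1, 26, 6)]
```

A search then never scans the text at all; it looks words up in the occurrence structure, which is what makes the technique fast.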
First we’ll need a way of chopping text into its constituent words (the Map process). This is a routine that does the job, but it is slow. I’ve found nothing as quick in TSQL, but it is an ideal candidate for a CLR function. I’ve left in one of my quick unit-test routines just to show how I do such things.
/*--------------------*/
IF OBJECT_ID(N'IterativeWordChop') IS NOT NULL
  DROP FUNCTION IterativeWordChop
GO
CREATE FUNCTION [dbo].[IterativeWordChop]
/*
summary:   >
 This Table-valued function takes any text as a parameter
 and splits it into its constituent words, passing back
 the order in which they occurred and their location in the text.
Author: Phil Factor
Revision: 1.3
date: 2 Apr 2014
example:
 - code: SELECT * FROM IterativeWordChop('this tests stuff. Will it work?')
 - code: SELECT * FROM IterativeWordChop('this ------- tests it again; Will it work ...')
 - code: SELECT * FROM IterativeWordChop('Do we allow %Wildcards% like %x%?')
returns:   >
 Table of item (word), location, and sequence number.
**/
  (@string VARCHAR(MAX))
RETURNS @Results TABLE
  (
  Item VARCHAR(255),
  location INT,
  Sequence INT IDENTITY PRIMARY KEY
  )
AS
BEGIN
  DECLARE @Len INT, @Start INT, @end INT, @Cursor INT, @length INT
  SELECT @Cursor = 1, @Len = LEN(@string)
  WHILE @Cursor < @Len
    BEGIN
      SELECT @Start = PATINDEX('%[^A-Za-z0-9][A-Za-z0-9%]%',
                               ' ' + SUBSTRING(@string, @Cursor, 50)) - 1
      IF @Start < 0 BREAK
      SELECT @length = PATINDEX('%[^A-Z''a-z0-9-%]%',
                                SUBSTRING(@string, @Cursor + @Start + 1, 50) + ' ')
      INSERT INTO @Results (item, location)
        SELECT SUBSTRING(@string, @Cursor + @Start, @length), @Cursor + @Start
      SELECT @Cursor = @Cursor + @Start + @length + 1
    END
  RETURN
END
GO
--- sanity-check the word-chop routine to make sure it is more or less working
IF EXISTS (
  SELECT COUNT(*) FROM
    (SELECT item, location, sequence
       FROM IterativeWordChop('Hello. This tests it again; Will it work? ...')
     UNION ALL
     SELECT * FROM
       (VALUES ('Hello',1,1), ('This',8,2), ('tests',13,3), ('it',19,4),
               ('again',22,5), ('Will',29,6), ('it',34,7), ('work',37,8)
       ) CorrectResult (item, location, sequence)
    ) f -- our 'correct' values
  GROUP BY sequence, location, item
  HAVING COUNT(*) < 2
  )
  RAISERROR('fault in IterativeWordChop', 16, 1)
Now, armed with a way of chopping up text, we can create the basic tables. We have one table that contains all the words in our set of publications, and another table that maps the text, word by word, recording the location, sequence and publication. In this first example, we’re only going to have one publication.
--tidy up any existing data from a previous run.
IF OBJECT_ID(N'publication') IS NOT NULL DROP TABLE publication
GO
IF OBJECT_ID(N'WordOccurence') IS NOT NULL DROP TABLE WordOccurence
GO
IF OBJECT_ID(N'Word') IS NOT NULL DROP TABLE Word
GO
--create a table with the occurrence of each word in it.
CREATE TABLE Word
  (
  Item VARCHAR(255) NOT NULL PRIMARY KEY,
  frequency INT NOT NULL DEFAULT (0)
  )
GO
--create a table with the word-mapping in it.
CREATE TABLE WordOccurence
  (
  Item VARCHAR(255) NOT NULL,
  location INT NOT NULL,
  Sequence INT NOT NULL,
  Publication INT NOT NULL
  )
ALTER TABLE WordOccurence ADD PRIMARY KEY (item, sequence);
CREATE TABLE publication
  (
  Publication_ID INT PRIMARY KEY,
  TheText VARCHAR(MAX)
  )
Because of the way we want to insert the data, the foreign key gets added later.
Now we have got this far, we can read the entire King James Bible (which is there on the server’s disk – download it from the head of the article) into our publication table and index it up (map each word). This will hopefully give the algorithm a realistic volume of data to work on.
/* Let's put the text of a large book into a VARCHAR(MAX) variable */
DECLARE @LotsOfText VARCHAR(MAX)
SELECT @LotsOfText = BulkColumn -- 2005 onwards only for this code
  FROM OPENROWSET(BULK 'D:\files\kjb.txt', SINGLE_BLOB) AS x
INSERT INTO publication (Publication_ID, TheText) VALUES (1, @LotsOfText)
INSERT INTO WordOccurence (Item, location, Sequence, Publication)
  SELECT Item, location, Sequence, 1 FROM IterativeWordChop(@LotsOfText)
INSERT INTO word (item, frequency)
  SELECT item, COUNT(*) FROM WordOccurence GROUP BY item
ALTER TABLE WordOccurence ADD FOREIGN KEY (item) REFERENCES word
So we can begin testing this and trying various experiments. Just to kick things off, I’ve implemented a function that will find exact phrases, and show the instances where they occur in the text. This receives the words you wish to search for, and then determines how rare they are. It searches only for the rarest words in the phrase (you can adjust the proportion).
In practice, I only search for the rarest words in a string and then use a regex to do a brute-force search once the search has been narrowed down to the locations that have the rarest words together in them. This cuts out a lot of tiresome logic! In this case, we use a variant of relational division to find the locations where all the words occur, but it isn’t so much use for fuzzier searches. The fuzzier searches are too complicated to make for an entertaining read in an article, but it is a great exercise to think them up. For a start, you could do quite a lot of fuzzy searches using a LIKE join between your search words and the WORD table.
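Stripped of the SQL, the strategy reads roughly like this. The following Python sketch is a toy illustration under my own assumptions (a tiny regex word-chopper standing in for IterativeWordChop, and plain dictionaries standing in for the Word and WordOccurence tables): it looks up only the rarest words of the phrase, aligns their occurrences by relative sequence (the relational-division trick), and then confirms the exact phrase with a narrow brute-force check.

```python
import re
from collections import defaultdict

def index_text(text):
    """Tiny indexer: word -> frequency, and word -> [(location, sequence)]."""
    freq, occ = defaultdict(int), defaultdict(list)
    for seq, m in enumerate(re.finditer(r"[A-Za-z0-9']+", text), start=1):
        w = m.group().lower()
        freq[w] += 1
        occ[w].append((m.start() + 1, seq))   # 1-based location
    return freq, occ

def find_phrase(phrase, freq, occ, text, rare_count=2):
    """Find phrase by looking up only its rarest words, then verifying."""
    words = list(enumerate(re.findall(r"[A-Za-z0-9']+", phrase.lower())))
    # keep only the rarest few words of the phrase
    rare = sorted(words, key=lambda iw: freq.get(iw[1], 0))[:rare_count]
    aligned = defaultdict(list)
    for pos_in_phrase, word in rare:
        for location, seq in occ.get(word, []):
            # occurrences sitting at the right relative offsets share the
            # same 'base' sequence (the relational-division trick)
            aligned[seq - pos_in_phrase].append(location)
    hits = []
    for base, locations in aligned.items():
        if len(locations) == len(rare):
            # narrow brute-force check confirms the exact phrase is there
            start = max(min(locations) - 1 - len(phrase), 0)
            window = text[start : min(locations) - 1 + 2 * len(phrase)]
            if phrase.lower() in window.lower():
                hits.append(min(locations))
    return sorted(hits)

book = "And the evening and the morning were the first day."
freq, occ = index_text(book)
print(find_phrase("the first day", freq, occ, book))   # [42]
```

The returned location is that of the rarest matched word, much as the SQL FindString below returns MIN(WordOccurence.location) for the matched group.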
IF OBJECT_ID(N'FindString') IS NOT NULL
  DROP FUNCTION FindString
GO
CREATE FUNCTION [dbo].[FindString]
/*
summary:   >
 This Table-valued function takes text as a parameter
 and tries to find it in the WordOccurence table
Author: Phil Factor
Revision: 1.3
date: 10 Apr 2014
example:
 - code: SELECT * FROM FindString('Who is he that hideth counsel without knowledge?',1)
 - code: SELECT * FROM FindString(' And I turned to see the voice that spake with me. And being turned, I saw seven golden candlesticks',1)
returns:   >
 passes back the location where they were found, and the
 number of words matched in the string.
**/
  (@string VARCHAR(100), @Publication INT)
RETURNS @finds TABLE (location INT, hits INT)
AS
BEGIN
  DECLARE @WordsToLookUp TABLE
    (Item VARCHAR(255), location INT, Sequence INT PRIMARY KEY)
  DECLARE @wordCount INT, @searches INT
  -- chop the string into its constituent words, with the sequence
  INSERT INTO @WordsToLookUp (Item, location, Sequence)
    SELECT Item, location, Sequence FROM dbo.IterativeWordChop(@string)
  -- determine how many words and work out what proportion to search for
  SELECT @wordCount = COUNT(*) FROM @WordsToLookUp
  SELECT @searches = CASE WHEN @wordCount < 3 THEN @wordCount
                          ELSE 2 + (@wordCount / 2) END
  INSERT INTO @finds (location, hits)
    SELECT MIN(WordOccurence.location), COUNT(*) AS matches
      FROM WordOccurence
        INNER JOIN (SELECT TOP (@searches) word.item, searchterm.Sequence
                      FROM @WordsToLookUp searchterm
                        INNER JOIN Word ON searchterm.item = word.item
                      ORDER BY frequency
                   ) LessFrequentWords (item, Sequence)
          ON LessFrequentWords.item = WordOccurence.item
      WHERE publication = @Publication
      GROUP BY WordOccurence.Sequence - LessFrequentWords.Sequence
      HAVING COUNT(*) >= @searches
      ORDER BY COUNT(*) DESC
  RETURN
END
So now, armed with this, we can soon become ‘well versed’ in the bible, being able to quote chapter and verse for all sorts of biblical quotations.
DECLARE @originalText VARCHAR(MAX)
SELECT @originalText = TheText FROM publication WHERE publication_ID = 1
SELECT SUBSTRING(@originalText, location, 180), hits
  FROM FindString('Which is the earnest of our inheritance until the redemption of the purchased possession, unto the praise of his glory.', 1)
SELECT SUBSTRING(@originalText, location, 180), hits
  FROM FindString('antichrist', 1)
SELECT SUBSTRING(@originalText, location, 60), hits
  FROM FindString('unto Philadelphia, and unto Laodicea', 1)
I’ve already shown you how to find all instances of the word ‘Terror’ in the book ‘Dracula’ by brute-force. You can substitute Dracula for the KJB so as to use the inversion-index form, and try the two back-to-back. I’ve shown you a simple algorithm, but with everything laid out workbench-style, I guess that you’ll soon find ways of improving performance.
We’ll end with a more typical problem, and one that I’ve had to solve more than once! The application developers want the end users to locate customers or groups of customers by means of a simple Google-like search. We negotiate an interface in which a stored procedure receives a string containing words, and we return all the customers whose details contain all of those words. We allow wildcards.
We can start off by creating our inversion tables, very similar to the previous example:
--if the index tables exist, then delete them
IF OBJECT_ID(N'CustomerWordOccurence') IS NOT NULL
  DROP TABLE CustomerWordOccurence
IF OBJECT_ID(N'CustomerWord') IS NOT NULL
  DROP TABLE CustomerWord
GO
--the table that holds each word (and its frequency)
CREATE TABLE CustomerWord
  (
  Item VARCHAR(255) NOT NULL PRIMARY KEY,
  frequency INT NOT NULL DEFAULT (0)
  )
GO
--the table that maps customer data, holding each word, and its location
CREATE TABLE CustomerWordOccurence
  (
  Item VARCHAR(255) NOT NULL,
  location INT NOT NULL,
  Sequence INT NOT NULL,
  Publication INT NOT NULL
  )
ALTER TABLE CustomerWordOccurence ADD PRIMARY KEY (item, sequence, publication);
This time, we create a text version of each row along with its primary key, and map that text. We use a very similar technique to the previous example, but each row becomes its own ‘publication’:
/* first we map the text value of each row into the CustomerWordOccurence
table, recording the location of each word */
INSERT INTO CustomerWordOccurence
  SELECT item, location, sequence, contactID FROM
    (SELECT c.contactID,
            COALESCE(title + ' ', '') + firstname + ' ' + COALESCE(middleName + ' ', '')
            + lastname + COALESCE(' ' + suffix, '') + ' Email:' + emailAddress
            + ' Phone:' + Phone + ' Address: ' + AddressLine1
            + COALESCE(' ' + AddressLine2, '') + ', ' + City + ' ' + postalcode
            + ' State/Province:' + SP.Name + ' Country:' + SP.CountryRegionCode
              AS ContactString
       FROM AdventureWorks.person.contact c
         INNER JOIN AdventureWorks.Sales.Individual I ON C.ContactID = I.ContactID
         INNER JOIN AdventureWorks.Sales.CustomerAddress AS CA
           ON CA.CustomerID = I.CustomerID AND AddressTypeID = 2
         INNER JOIN AdventureWorks.Person.Address AS A ON A.AddressID = CA.AddressID
         INNER JOIN AdventureWorks.person.StateProvince SP
           ON A.StateProvinceID = SP.StateProvinceID
    ) f
    CROSS APPLY dbo.IterativeWordChop(ContactString)
/* then we store each unique word, storing its frequency
(we don't use the frequency in this example code) */
INSERT INTO CustomerWord (item, frequency)
  SELECT item, COUNT(*) FROM CustomerWordOccurence GROUP BY item
ALTER TABLE CustomerWordOccurence ADD FOREIGN KEY (item) REFERENCES CustomerWord
Now we can define our procedure that returns a conventional result corresponding to the AdventureWorks customers (sorry, AdventureWorks 2008 only). In this case we are allowing wildcards, so we use the CustomerWord table to speed up that search (it takes a fifth of the time compared with an inner join ON LIKE).
/* and we create the stored procedure that we've agreed with the application programmers. */
IF OBJECT_ID(N'CustomerAddress') IS NOT NULL
  DROP PROCEDURE CustomerAddress
GO
CREATE PROCEDURE CustomerAddress
/*
summary:   >
 Select rows from AdventureWorks 2008 based on the lookup of the words in
 the CustomerWordOccurence and CustomerWord tables, in which are stored
 every word from the data, along with the contactID of the contact
Author: Phil Factor
Revision: 1.0
date: 14 Apr 2014
example:
 - code: Execute CustomerAddress 'Andrew california'
 - code: Execute CustomerAddress '%x% Birmingham england'
 - code: Execute CustomerAddress '%orge% british columbia'
returns:   >
**/
  @SearchString VARCHAR(200) = '%%'
AS
BEGIN
  DECLARE @WordsToLookUp TABLE
    (Item VARCHAR(255), location INT, Sequence INT PRIMARY KEY)
  DECLARE @WordCount INT
  INSERT INTO @WordsToLookUp (Item, location, Sequence)
    SELECT Item, location, Sequence FROM dbo.IterativeWordChop(@SearchString)
  SELECT @WordCount = COUNT(*) FROM @WordsToLookUp
  SELECT c.ContactID, NameStyle, Title, FirstName, MiddleName, LastName, Suffix,
         EmailAddress, EmailPromotion, Phone, AddressLine1, AddressLine2, City,
         PostalCode, SP.Name AS StateProvince
    FROM AdventureWorks.person.contact c -- select your result
      INNER JOIN AdventureWorks.Sales.Individual I ON C.ContactID = I.ContactID
      INNER JOIN AdventureWorks.Sales.CustomerAddress AS CA
        ON CA.CustomerID = I.CustomerID AND AddressTypeID = 2
      INNER JOIN AdventureWorks.Person.Address AS A ON A.AddressID = CA.AddressID
      INNER JOIN AdventureWorks.person.StateProvince SP
        ON A.StateProvinceID = SP.StateProvinceID
    WHERE c.contactID IN
      (SELECT publication FROM -- where you have at least one hit of each
         (SELECT publication, WLU.Sequence
            FROM CustomerWord CW
              INNER JOIN @WordsToLookUp WLU ON CW.item LIKE WLU.item
              INNER JOIN CustomerWordOccurence CWO ON CW.item = CWO.item
            GROUP BY publication, WLU.Sequence
         ) AllWordMatches
       GROUP BY publication
       HAVING COUNT(*) = @WordCount)
END
And we can try a few searches out to see what customers AdventureWorks has got.
Execute CustomerAddress 'taylor palo alto'
Execute CustomerAddress 'Andrew california'
Execute CustomerAddress 'rain drop'
Execute CustomerAddress '%x% Birmingham england'
Execute CustomerAddress '%orge% british columbia'
Naturally, in real life, you would need to update the inversion tables every time you make a change to the database. The logic is simple: you just delete all entries in the mapping table that relate to the changed rows and insert the updated mapping. I’ll not show you how to do that in AdventureWorks, though, as it is pretty routine stuff.
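For what it’s worth, that delete-then-reinsert logic can be sketched in a few lines. This is a hypothetical Python toy of my own that maintains only the occurrence mapping; a real version would also have to keep the word-frequency table in step:

```python
# A sketch of the maintenance logic described above: when a row (publication)
# changes, delete all its entries from the mapping and re-index just that row.
import re
from collections import defaultdict

class InversionIndex:
    def __init__(self):
        # word -> list of (publication, location, sequence)
        self.occurrences = defaultdict(list)

    def remove(self, pub_id):
        """Delete every mapping entry that relates to the changed row."""
        for word in list(self.occurrences):
            self.occurrences[word] = [
                o for o in self.occurrences[word] if o[0] != pub_id]
            if not self.occurrences[word]:
                del self.occurrences[word]

    def add(self, pub_id, text):
        """(Re-)insert the mapping for one row."""
        for seq, m in enumerate(re.finditer(r"[A-Za-z0-9']+", text), start=1):
            self.occurrences[m.group().lower()].append(
                (pub_id, m.start() + 1, seq))

    def update(self, pub_id, new_text):
        self.remove(pub_id)   # the whole trick: delete, then re-insert
        self.add(pub_id, new_text)

ix = InversionIndex()
ix.add(42, "Mr Wright, Hacienda Drive")
ix.update(42, "Mr Wright, Elm Street")
print("hacienda" in ix.occurrences)   # False: stale entries are gone
```

In SQL the same effect comes from a DELETE ... WHERE Publication = @id followed by the usual INSERT from the word-chop function, typically fired from the application or a trigger.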
So here we have it. I’ve introduced a number of different techniques that can be used to search through data in SQL Server databases. With all the examples, the more useful they are, the more effort is needed to write them, or to maintain the inversion/mapping tables. In the examples, I’ve tried to keep the code simple and, with the inversion indexes in particular, there are many ingenious optimizations and tricks you can add. I’ve seen versions, for example, that allow complex expressions in the query terms, and attempts to order by relevance. There are all manner of proximity tricks and wildcard techniques. I hope you’ll be moved to try some out!
See Also
- SQL Search: Search for SQL (code, not Data) in your databases, for free
- Apache Lucene Search
- Understanding Full-Text Indexing in SQL Server by Robert Sheldon
- SQL Server Full Text Search Language Features by Hilary Cotter
- SQL Server Full Text Search Language Features, Part 2 by Hilary Cotter
- Full Text Searches on Documents in FileTables by Feodor Georgiev
- Full-Text Indexing Workbench by Robert Sheldon
- Exploring Semantic Search Key Term Relevance by Joe Sack