Searching for Strings in SQL Server Databases


Most of the time, you will get adequately powerful searches from a well-normalized database that has all possible constraints and indexes. Just occasionally, usually when you’re searching through text, or you’ve inherited the work of someone whose major talents lie in other areas of life, you need other magic. This article is about some of the SQL magic that might be required.

The first type of search I’ll refer to is the ‘Brute-force’ search, where performance isn’t the primary concern. Even with the best of intentions, it is possible to find oneself searching whole blocks of text in a database, or searching across columns. Even as an experienced database developer, I find myself occasionally having to do ad-hoc searches for strings in databases, especially in large chunks of text. Anyone like me, who has to mop up after a database goes wrong, has a number of techniques in their metaphorical toolbox for getting to grips with the data.

The second type of search is the ‘search-engine’ type. When you need to implement clever text-searching facilities in a database, the Full-Text indexing features of SQL Server (FTS) are great, and should always be the first port of call. By ‘clever’ we mean finding two words near each other, matching any word derived from a particular root (for example ‘search’, ‘searching’, ‘searched’), finding several words with distinct weightings, or doing a fuzzy search for a word or phrase. There are a few rare, but interesting, exceptions when you actually need something more special-purpose and ‘hand-crafted’.

We’ll tackle both types of searches in this article, but not Full-Text Search (FTS). For more on FTS, see Full-Text searches on documents in Filetables, FTS features, and Rob Sheldon’s Full-Text Indexing Workbench.

Ad-hoc Brute-force searches of databases

With no great sweat, it is possible to search for a phrase in a table or result using LIKE or PATINDEX. This is often called the ‘brute-force’ search. In many cases, it is surprisingly effective. I’ve gone into detail on how to use the wildcard features of PATINDEX in the PATINDEX Workbench, and give an example of its use in a migration script for a database refactoring that involved creating two new columns in an address table.

The blunt instrument – Search anywhere in a query result

For the crudest sort of searches, we need to represent all the data from the result of a SQL query in string-format. The easiest way of accomplishing this is to get the result of a SQL Expression in XML form and then just keep the contents of every element in one big string. To do this, we’ll use this function.
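A minimal sketch of such a function (the name and details here are my illustration rather than a definitive implementation): the XQuery value() method with the ‘.’ path concatenates the text content of every element into one string.

```sql
-- Hypothetical helper: flatten an XML fragment into one big searchable string.
-- The '.' accessor concatenates the text nodes of every element in the XML.
CREATE FUNCTION dbo.AllContents (@SomeXML xml)
RETURNS nvarchar(MAX)
AS
BEGIN
  DECLARE @Result nvarchar(MAX) = @SomeXML.value('.', 'nvarchar(MAX)');
  RETURN @Result;
END
GO
```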

To show how quick and effective this is, we’ll search all columns of the 20,000 rows in AdventureWorks’ Person.Contact table in under two seconds.

Note that we’re using the old 2008 version of AdventureWorks; please amend to taste if using a later version. Of course, if you were looking for a particular string, you could display the context of the first string found, if any (otherwise NULL).
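Along these lines (a self-contained sketch; the search string and surrounding context length are illustrative), you can flatten the whole result with FOR XML and show the first match in context:

```sql
-- Flatten the whole result of a query to one string, then show the
-- first match with forty characters of context either side (or NULL)
DECLARE @SearchFor nvarchar(100) = N'Gottshall';  -- an illustrative search term
DECLARE @xml xml = (SELECT * FROM Person.Contact FOR XML PATH(''), TYPE);
DECLARE @AllText nvarchar(MAX) = @xml.value('.', 'nvarchar(MAX)');
DECLARE @hit int = CHARINDEX(@SearchFor, @AllText);
SELECT CASE WHEN @hit > 0
            THEN SUBSTRING(@AllText,
                   CASE WHEN @hit > 40 THEN @hit - 40 ELSE 1 END,
                   80 + LEN(@SearchFor))
       END AS Context;  -- NULL when the string isn't there at all
```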

If we wanted to know the location of all the strings, then we might put the matches into a table.
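A sketch of that idea, using CHARINDEX with a start offset to walk through every occurrence (names and the search term are illustrative):

```sql
-- Collect the location of every match in a table variable
DECLARE @SearchFor nvarchar(100) = N'Gottshall';
DECLARE @xml xml = (SELECT * FROM Person.Contact FOR XML PATH(''), TYPE);
DECLARE @AllText nvarchar(MAX) = @xml.value('.', 'nvarchar(MAX)');
DECLARE @Matches TABLE (MatchNo int IDENTITY(1,1), Location int);
DECLARE @pos int = CHARINDEX(@SearchFor, @AllText);
WHILE @pos > 0
  BEGIN
    INSERT INTO @Matches (Location) VALUES (@pos);
    -- resume the search just past the previous hit
    SET @pos = CHARINDEX(@SearchFor, @AllText, @pos + 1);
  END;
SELECT MatchNo, Location FROM @Matches;
```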

As you can see, it is possible to search the results of any SQL expression you need.  It is perfectly possible to automatically search every table in the database if you’re really keen on finding all occurrences of a string.

You’d want to turn it into a function for regular use.

You might want to use the same ‘PATINDEX’ loop to search through free-form notes or text documents. Here we search through Bram Stoker’s Dracula for instances of the word ‘terror’. (I’ve included the text as a zip file at the bottom of the article – remember to change the path!)
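In outline (the file path is illustrative; OPENROWSET with the BULK provider needs ad-hoc distributed queries to be allowed):

```sql
-- Read the book from the server's disk (change the path to wherever
-- you unzipped it), then loop through every match, keeping some context
DECLARE @Dracula nvarchar(MAX);
SELECT @Dracula = BulkColumn
  FROM OPENROWSET(BULK 'C:\data\dracula.txt', SINGLE_CLOB) AS Document;

DECLARE @Hits TABLE (MatchNo int IDENTITY(1,1), Location int, Context nvarchar(100));
DECLARE @pos int = CHARINDEX('terror', @Dracula);
WHILE @pos > 0
  BEGIN
    INSERT INTO @Hits (Location, Context)
      VALUES (@pos,
              SUBSTRING(@Dracula,
                CASE WHEN @pos > 40 THEN @pos - 40 ELSE 1 END, 90));
    SET @pos = CHARINDEX('terror', @Dracula, @pos + 1);
  END;
SELECT * FROM @Hits;
```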

OK. That’s fine if you are just trying to find a particular string, but you could want something a bit more refined.

A sharper stick for free-form searches

Sometimes, for example, you want a way of selecting customers from your database in a Google-style way, without having to bother with whether it is a surname, town, postcode or customer_ID that you are typing in.

Imagine that you, working for AdventureWorks, know only that your customer has the word ‘hacienda’ in the address, but you reckon that his postcode starts with ‘94’. With this code, you’ve homed in on Mr Wright from your 18,508 customers in half a second.
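A sketch of the sort of query involved, assuming the old AdventureWorks schema’s denormalized Sales.vIndividualCustomer view (column names from that view; the single pattern assumes ‘hacienda’ comes before the postcode in the concatenated string):

```sql
-- Google-style search: concatenate the columns you care about into one
-- string per row, then match a single wildcard pattern against it
SELECT CustomerID, FirstName, LastName, AddressLine1, City, PostalCode
  FROM Sales.vIndividualCustomer
  WHERE FirstName + ' ' + LastName + ' '
        + AddressLine1 + ' ' + City + ' ' + PostalCode
        LIKE '%hacienda%94%';
```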

Where databases are relatively small, you can do well with routines that search several columns of views or derived tables. As soon as the amount of data grows, this becomes a bad idea. Note also that if you didn’t know the order of occurrence within the line, you’d have to do …

… which doubles the time it takes. We’ll revisit this problem later on after I’ve introduced you to some ‘search-engine’ techniques.

If you have a small database, though, the brute-force approach is quick and easy, but mostly just for ad-hoc work, since these queries tend to have a high CPU load.  

The Inversion Technique for Searching databases

Here is a very simple technique, which we’ve used for a long time with a succession of applications. It is usually called the ‘Inverted’ or ‘Inversion’ index technique (see Search engine indexing for a full discussion). Both Full-Text indexing and Apache Lucene use it, and MapReduce is an elaboration of the same simple technique. The technique scales very well; we’ve tested it indexing up to eight million documents. Even when searching through eight million documents, it gets results within three seconds if the indexes are right, but you’d expect most searches to be far quicker. In the example code I’ll use to demonstrate, it found strings in the King James Bible in 20 ms that took just under two seconds with a brute-force search. Because each search takes preparation, and any permanent inversion index takes considerable maintenance to cope with inserts, updates and deletes, it isn’t much good for ad-hoc work.

Firstly, we will need to create our inversion index. I generally use two tables: one that has every word that exists within the data you want to search, and another that stores the location of every instance of every word. The term ‘location’ will mean different things in practice, depending on the problem you want to solve, but in these examples I will use the publication (or document), the character offset, and the sequence (1..n).

First we’ll need a way of chopping text into its constituent words (the ‘Map’ process). This is a routine that does the job, but it is slow. I’ve found nothing as quick in TSQL, but it is an ideal candidate for a CLR function. I’ve left in one of my quick unit-test routines just to show how I do such things.
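In outline, such a splitter might look like this multi-statement table-valued function (a slow, illustrative sketch rather than the original; it treats runs of letters, digits and embedded apostrophes as words):

```sql
-- Illustrative splitter: returns every word with its character offset
-- and its sequence number (1..n) within the text. Pure T-SQL, so slow;
-- a SQLCLR version would be much faster.
CREATE FUNCTION dbo.WordChop (@Text nvarchar(MAX))
RETURNS @Words TABLE (sequence int IDENTITY(1,1), location int, word nvarchar(50))
AS
BEGIN
  DECLARE @pos int = 1, @start int, @len int = LEN(@Text);
  WHILE @pos <= @len
    BEGIN
      -- skip over anything that isn't a letter or digit
      WHILE @pos <= @len
            AND SUBSTRING(@Text, @pos, 1) NOT LIKE '[a-zA-Z0-9]'
        SET @pos = @pos + 1;
      SET @start = @pos;
      -- consume the word (letters, digits, and embedded apostrophes)
      WHILE @pos <= @len
            AND SUBSTRING(@Text, @pos, 1) LIKE '[a-zA-Z0-9'']'
        SET @pos = @pos + 1;
      IF @pos > @start
        INSERT INTO @Words (location, word)
          VALUES (@start, SUBSTRING(@Text, @start, @pos - @start));
    END;
  RETURN;
END
GO
```

A quick smoke-test in the spirit of the article: `SELECT * FROM dbo.WordChop(N'Now is the winter')` should return four rows, with locations 1, 5, 8 and 12.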

Now, armed with a way of chopping up text, we can create the basic tables. We have one table that contains all the words in our set of publications, and another table that maps the text, word by word, recording the location, sequence and publication. In this first example, we’re only going to have one publication.
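Something along these lines (all names are illustrative):

```sql
-- The documents themselves
CREATE TABLE dbo.Publication (
  Publication_ID int IDENTITY(1,1) PRIMARY KEY,
  Title nvarchar(100) NOT NULL,
  Contents nvarchar(MAX) NOT NULL
);
-- Every distinct word in the corpus...
CREATE TABLE dbo.Words (
  Word_ID int IDENTITY(1,1) PRIMARY KEY,
  Word nvarchar(50) NOT NULL UNIQUE
);
-- ...and the location of every instance of every word
CREATE TABLE dbo.WordInstance (
  WordInstance_ID int IDENTITY(1,1) PRIMARY KEY,
  Word_ID int NOT NULL,         -- foreign key added after the bulk load
  Publication_ID int NOT NULL,  -- which document the instance belongs to
  Location int NOT NULL,        -- character offset within that document
  Sequence int NOT NULL         -- word number (1..n) within that document
);
CREATE INDEX WordInstanceWord
  ON dbo.WordInstance (Word_ID) INCLUDE (Publication_ID, Location);
```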

Because of the way we want to insert the data, the foreign key gets added later.

Now we have got this far, we can read the entire King James Bible (which is there on the server’s disk; download it from the head of the article) into our publication table and index it (map each word). This will hopefully give us enough text to try the algorithm against a realistic size of data.
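In outline, assuming the illustrative dbo.Publication, dbo.Words and dbo.WordInstance tables, and a dbo.WordChop splitter returning (sequence, location, word):

```sql
-- Load the text from disk (change the path!) into the publication table
INSERT INTO dbo.Publication (Title, Contents)
  SELECT 'King James Bible', BulkColumn
    FROM OPENROWSET(BULK 'C:\data\kjv.txt', SINGLE_CLOB) AS Document;

-- Record every distinct word...
INSERT INTO dbo.Words (Word)
  SELECT DISTINCT LOWER(w.word)
    FROM dbo.Publication p
    CROSS APPLY dbo.WordChop(p.Contents) w;

-- ...then map every instance of every word
INSERT INTO dbo.WordInstance (Word_ID, Publication_ID, Location, Sequence)
  SELECT wd.Word_ID, p.Publication_ID, w.location, w.sequence
    FROM dbo.Publication p
    CROSS APPLY dbo.WordChop(p.Contents) w
    JOIN dbo.Words wd ON wd.Word = LOWER(w.word);
```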

So we can begin testing this and trying various experiments. Just to kick things off, I’ve implemented a function that will find exact phrases, and show the instances where they occur in the text. It receives the words you wish to search for, and then determines how rare they are. It searches only for the rarest words in the phrase (you can adjust the proportion).
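A simplified sketch of the approach, assuming the illustrative inversion tables and splitter described above: pick the rarest word in the phrase, then brute-force confirm the whole phrase in the neighbourhood of each instance of that word.

```sql
-- Sketch: find a phrase by searching the index only for its rarest word,
-- then confirming the full phrase around each candidate location
CREATE PROCEDURE dbo.FindPhrase @Phrase nvarchar(200)
AS
BEGIN
  DECLARE @Word_ID int;
  -- which word of the phrase has the fewest instances?
  SELECT TOP 1 @Word_ID = wd.Word_ID
    FROM dbo.WordChop(@Phrase) p
    JOIN dbo.Words wd ON wd.Word = LOWER(p.word)
    ORDER BY (SELECT COUNT(*) FROM dbo.WordInstance wi
               WHERE wi.Word_ID = wd.Word_ID);
  -- check the neighbourhood of each instance for the whole phrase
  SELECT wi.Publication_ID, wi.Location
    FROM dbo.WordInstance wi
    JOIN dbo.Publication pub ON pub.Publication_ID = wi.Publication_ID
   WHERE wi.Word_ID = @Word_ID
     AND PATINDEX('%' + @Phrase + '%',
           SUBSTRING(pub.Contents,
             CASE WHEN wi.Location > 200 THEN wi.Location - 200 ELSE 1 END,
             400 + LEN(@Phrase))) > 0;
END
GO
```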

In practice, I only search for the rarest words in a string and then use a regex to do a brute-force search once the search has been narrowed down to the locations that have the rarest words together in them. This cuts out a lot of tiresome logic! In this case, we use a variant of relational division to find the locations where all the words occur, but it isn’t so much use for fuzzier searches. The fuzzier searches are too complicated to make for an entertaining read in an article, but it is a great exercise to think them up. For a start, you could do quite a lot of fuzzy searches using a LIKE join between your search words and the WORD table.

So now, armed with this, we can soon become ‘well versed’ in the bible, being able to quote chapter and verse for all sorts of biblical quotations.

I’ve already shown you how to find all instances of the word ‘terror’ in Dracula by brute force. You can substitute Dracula for the KJB so as to use the inversion-index form, and try the two back-to-back. I’ve shown you a simple algorithm, but with everything laid out workbench-style, I guess that you’ll soon find ways of improving performance.

 

We’ll end with a more typical problem, and one that I’ve had to solve more than once! The application developers want the end users to locate customers or groups of customers by means of a simple Google-like search. We negotiate an interface where a stored procedure receives a string containing words, and we return all the customers that have all those words in them. We allow wildcards.

We can start off by creating our inversion tables: very similar to the previous example.

This time, we can create a text version of each row along with a primary key, and map that text. We use a very similar technique to the previous example, but each row becomes its own ‘publication’.
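A sketch of the mapping step, assuming a word-splitting function dbo.WordChop returning (sequence, location, word) and the old AdventureWorks schema’s Sales.vIndividualCustomer view (all names illustrative):

```sql
-- Each customer row becomes its own 'publication': flatten the columns
-- we want searchable into one string per customer and map its words
CREATE TABLE dbo.CustomerWord (
  Word nvarchar(50) NOT NULL,
  CustomerID int NOT NULL,
  PRIMARY KEY (Word, CustomerID)
);
INSERT INTO dbo.CustomerWord (Word, CustomerID)
  SELECT DISTINCT LOWER(w.word), c.CustomerID
    FROM Sales.vIndividualCustomer c
    CROSS APPLY dbo.WordChop(
        c.FirstName + ' ' + c.LastName + ' ' + c.AddressLine1
        + ' ' + c.City + ' ' + c.PostalCode) w;
```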

Now we can define our procedure that returns a conventional result corresponding to the AdventureWorks customers (sorry, AdventureWorks 2008 only). In this case we are allowing wildcards, so we use the CustomerWord table to speed up that search (it takes a fifth of the time compared with an inner join ON LIKE).
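A sketch of the agreed interface, assuming the illustrative dbo.CustomerWord mapping table: relational division, expressed as a double NOT EXISTS, returns only those customers for whom no search term is left unmatched. I’ve used STRING_SPLIT for brevity, which needs SQL Server 2016 or later; earlier versions would need a splitter function instead.

```sql
-- Space-separated terms, wildcards allowed; return every customer
-- that matches ALL the terms (relational division via NOT EXISTS)
CREATE PROCEDURE dbo.SearchCustomers @SearchTerms nvarchar(200)
AS
BEGIN
  SELECT c.CustomerID, c.FirstName, c.LastName, c.AddressLine1,
         c.City, c.PostalCode
    FROM Sales.vIndividualCustomer c
   WHERE NOT EXISTS (          -- no term may be left unmatched...
          SELECT *
            FROM STRING_SPLIT(@SearchTerms, ' ') term
           WHERE term.value <> ''
             AND NOT EXISTS (  -- ...by this customer's word list
                  SELECT *
                    FROM dbo.CustomerWord cw
                   WHERE cw.CustomerID = c.CustomerID
                     AND cw.Word LIKE LOWER(term.value)));
END
GO
```

So `EXEC dbo.SearchCustomers 'hacienda 94%'` would find the customers whose word list contains ‘hacienda’ exactly and any word starting with ‘94’.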

And we can try a few searches out to see what customers AdventureWorks has got.

Naturally, in real life, you would need to update the inversion tables every time you change the data. The logic is simple: you just delete all entries in the mapping table that relate to the changed rows and insert the updated mapping. I’ll not show you how to do that in AdventureWorks, though, as it is pretty routine stuff.

 

So here we have it. I’ve introduced a number of different techniques that can be used to search through data in SQL Server databases. With all the examples, the more useful they are, the more effort is involved in writing them or in maintaining the inversion/mapping tables. In the examples, I’ve tried to keep the code simple and, with the inversion indexes in particular, there are many ingenious optimizations and tricks you can add. I’ve seen versions, for example, that allow complex expressions in the query terms, and attempts to order by relevance. There are all manner of proximity tricks and wildcard techniques. I hope you’ll be moved to try some out!



About the author

Phil Factor


Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 40 years of experience with database-intensive applications. Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.
