The Database From Hell
This article is about two subjects: a database administered by a friend and colleague that we affectionately call the Database From Hell, and Red Gate’s forthcoming SQL Data Generator, currently in beta release. Before we get to The Database From Hell itself, a brief word about SQL Data Generator:
Data Generator, as its name implies, is all about manufacturing data to test performance, scalability and validity. Beta release 1 already boasts some wonderful features, and more are promised for the next round.
The 65-million-row, 600-column table
The Database From Hell is a marketing database whose real-world implementation comprises a single table with 600+ columns and 65 million rows. Its data derives from dozens of questionnaires. This single table stores respondent information (name, address, zip code, etc.) along with the answers to those questionnaires. Most of the 600+ columns are char(1) or char(2); some are dates recorded as character data. The original data comes from a mainframe and was gathered from dozens of sources.
Typically, a client wants a list of people who match some combination of attributes. For example, the client might want to launch a campaign whose target is the group of black or Hispanic Americans who are married, own a BMW, have purchased an HDTV, have more than two children, and make more than $90,000 per year. The variations are almost endless. All these columns contain codes. All are nullable.
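To make the requirement concrete, such a request might translate into something like the following sketch; the table name, column names, and code values are my own inventions, since the real schema isn’t reproduced here:

```sql
-- Hypothetical sketch of a typical client request; the table, columns,
-- and code values are invented for illustration.
SELECT RespondentID, FirstName, LastName, ZipCode
FROM dbo.MarketingData
WHERE Ethnicity IN ('B', 'H')   -- hypothetical codes: black or Hispanic
  AND MaritalStatus = 'M'       -- married
  AND OwnsBMW = 'Y'
  AND OwnsHDTV = 'Y'
  AND ChildCount > '2'          -- more than two children (stored as a code)
  AND IncomeBand >= 'F';        -- hypothetical code for $90,000+ per year
```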
My own version differs somewhat from the real-world database. For one thing, I broke the single table into three, called Respondents, Questionnaires and RespondentAnswers. The latter has foreign keys pointing to the former two tables. I also cut down the number of columns, and changed all the char(1) columns to bit columns, primarily to save disk space.
Even in my smaller version, indexing poses a problem. My friend’s approach has been to create a covering index for each new query posed by his client. (A covering index includes all the columns of interest, plus the primary key.) This raises a potential problem: a very detailed query might involve more columns than an index can hold (SQL Server caps an index key at 16 columns and 900 bytes). In addition, every new query incurs the time it takes to create its covering index.
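Under the same hypothetical schema as the query above, one such per-query covering index might look like this sketch; the index name and columns are assumptions, not the real design:

```sql
-- One covering index built for one specific campaign query: the key holds
-- the columns of interest plus the primary key. All names are hypothetical.
CREATE NONCLUSTERED INDEX IX_Campaign_BMW_HDTV
ON dbo.MarketingData
    (Ethnicity, MaritalStatus, OwnsBMW, OwnsHDTV, ChildCount, IncomeBand, RespondentID);
```

The next client request, touching a different combination of columns, would need another such index, which is exactly the maintenance burden described above.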
The Test version with only 20 million rows
I don’t have sufficient disk space to create the full table on my development machine, so I created a smaller version, with only 40-something columns and 20 million rows. But I had an idea for a unique approach to this problem, and I wanted a proof of concept.
Data Generator to the rescue
Data Generator to the rescue! It can build the required rows very quickly, and then you’re in a position to answer some important questions.
I should point out that Data Generator is actually a collection of data generators, each tailored to a specific kind of data. For each column in your table(s), you assign a generator. As shipped, the product includes the following generators, grouped by domain and purpose:
| Group | Name | Description / Examples |
|-------|------|-------------------------|
| Personal | Color | AliceBlue, AntiqueWhite |
| | Contact Title | Marketing Manager, Owner, Partner |
| | Email | Generated email address |
| | Email (internet-safe) | A1.example.com |
| | Extension | 1234 |
| | First Name | Arthur, Alice, Brian, Betty |
| | First Name (female) | Ava, Bette, Charlene |
| | First Name (male) | Allan, Boris, Charles |
| | Initials | F., B., HR |
| | Last Name | Smith, Browne, Clarke, Davis |
| | Name (full) | Leonard di Capreo |
| | Nick Name | Chris, Tommy, JLo |
| | Phone Number | (310) 555-1212, 212-4365 |
| | Social Security Number | 950-95-8060 |
| | Title | Mr., Mrs., Ms., Dr. |
| Geographical | Address Line (Street #) | 123 W. Second Ave. |
| | Address Line 1 | Suite 208, Seventh Floor, Bldg. A |
| | Country | Long name of country (Canada) |
| | Country Code | 3-letter ISO code (USA) |
| | Country Code Short | 2-letter ISO code (US) |
| | P.O. Box | P.O. Box 139, PO Box 140 |
| | Region | IK-MU, NC-JY |
| | US City | New York, Los Angeles, Chicago |
| | US State (long name) | California, New York |
| | US State code | CA, NY, KT |
| | ZIP Code | 83243, 96112 |
| | ZIP+4 Code | 96212-1234 |
| Generic | File List | Generates a list from a given file |
| | Regex Generator | Used for formatted data |
| | SQLColumn | Reads data from a SELECT statement |
| | String Generator | Generates strings |
| Shopping | Product Name | Deluxe Frommelgrammet |
| Payment | Credit Card Date | 09/08 |
| | Credit Card Issue Number | 3, 1, 5, 7, 6 |
| | Credit Card Number | Generated 16-digit number |
| | Currency Code | USD, CAN, EUR |
| Business | 5-digit IDs | 67815, 98765 |
| | Account Number | RET36662 |
| | Company Name | Sipzapon Corporation, ABC Inc. |
| | Department | Sales, Service, Finance |
| | Folder (DOS) | C:\Program Files\Red Gate |
| | Folder (Web) | http://pen.net/index.html |
| | Sector | Aerospace, Horticulture, Transport |
| | Source of Leads | Sales, Word of Mouth, Referral |
| | URL | http://www.artfulsoftware.com/default.aspx |
Despite this variety, if you’re anything like me, you’ll immediately spot holes in the list. For example, Canadian Social Insurance Numbers are formatted differently from US Social Security Numbers. Fortunately, you can create your own custom generators and add them to the list. For more information, see David Connell’s article and C# project, “How to Write a Generator for SQL Data Generator 1.0 Beta”.
Upon loading the program, the first thing you do is establish your connection to a server and a database. Then Data Generator creates a new Project called Project1. (In my experience, this leads to unnecessary keystrokes. If you mean to use an existing project, you must select File | Open, then respond to the dialog that asks if you want to save the new project, then you can open your old project. I would prefer the program to run without opening a default project, leaving me to create a new one or select from the list of recently used projects.)
With this information, Data Generator reads the database and presents a list of its tables. You select a table and specify table-level settings, notably the data source (generate the data, or use an existing data source such as a table or query) and the number of rows to generate.
Next, you expand the table’s column list, select a column, and specify the attributes to apply to it. The attributes you see depend on the nature of the column, and you can choose among the data generators relevant to the column type. Depending on how thoroughly you specified column values and table relationships in your database, some generation choices may be pre-defined (as in the case of identity keys and foreign keys). For nullable columns, you may also specify the percentage of nulls you want in the generated data.
When you have specified all the column properties as thoroughly as you wish, you save the project. Next, you can specify project settings, including scripts to run before and after data generation. In the case of the Database From Hell, I couldn’t figure out any way to populate a particular column with either a trigger or a calculated column formula, so I added a post-generation script that called a user-defined function and updated the column.
Wrestling with the Test Database From Hell
My version of the Database From Hell was not identical to the real thing. For design reasons, I made some changes, and I used only 50 “answer” columns, due primarily to lack of disk space. There are three tables in my version (sketched in DDL after the list):
- Respondents: a list of people, 10 million rows.
- Questionnaires: a list of questionnaires, 1000 rows.
- RespondentAnswers: one row per respondent-questionnaire combination, containing foreign keys to the other two tables and all the bit and char columns; 20 million rows.
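A minimal DDL sketch of this schema, for orientation; only the table names, keys, and the columns discussed in the text come from the article, and everything else is an illustrative assumption:

```sql
-- Sketch of the test schema. Column names and types beyond those the
-- text mentions are assumptions.
CREATE TABLE dbo.Respondents (
    RespondentID int IDENTITY(1,1) PRIMARY KEY,
    FirstName    varchar(50),
    LastName     varchar(50),
    ZipCode      char(5)
);

CREATE TABLE dbo.Questionnaires (
    QuestionnaireID int IDENTITY(1,1) PRIMARY KEY,
    Title           varchar(100)
);

CREATE TABLE dbo.RespondentAnswers (
    RespondentAnswerID int IDENTITY(1,1) PRIMARY KEY,
    RespondentID    int NOT NULL REFERENCES dbo.Respondents (RespondentID),
    QuestionnaireID int NOT NULL REFERENCES dbo.Questionnaires (QuestionnaireID),
    Answer0  char(40) NULL,   -- the "special" column discussed below
    Answer1  bit NULL,
    Answer2  bit NULL,
    -- ... Answer3 through Answer39 follow the same pattern ...
    Answer40 bit NULL
);
```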
On my hardware, generating this many rows took half an hour, including the time it took to run the post-generation script to update the “special” column. The manufactured data was flawless. All my constraints were respected. For variety’s sake, I supplied a different percentage of nulls for all the bit and char(1) columns.
We could take several approaches to querying this table:

- Build up a WHERE clause that mentions the values of interest in any combination of the columns.
- Create a covering index just before running each new query, including in its key the columns of interest. There is a problem here, of course: we can’t index a bit column. But we could translate the bit columns into char(1) columns whose possible values are Y, N, and null. This lets us create an index that covers the RespondentID and all the columns of interest, but it does not escape the need to recreate the index for each new query. At least the index would quickly isolate the rows of interest.
After trying the first two approaches, I remained dissatisfied with the results, so I decided to try something else and came up with the idea of the “special” column: an amalgam of the values in the bit, char(1), and char(2) columns. To achieve this, I created a function that examines a bit column and returns one of three character values: Y, N, or space. More precisely, the UDF I created accepts 40 parameters, one for each of the bit columns in the table. It examines each value and assembles a 40-character string consisting of some combination of Ys, Ns, and spaces. The “special” column is then updated with this result string.
This approach delivers three important gains. First, only one index needs to be created, no matter what the query specifies. Second, it neatly sidesteps the limit on the number of columns that can be included in an index. Finally, all the required values are already present in the index, which eliminates the need to visit the table at all. How? By using the LIKE operator.
An example query:
```sql
SELECT RespondentAnswerID, RespondentID, QuestionnaireID, Answer0
FROM dbo.RespondentAnswers
WHERE Answer0 LIKE 'Y_NN_?N%'
-- no rows returned; 13 seconds
```
And another:
```sql
SELECT RespondentAnswerID, RespondentID, QuestionnaireID, Answer0
FROM dbo.RespondentAnswers
WHERE Answer0 LIKE 'NNN NYYY%'
-- 3913 rows returned in 8 seconds
```
For brevity, these examples address only the first several “columns” (that is, the leading positions of the string), but the idea holds for the entire column.
In terms of performance, it’s not entirely surprising that the fastest queries are those that specify values of interest in the leading positions of the string. The worst-case queries ignore the leading positions and interrogate values near the end of the string. Even these queries, however, do not need to visit the actual table.
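For instance, a worst-case pattern might constrain only the last few positions, as in this sketch (the pattern itself is invented); REPLICATE builds the run of single-character wildcards that skips the leading positions:

```sql
-- Worst-case sketch: 36 underscores (single-character wildcards) skip
-- positions 1-36, so only positions 37-40 are constrained.
SELECT RespondentAnswerID, RespondentID, QuestionnaireID, Answer0
FROM dbo.RespondentAnswers
WHERE Answer0 LIKE REPLICATE('_', 36) + 'YNNY';
```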
Listing 1 shows the function that I wrote to populate the “special” column.
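Since Listing 1 accompanies the full article, here is only a minimal three-parameter sketch of the same idea, to show the shape of the function; the real version accepts all 40 bit columns:

```sql
-- Simplified sketch of the Listing 1 idea: map each bit to 'Y', 'N', or
-- space, then concatenate the results into one fixed-width string.
CREATE FUNCTION dbo.BitToChar3_ufn (@b1 bit, @b2 bit, @b3 bit)
RETURNS char(3)
AS
BEGIN
    RETURN CASE @b1 WHEN 1 THEN 'Y' WHEN 0 THEN 'N' ELSE ' ' END
         + CASE @b2 WHEN 1 THEN 'Y' WHEN 0 THEN 'N' ELSE ' ' END
         + CASE @b3 WHEN 1 THEN 'Y' WHEN 0 THEN 'N' ELSE ' ' END;
END;
```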
Data Generator offers the ability to run pre- and post-generation scripts. I ran the following UPDATE statement as a post-generation script:
```sql
UPDATE dbo.RespondentAnswers
SET Answer0 = dbo.BitToChar_ufn(
      Answer1,  Answer2,  Answer3,  Answer4,  Answer5,
      Answer6,  Answer7,  Answer8,  Answer9,  Answer10,
      Answer11, Answer12, Answer13, Answer14, Answer15,
      Answer16, Answer17, Answer18, Answer19, Answer20,
      Answer21, Answer22, Answer23, Answer24, Answer25,
      Answer26, Answer27, Answer28, Answer29, Answer30,
      Answer31, Answer32, Answer33, Answer34, Answer35,
      Answer36, Answer37, Answer38, Answer39, Answer40
);
```
It took a little over an hour and twelve minutes to populate the column with 20 million values, but as mentioned, I only needed to do this once. Thereafter, any of my queries on any combination of the columns could exploit the single index.
The index is a compound index, using the Answer0 column plus the FK pointing to the Respondents table. Thus there is never a need to visit the actual table: all the information we need resides in the index.
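In DDL, that index looks something like the following; the index name is my assumption:

```sql
-- Compound index on the "special" column plus the foreign key to
-- Respondents, as described above; the name is assumed.
CREATE NONCLUSTERED INDEX IX_RespondentAnswers_Answer0
ON dbo.RespondentAnswers (Answer0, RespondentID);
```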
Not surprisingly, retrieval times varied with how far from the left end of the string the positions of interest sat: values toward the left end resulted in much quicker retrievals. Even the slowest queries, however, were substantially faster than a full table scan.
So?
Obviously, not every database can take advantage of the approach outlined here, but this sort of requirement comes up quite often in databases geared toward marketing applications. If you have such a database, you may find this technique a significant performance booster.
The illustration for this article was derived from some of the illustrations in Eric Liberge’s excellent ‘Tonnerre Rampant’, published by Soleil, and is used with permission. Many thanks to Eric for his generosity.