The Database From Hell


This article is about two subjects: a database administered by a friend and colleague, which we affectionately call the Database From Hell, and Red Gate’s forthcoming SQL Data Generator, currently in beta release. Before we get to the Database From Hell, a brief word about SQL Data Generator and the database:

Data Generator, as its name implies, is all about manufacturing data to test performance, scalability and validity. Beta release 1 already boasts some wonderful features, and more are promised for the next round.

The 65-million-row, 600-column table

The Database From Hell is a marketing database whose real-world implementation comprises a single table with 600+ columns and 65 million rows. Its data derives from dozens of questionnaires. This single table stores respondent information (name, address, zip code, etc.) along with the answers to various questionnaires. Most of the 600+ columns are char(1) or char(2); some are dates recorded as character. The original data comes from a mainframe, gathered from dozens of sources.

Typically, a client wants a list of people who match some combination of attributes. For example, the client might want to launch a campaign whose target is the group of black or Hispanic Americans who are married, own a BMW, have purchased an HDTV, have more than two children, and make more than $90,000 per year. The variations are almost endless. All these columns contain codes. All are nullable.
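To make the pattern concrete, here is a hypothetical rendering of such a request as SQL against the original single-table design (the table name, column names and code values are all invented for illustration; the real table’s names differ):

    -- Every column holds a nullable code, so each predicate compares codes.
    SELECT RespondentID, LastName, FirstName, ZipCode
    FROM MarketingData
    WHERE Ethnicity IN ('B', 'H')   -- black or Hispanic
      AND MaritalStatus = 'M'       -- married
      AND CarMake = 'BM'            -- owns a BMW
      AND OwnsHDTV = 'Y'
      AND ChildCount >= '3'         -- more than two children
      AND IncomeBand >= 'H';        -- code for $90,000+ per year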

My own version differs somewhat from the real-world database. For one thing, I broke the single table into three, called Respondents, Questionnaires and RespondentAnswers. The latter has foreign keys pointing to the former two tables. I also cut down the number of columns, and changed all the char(1) columns to bit columns, primarily to save disk space.

Even in my smaller version, indexing poses a problem. My friend’s approach has been to create a covering index for each new query posed by his client. (A covering index includes all the columns of interest, plus the primary key.) This raises the problem that a very detailed query might use too many columns to be indexed: SQL Server caps an index key at 16 columns and 900 bytes. In addition, every new query incurs the time taken to create its covering index.
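As a sketch, reusing the hypothetical column names above, one such throwaway covering index might look like this:

    -- One covering index per client query: the key holds the columns of
    -- interest plus the primary key, so the query never touches the table.
    CREATE INDEX IX_Campaign_BMW_HDTV
    ON MarketingData (Ethnicity, MaritalStatus, CarMake, OwnsHDTV, RespondentID);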

The test version with only 20 million rows

I don’t have sufficient disk space to create the full table on my development machine, so I created a smaller version, with only 40-something columns and 20 million rows. But I had an idea for a unique approach to this problem, and I wanted proof of concept.

Data Generator to the rescue

Data Generator to the rescue! It can build the required rows very quickly, and then you’re in a position to answer some important questions.

I should point out that Data Generator is actually a collection of data generators, each tailored to a specific kind of data. For each column in your table(s), you assign a generator. As shipped, the product includes the following generators, grouped according to domain and purpose:

Personal

  • Color: AliceBlue, AntiqueWhite
  • Contact Title: Marketing Manager, Owner, Partner
  • Email: generated email address
  • Email (internet-safe): A1.example.com
  • Extension: 1234
  • First Name: Arthur, Alice, Brian, Betty
  • First Name (female): Ava, Bette, Charlene
  • First Name (male): Allan, Boris, Charles
  • Initials: F., B., HR
  • Last Name: Smith, Browne, Clarke, Davis
  • Name (full): Leonard di Capreo
  • Nick Name: Chris, Tommy, JLo
  • Phone Number: (310) 555-1212, 212-4365
  • Social Security Number: 950-95-8060
  • Title: Mr., Mrs., Ms., Dr.

Geographical

  • Address Line (Street #): 123 W. Second Ave.
  • Address Line 1: Suite 208, Seventh Floor, Bldg. A.
  • Country: long name of country (Canada)
  • Country Code: 3-letter ISO code (USA)
  • Country Code Short: 2-letter ISO code (US)
  • P.O. Box: P.O. Box 139, PO Box 140
  • Region: IK-MU, NC-JY
  • US City: New York, Los Angeles, Chicago
  • US State (long name): California, New York
  • US State code: CA, NY, KT
  • ZIP Code: 83243, 96112
  • ZIP+4 Code: 96212-1234

Generic

  • File List: generates a list from a given file
  • Regex Generator: used for formatted data
  • SQLColumn: reads data from a SELECT statement
  • String Generator: generates strings

Shopping

  • Product Name: Deluxe Frommelgrammet

Payment

  • Credit Card Date: 09/08
  • Credit Card Issue Number: 3, 1, 5, 7, 6
  • Credit Card Number: generated 16-digit number
  • Currency Code: USD, CAN, EUR

Business

  • 5-digit IDs: 67815, 98765
  • Account Number: RET36662
  • Company Name: Sipzapon Corporation, ABC Inc.
  • Department: Sales, Service, Finance
  • Folder (DOS): C:\Program Files\Red Gate
  • Folder (Web): http://pen.net/index.html
  • Sector: Aerospace, Horticulture, Transport
  • Source of Leads: Sales, Word of Mouth, Referral
  • URL: http://www.artfulsoftware.com/default.aspx

Despite its variety, if you’re anything like me, you’ll immediately spot holes in the list. For example, Canadian Social Insurance Numbers are formatted differently than US Social Security Numbers. Fortunately, you can create your own custom generators and add them to the list. For more information, see David Connell’s article and C# project (how to write a generator for sql data generator 1.0 beta).

Upon loading the program, the first thing you do is establish your connection to a server and a database. Then Data Generator creates a new Project called Project1. (In my experience, this leads to unnecessary keystrokes. If you mean to use an existing project, you must select File | Open, then respond to the dialog that asks if you want to save the new project, then you can open your old project. I would prefer the program to run without opening a default project, leaving me to create a new one or select from the list of recently used projects.)

With this information, Data Generator reads the database and presents a list of its tables. You select a table and specify table-level settings, notably the data source (generate the data, or use an existing data source such as a table or query) and the number of rows to generate.

Next, you expand the table’s column list, select a column, and specify the attributes to apply to it. The attributes you see depend on the nature of the column. You can choose among the data generators relevant to the column type. Depending on how thoroughly you specified the column values and table relationships in your database, some generation choices may be pre-defined (as in the case of identity keys and foreign keys). In the case of nullable columns, you may also specify the percentage of nulls you want in the generated data.

When you have specified all the column properties as thoroughly as you wish, you save the project. Next, you can specify project settings, including scripts to run before and after data generation. In the case of the Database From Hell, I couldn’t figure out any way to populate a particular column with either a trigger or a calculated column formula, so I added a post-generation script that called a user-defined function and updated the column.

Wrestling with the Test Database From Hell

My version of the Database From Hell was not identical to the real thing. For design reasons, I made some changes. I also used only 50 “answer” columns, due primarily to lack of disk space. There are three tables in my version, sketched below:

  • Respondents: a list of people, 10 million rows.
  • Questionnaires: a list of questionnaires, 1000 rows.
  • RespondentAnswers: one row per respondent-questionnaire combination, containing foreign keys to the other two tables and all the bit and char columns; 20 million rows.
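A minimal schema sketch, under assumptions (only the table names, the Answer0 column and the bit/char answer columns are described in this article; the remaining names and data types are guesses):

    CREATE TABLE Respondents (
        RespondentID int IDENTITY PRIMARY KEY,
        LastName varchar(50),
        FirstName varchar(50),
        ZipCode char(5)
    );

    CREATE TABLE Questionnaires (
        QuestionnaireID int IDENTITY PRIMARY KEY,
        Title varchar(100)
    );

    CREATE TABLE RespondentAnswers (
        RespondentID int NOT NULL REFERENCES Respondents (RespondentID),
        QuestionnaireID int NOT NULL REFERENCES Questionnaires (QuestionnaireID),
        Answer0 char(40) NULL,   -- the “special” column, described below
        Answer1 bit NULL,
        Answer2 bit NULL
        -- ... more bit and char(1)/char(2) answer columns
    );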

On my hardware, generating this many rows took half an hour, including the time it took to run the post-generation script to update the “special” column. The manufactured data was flawless. All my constraints were respected. For variety’s sake, I supplied a different percentage of nulls for all the bit and char(1) columns.

We could take several approaches to querying this table:

  • Build up a WHERE clause that mentions the values of interest in any combination of the columns.
  • Create a covering index just before running each new query, including in its key the columns of interest. There’s a problem here, of course: we can’t index a bit column. But we could translate the bit columns into char(1) columns whose possible values are Y, N and null. This lets us build an index that covers the RespondentID and all the columns of interest, but it does not escape the need to recreate the index for each new query. At least, though, the index would quickly isolate the rows of interest.

After trying the first two approaches, I remained dissatisfied with the results, so I decided to try something else, and came up with the idea of the “special” column: an amalgam of the values in the bit, char(1) and char(2) columns. To achieve this, I created a function that examines a bit column and returns one of three character values: Y, N or space. Actually, the UDF I created accepts 40 parameters, one for each bit column in the table. It examines each value and assembles a 40-character string consisting of some combination of Ys, Ns and space characters. The “special” column is then updated with this result string.

There are three important gains to this approach. First, only one index needs to be created, no matter what the query specification is. Second, it neatly sidesteps the limit on the number of columns that can be included in an index. Finally, all the required values are already present in the index, which eliminates the need to perform a table scan. How? By using the LIKE operator.

An example query:
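(A sketch; the positions queried are chosen arbitrarily for illustration.)

    -- Respondents who answered Y to the first and third questions; the
    -- underscore wildcard leaves the second position unconstrained.
    SELECT RespondentID
    FROM RespondentAnswers
    WHERE Answer0 LIKE 'Y_Y%';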

And another:
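(Again a sketch, this time with the worst-case shape: the leading positions are irrelevant, so the pattern begins with wildcards and constrains later positions only.)

    SELECT RespondentID
    FROM RespondentAnswers
    WHERE Answer0 LIKE '_____NY%';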

For brevity, these examples address only the first several positions (the original “columns”), but the idea holds for the entire string.

In terms of performance, it’s not entirely surprising that the fastest queries are those that specify values of interest in the leading positions of the string. The worst-case queries ignore the leading positions and interrogate values near the end of the string. Even these queries, however, do not need to visit the actual table.

Listing 1 shows the function that I wrote to populate the “special” column.
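A minimal sketch of such a function, cut down to three parameters (the real one accepts all 40 bit columns; the names here are assumptions), looks like this:

    CREATE FUNCTION dbo.AnswersToString
    (
        @Answer1 bit,
        @Answer2 bit,
        @Answer3 bit
        -- ... continues through @Answer40 in the full version
    )
    RETURNS char(40)
    AS
    BEGIN
        -- Map each bit to Y (1), N (0) or a space (NULL), preserving each
        -- column's ordinal position within the result string.
        RETURN CASE @Answer1 WHEN 1 THEN 'Y' WHEN 0 THEN 'N' ELSE ' ' END
             + CASE @Answer2 WHEN 1 THEN 'Y' WHEN 0 THEN 'N' ELSE ' ' END
             + CASE @Answer3 WHEN 1 THEN 'Y' WHEN 0 THEN 'N' ELSE ' ' END;
    END;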

Data Generator offers the ability to run pre- and post-generation scripts. I ran the following update statement as my post-generation script.
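Something along these lines would do it (a reconstruction, shortened to match the three-parameter sketch above; only the Answer0 column and the table name come from the article):

    -- Post-generation script: amalgamate the bit columns into Answer0.
    -- The real statement passes all 40 bit columns to the UDF.
    UPDATE RespondentAnswers
    SET Answer0 = dbo.AnswersToString(Answer1, Answer2, Answer3);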

It took slightly over an hour and twelve minutes to populate the column with 20 million values, but as mentioned, I only needed to do this once. Thereafter any of my queries on any combination of the columns could exploit the single index.

The index is a compound index, using the Answer0 column plus the FK pointing to the Respondents table. Thus there is never a need to visit the actual table: all the information we need resides in the index.
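A sketch of that index (the key column names follow the schema sketch above):

    -- Both key columns live in the index itself, so a LIKE query on Answer0
    -- is answered with an index scan at worst, never a visit to the base table.
    CREATE INDEX IX_RespondentAnswers_Answer0
    ON RespondentAnswers (Answer0, RespondentID);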

Not surprisingly, retrieval times varied with how far the positions of interest sat from the left end of the string: the nearer the left end, the quicker the retrieval. Even the slowest queries, however, were substantially faster than a full table scan.

So?

Obviously, not every database can take advantage of the approach outlined here, but this sort of requirement comes up quite often in databases geared toward marketing applications. If you have such a database, you may find this technique a significant performance booster.

The illustration for this article was derived from some of the illustrations in Eric Liberge’s excellent ‘Tonnerre Rampant’ from Soleil, and is used with permission. Many thanks to Eric for his generosity.
