Many years ago, my wife and I wrote an article for Datamation, a major trade publication at the time, under the title “Don’t Warehouse Dirty Data!” It has been referenced quite a few times over the decades but is nowhere to be found on Google these days. The point is that if you have ever written a report from data, you have no doubt felt the pain of dirty data; it is nothing new.
However, what we never got around to defining was exactly how data gets dirty. Let’s look at some of the ways data gets messed up.
Using Incomplete Inputs
I’m going to tell some anecdotal stories based on projects I worked on many decades ago. One was an automobile tag system for a state government. The legal setup had the state issue the physical tags, but each county (there were over 150 of them) sold the tags and added a decal with the name of the county to the plates. Each county clerk collected the money, deposited it in a bank account the clerk controlled (and collected the interest on it to run local affairs), and finally settled up with the state every few months.
Some of the counties were very large and urban and had their own computer systems; about a third of the state’s population was concentrated in one major urban area.
Some of the counties were very small and rural and did everything manually, so their data arrived in irregular bursts. The urban counties would send their data on magnetic tapes (that gives you an idea as to how long ago this was) that could immediately be put into what was passing for a database in those days. The rural counties’ information came in (if you were lucky) on punch cards. Sometimes it came in on paper forms that had to be keypunched at the state capitol.
I don’t think that modern programmers fully appreciate how convenient it is to send information over a network and not have to bother with physical storage devices like tapes and cards.
Because we had over 100 sources of data, with varying levels of quality, we were never sure what was going on. In fact, reconciling the books in time for the next year’s cycle was a nightmare. Our solution was simply to live with it; at least we could say it was not our fault and keep going.
Editor note: Practically no one uses tape and punch cards anymore, but situations like this live on today in information sent over the Internet. In fact, I am pretty sure that we have more issues with CSV files these days than they did back then!
Flaky Inputs
There’s an old joke about a clearly impaired person looking around a lamp post for his car keys. A stranger offers to help and asks him where he dropped the keys, and he points over to the grass. “So why are you looking here?” asks the stranger. The drunk explains that the light is better under the lamp post than in the grass.
Probably the most common version of this is using outdated information. You can’t get the current figures that you need, so you use the last set of figures that you had. All the bases are covered and the data is correct, but it’s not true anymore. Perhaps you don’t need current data; in some cases, this can be true. But you’d better prove it before you assume it.
Because the state had no power to demand timely data, and the counties had no incentive to give up the interest by turning in their revenue promptly, all the reporting was constantly out of sync.
Today, we have online terminals and systems that can generally force things to work in a timely manner. This is one of the reasons that your creditors want you to use online payments.
Not Inputting Complete Data
The ancient Greek philosopher Diogenes of Sinope met a man who tried to convince him that Poseidon was the great god of the sea by telling him miracle stories about sailors who prayed to Poseidon and were saved from drowning. “Their pictures and stories are kept in the temple of Poseidon where everyone can see them,” the religious man added. “And where,” asked Diogenes, “do they keep the pictures of the sailors who prayed to Poseidon and drowned?”
This error, known as survivorship bias, is particularly common in the social sciences. How many treatment programs have you seen that report only the people who completed the program and not how many people entered it? I think it was an old Pat Paulsen joke that the way to reduce accident statistics would be simply to stop reporting accidents.
Repairing The Inputs
“There comes a point when we need to stop just pulling people out of the river; we need to go upstream and find out why they are falling in.” – Archbishop Desmond Tutu
Returning to my story about the keypunch machines and the auto tag system, we found that one year there were suddenly an awful lot of farm vehicles being issued tags in a metropolitan area. The problem was traced down to one keypunch machine that had an improperly punched control card. Most of you have probably never seen a keypunch machine, but the control card was a punch card fixed to a drum which controlled how the machine worked. It would, in effect, hold down a numeric shift key, skip over columns, punch constant values in certain columns, and so forth. Google it if you’re interested in the history of this trade and how bad it used to be.
One of the machines had an error in its control card, so it messed up the code for the type of license plate being assigned. The result was that for a couple of months, a major metropolitan area was allegedly full of farm equipment. Try and find that after you’ve already sent out license plates in the mail!
A related problem with keypunch machines was that COBOL files were laid out in fixed-width columns. When the auto tags first started, the code for the type of tag was one digit in a card column. But then prestige tags came along, and since they are a great source of extra revenue, their numbers exploded. If veterans of one war had a tag, then veterans of another war wanted a tag. Rival universities wanted special tags. Special causes wanted prestige tags. The result was that the digits zero through nine couldn’t cut it.
Changing the size of the forms and the card layouts was physically impossible. But there was a trick with the old punch card machines – multi-punching! The keypunch operator could hold down a key, prevent the card from advancing, and hit another key, giving a non-EBCDIC configuration of holes in that column. It was up to an input program to read these combinations and translate them into a usable internal code for COBOL programs.
The problem was that the state had both IBM 026 and 029 keypunch machines, as well as some Univac equipment, and they did not all multi-punch the same way. The solution was that when a deck of punch cards came off one of these machines, you also had to put a card at the front of the deck to tell the program what the source of the deck was, so it could translate the multi-punches correctly.
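To put that translation step in modern terms, here is a minimal SQL sketch of the lookup the input program was effectively doing. The table, columns, and codes here are all invented for illustration; they are not the actual system.

```sql
-- Hypothetical sketch: per-machine translation of multi-punch patterns.
CREATE TABLE Multipunch_Translations
(machine_model CHAR(10) NOT NULL,   -- 'IBM 026', 'IBM 029', 'UNIVAC', ...
 punch_pattern CHAR(12) NOT NULL,   -- the twelve rows of one card column
 tag_type_code CHAR(2) NOT NULL,    -- internal code for the COBOL programs
 PRIMARY KEY (machine_model, punch_pattern));

-- Decoding a column is then a join between the raw card data
-- (tagged with its source machine from the header card) and the table:
SELECT C.card_seq, T.tag_type_code
  FROM Card_Columns AS C
       JOIN Multipunch_Translations AS T
         ON T.machine_model = C.machine_model
        AND T.punch_pattern = C.punch_pattern;
```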
If people put the wrong source card in front of a deck, the multi-punches were read incorrectly. Since the multi-punches were usually obscure cases in the encoding, people would not often notice when they were wrong. The result was incorrect data, but usually not enough of it to trigger an investigation. We simply accepted the bad data and went on, hoping to fix it later.
First moral of the story: sometimes your data gets dirty on input. Second moral: sometimes it’s not your fault.
Editor note: it bothers me how easy it is to translate these stories into things that happen today for far lazier reasons: reusing columns, or adding new values that aren’t defined in the system, just to get data in faster even if it isn’t correct.
Defining Terms
Having multiple definitions for the same term makes for problems, too. In some states, a three-wheeled motorcycle is considered a motorcycle; in other states, it’s licensed as if it were a small passenger car. If you move between two such states, the same vehicle can have different auto tags, tax rates, and licensing issues.
Naming Dogs And Bad Data Design
Sometimes the design of the data is such that you can’t recover from it, and it’s just going to be bad. The American Kennel Club allows you to register a dog by its breed and give it a name that consists of a name and a postfix. This is how you can have several dogs named Rover.
But instead of using standard Arabic numerals, they decided the postfix should be in Roman numerals and allowed six characters for it. I’m going to guess they were thinking this would make it look like the dog was royalty: “Rover I”, “Rover II”, “Rover III”, “Rover IV” and so forth. The rules for Roman numerals are a little convoluted. A string contains the letters I, V, and X for the values one, five, and ten. A string must not contain more than three contiguous repetitions of any one of these characters. If a character of lesser value appears before a character of higher value, then its value becomes negative. The value of the string is the sum of the character values. This means that we count I, II, III, IV, V, VI, VII, VIII, IX, X and so forth. The problem is that these rules stop at XXXIX = 39, so you can’t get much past thirty.
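Since the legal values run out at XXXIX = 39 (and XXXVIII = 38 needs seven characters, so it will not even fit in the six-character field), the whole domain is small enough to enumerate. A minimal sketch, with hypothetical names, is a translation table:

```sql
-- Hypothetical sketch: enumerate every legal suffix with its integer value.
CREATE TABLE Roman_Suffixes
(roman_suffix VARCHAR(6) NOT NULL PRIMARY KEY,
 suffix_value INTEGER NOT NULL UNIQUE
     CHECK (suffix_value BETWEEN 1 AND 39));

INSERT INTO Roman_Suffixes (roman_suffix, suffix_value)
VALUES ('I', 1), ('II', 2), ('III', 3), ('IV', 4), ('V', 5),
       ('VI', 6), ('VII', 7), ('VIII', 8), ('IX', 9), ('X', 10);
-- ... and so on, following the pattern up through ('XXXIX', 39)
```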
If this is what the client handed you, you are stuck. You’re going to have to kludge your way around this with code, or hope that your version of SQL has an ORDER BY <column name> [ROMAN | WESTERN] [ASCENDING | DESCENDING] option. I have never seen such a product.
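Lacking such an option, one workable kludge is to sort through the translation table sketched above. Again, the table and column names here are hypothetical:

```sql
-- Sort dogs by name, then by the integer value of the Roman-numeral postfix.
SELECT D.dog_name, D.roman_suffix
  FROM Dogs AS D
       LEFT OUTER JOIN Roman_Suffixes AS R
         ON R.roman_suffix = D.roman_suffix
 ORDER BY D.dog_name, R.suffix_value;
```

The LEFT OUTER JOIN keeps dogs whose postfix is missing or malformed; where those rows sort is up to your vendor’s NULL ordering rules.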
CHECK() Constraints
SQL is a language that actually has the ability to add constraints to the DDL. This is a topic I have written about before, but since traditional programming languages don’t have this option on their files, many programmers just don’t use constraints. One of my favorite examples was an order entry program that allowed the “order_quantity” to go negative. This led to a refund in the accounting system and a retirement opportunity for a less-than-honest employee. A CHECK constraint is tested when the data is input: one way, in one place, at one time, so it is consistent. It is also cheap to execute.
Imagine that 50 programs insert data into the same table. Even if you’re absolutely sure you can always write the correct code, do you want every one of those programs to recompute the check digit, validate the input range, apply the regular expression, and do whatever else is needed to assure that the input is correct?
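As a minimal sketch of the idea (the table and constraint names here are illustrative, not from the actual order entry system), one CHECK in the DDL guards all fifty programs at once:

```sql
-- One constraint, declared once, enforced for every program that inserts.
CREATE TABLE Order_Items
(order_nbr INTEGER NOT NULL,
 sku CHAR(12) NOT NULL,
 order_quantity INTEGER NOT NULL
     CONSTRAINT positive_order_quantity
     CHECK (order_quantity > 0),
 PRIMARY KEY (order_nbr, sku));

-- Any of the fifty programs that tries this now gets an error, not a refund:
INSERT INTO Order_Items (order_nbr, sku, order_quantity)
VALUES (1234, 'WIDGET-01', -5);   -- fails the CHECK constraint
```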
Conclusion
Relational databases give you the ability to avoid a lot of bad and dirty data, so use that ability whenever you can. It is more work up front, but it will save you considerable time in the future!