He's back! Author, SQL expert, puzzlemaster and Ming the Merciless look-alike Joe Celko offers his sage (if offbeat) advice on unique identifiers, validation, verification, trusted sources and relational keys.

Joe Celko, Contributor

August 19, 2007

8 Min Read

Joe Celko
Joe Celko

It's good to be back writing for Intelligent Enterprise after a long break. I was writing for Intelligent Enterprise before it was Intelligent Enterprise. Let me explain. In the early '90s, I had a column ("Celko on SQL") in a magazine named Database Programming & Design. I then moved my column over to DBMS magazine, a competitor, in 1993. Both magazines were later combined into Intelligent Enterprise, which soon after joined the CMP portfolio along with a bunch of other Miller-Freeman publications. Did I mention that I had columns in CMP magazines decades before all of this?

In spite of all the changes, I'm still me. Aristotle's Law of Identity guaranteed that I would not become J. K. Rowling. Darn! This leads to the questions, what is an identifier, why is a thing itself and what are desirable properties of relational keys when we want to model things in a database? (Celko fans, check out Joe's "Can't Be Your Own Supervisor" SQL Puzzle blog and post your own answer.)

Uniqueness is Global, Not Local

Everyone agrees that an identifier should be unique. But new database programmers often think that being locally unique is good enough. This leads them to cheerfully use a proprietary auto-increment or IDENTITY column as a PRIMARY KEY instead of as a proper relational key. Aristotle's Law of Identity is usually stated as "A is A," but there was a lot more to it than a formula. Every entity has a nature and follows its nature; every entity is separate and unique from all others in the universe.

Auto-numbering to build a key is simply dead wrong. It's a local value created in the physical hardware that has nothing to do with the global nature of the entity. If you don't like Aristotle, then consider this geek joke:

"I don't know my mother's phone number, but it's "1" on my speed dial."

"If you lose your phone, how can you call her?"

"I can borrow someone else's phone, dial "1" and get her."

"That's not going to work!"

"Sure it will! I set the Operator to zero and tested it!"

A Natural Key can be inherent characteristics, such as DNA signatures, fingerprints and longitude/latitude pairs. You need a scale and measurement system so you can test them with an instrument to confirm them. I don't know the rules for DNA matching, but in the U.S., fingerprints are classified in the Henry-Galt system (other countries have different systems) and I can get my location with a GPS built into a wrist watch.

National and International Standards are also Natural Keys. For example, the ISO 3779 Vehicle Identification Number (VIN) can be verified by lifting the hood of the car and reading the 17 characters from the engine block. I can use the UPC and EAN barcodes on retail products for ordering and inventory anywhere on Earth.

What makes them Natural Keys is that they can be verified and validated in the real world, and I have a trusted source for the encodings. Let's talk about those traits.

Validation Is Internal

Validation means that when I look at a key, I can tell if it even possible. Validation is internal to the encoding. Take a VIN for example; it has the manufacturer, make and model of the vehicle encoded into it. I can look at the format and determine if it's a valid VIN number because only certain characters appear in certain positions and only certain combinations are allowed. I can look at the VIN and see if it's possible or absurd. Honda does not make a Diablo and Lamborghini does not make a Civic.

There are two common techniques to improve validation. One is adding a check digit to the encoding and the other is a regular expression, but these are topics for another column.

Verification Is External

Verification means that the key value is possible, but I need to find out if it's real or not. Verification is external to the encoding. If the parts of the VIN are in the correct format, I still need to contact the manufacturer to ensure that the VIN was actually issued. If Honda made 1,000,000 Civics, then a VIN for the 1,000,001-th Civic is a fake.

I can go online to a Department of Motor Vehicles and find out if a car has been destroyed or not. Or I can visit several commercial Web sites to get the history (accidents and so on) on any vehicle using its VIN.

Look for a Trusted Source

This leads to the concept of a "Trusted Source," which can provide verification. It's really nice if the trusted source is an industry standard supported by ANSI, ISO, the US Federal Government or some other agency with legal authority to require compliance. But a strong industry tradition or trade association standard will do just as well. People forget that our time zones were set up by the railroads, not the government.

The worst case is to have to set up a trusted source within your enterprise. That means you have to allocate resources for the encoding, do validations and verifications and all that stuff. There is usually no escape from it. Just consider invoice numbers, account numbers and other things that your company does based on its own accounting system.

Over the last few decades, efforts like EDI have attempted to replace more of these local codes with global codes in the general commercial world.

Decide on the Degree of Trust

This leads to the concept of "How trusted is my source?" My local grocery store believes that the check I cash is good, that the address on the check is current and that Texas Drivers license number are correct. If I produced a license with the picture of a tiny Hispanic woman that did not match the name on the check and had an out-of-state address, they would question it. But as long as the ID looks good and has a picture of a bald white male who looks like Ming the Merciless of Mongo, they will probably cash the check.

When I travel to certain countries, I need a birth certificate and a passport. This is a higher degree of trust. For some security settings, I need to provide fingerprints. In some medical situations, I need to provide DNA, which is probably the highest degree of trust.

The degree of trust is related to the degree of risk. Bouncing 1 to 2 percent of the checks at a grocery store is acceptable, and it's computed back into the enterprise. Wrongly executing 1 to 2 percent of the convicted inmates in a prison will lead to a public outcry, so that's leading to more DNA testing.

Invariant Value, Format and Change

Many people believe that a primary key:

Must have an invariant value

Must have an invariant format

They use this as an excuse for auto-numbering and other local, exposed physical locators in place of correct relational keys.

This year, the retail industry in the United States is switching from the 10-digit UPC barcode on products to the 13-digit EAN system and the ISBN (International Standard Book Number) is falling under the same scheme. Clearly, this violates both those assumptions. Hell, even the check digit is different! But the retail industry is still alive and well. Why?

The most important property of a key is that it must ensure uniqueness. But that uniqueness does not have to be eternal. Nor does the format have to be fixed for all time. They simply have to be valid and verifiable at the time I use it.

The retail industry has assured that the old and the new barcodes will identify the same products by a carefully planned migration path that will allow us to change the values and the formats of one of the most common identifiers on Earth. The migration path started with changing the length of the old UPC code columns from 10 to 13 and padding them with leftmost zeros.

In a well-designed RDBMS product, referenced keys are easy to change. Thus, I might have an Inventory table that is referenced in the Orders table. The physical implementation could be a pointer in the Orders table back to the single value in the Inventory table. Or I could use bit vectors that reference domains and simply change the domains. Or it could be a minimal perfect hashing function.

Weird Ideas

One of the weirdest Newsgroup posting I found on this topic was the assertion that industry standard identifiers were just as arbitrary as a locally generated auto-number. The reasoning seemed to be that since both use the same alphanumeric characters, they were alike. Perhaps this "Cowboy Coder," who could ignore ISO, ANSI, the entire EU and every trade organization on Earth that sets standards, wants to replace the Metric System (more properly named "Systme International d'Units" or SI ) with Donald Knuth's "Potrzebie System of Weights and Measures," from MAD magazine #33. Hey, it was invented by a famous computer scientist and it has a decimal base. Since it has a decimal base it is as good as SI!

Joe Celko is an independent consultant in Austin, Texas, and the author of SQL Puzzles and Answers (2006), Joe Celko's SQL for Smarties: Advanced SQL Programming (2005), and Joe Celko's Trees and Hierarchies in SQL for Smarties (2004).

About the Author(s)

Never Miss a Beat: Get a snapshot of the issues affecting the IT industry straight to your inbox.

You May Also Like


More Insights