Celko on SQL: Natural, Artificial and Surrogate Keys Explained
SQL programming guru Joe Celko offers a classification scheme and advice on using the right keys.
Joe Celko |
There is no such thing as a "universal, one-size-fits-all" key. Just as no two sets of entities are the same, the attributes that make them unique have to be found in the reality of the data. I invented a classification scheme years ago that seems to be catching on (see table below). You can decide on the kind of key you want to use based on the nature of your particular situation.
1. A natural key is a subset of attributes which occur in the data model and act as a unique identifier. They are seen by the user. You can go to the external reality and verify them. You would also like to have some validation rule, such as look-ups and check digits in a UPC code (See "QUOTE" and reread the parts about validation and verification if you have questions).
2. An artificial key is an extra attribute added to the table which is seen by the user. It does not exist in the external reality, but can be verified for syntax or check digits inside itself. It's up to the DBA to maintain a trusted source for them. For example, a user can assign the open codes in the UPC scheme to products. In the grocery business, this might be where you find bread baked in the store, rather than bread bought from a national bakery chain. The check digits still work the same way to validate the barcodes, but you have to verify them inside your own enterprise.
Another technique is to take a natural key that is made up of many known attributes and put them into a formula to generate a key -- a hashing function. There are several downsides to this approach. The hashing function has to be perfect (that means no collisions or duplicate values from different inputs), and you still have to guarantee the uniqueness of the subset of attributes that make up the natural key. You will sometimes see an artificial key called a "surrogate key," but this is wrong according the definition of a surrogate given by [relational database pioneer] Dr. Edgar F. Codd.
If you have to construct a key yourself, it takes time to design them, to invent a validation rule, set up audit trails, etc. It's not easy work.
3. An "exposed physical locator" is not based on attributes in the data model and is exposed to the user. There is no reasonable way to predict it or verify it, since it usually comes from the physical state of the hardware at the time of data insertion. The system obtains a value thru some physical process in the storage hardware totally unrelated to the logical data model – for example, IDENTITY columns or other proprietary, non-relational auto-numbering devices.
Technically, these are not really keys at all, since they are attributes of the physical storage and are not even part of the logical data model. But they are handy for lazy, non-RDBMS programmers who do not want to research or think! This is the worst way to program in SQL.
4. A surrogate key is system-generated to replace the actual key behind the covers where the user never sees it. It's based on attributes in the table. Example: Teradata hashing algorithms, indexes, pointer chains, ADABASE numbers, etc.
The fact that you can never see it or use it for DELETE and UPDATE or create it for INSERT is vital. If users can get to the surrogate key, they will screw up the data integrity by getting the real keys and these physical locators out of synch. The system must maintain them.
In "ACM Transactions on Database Systems," Dr. Codd wrote that "…database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them..."
This means that a surrogate ought to act like an index, hash table, bit vector or whatever; created by the user, managed by the system and NEVER seen by a user. That means never used in queries, DRI or anything else that a user does.
Codd also wrote the following:
"There are three difficulties in employing user-controlled keys as permanent surrogates for entities:
1. The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g. if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed.).
2. Two relations may have user-controlled keys defined on distinct domains (e.g. one uses social security, while the other uses employee serial numbers) and yet the entities denoted are the same.
3. It may be necessary to carry information about an Entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g. and applicant for a job and a retiree)."
These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution is to introduce entity domains that contain system-assigned surrogates.
Physical Appearance of a Key
This is part of designing data, as opposed to designing a database. It's a matter of heuristics, based on human psychology, not on mathematical rules. It might sound a little silly, but they do seem to work. So, let's assume you now have to invent a key from scratch.
1. Fixed length strings are usually better than varying length strings. This gives you another constraint for validation and it is very easy to enforce on both the front and back ends. For example, you can tell if a string of digits is a ZIP code or a telephone number just by counting the digits.
2. Avoid mixing digits and letters in the same positions in the string. This will let you write a very simple regular expression for validation. It also lets a human being figure out the code. For example, the Canadian Postal codes use '[A-Z][0-9][A-Z] -[0-9][A-Z]{0-9]' as their regular expression pattern. It also makes check digits easier if you use only digits (if you don't understand check digits, check out Identification Numbers and Check Digit Schemes, by Joseph Kirkland, at www.maa.org).
3. Allow for growth. That sort of contradicts heuristic #1 about a fixed-length string. But you can use zeroes, blanks or other characters (or define certain code ranges as unused) to provide room for growth.
In the old days of punch cards, the fields were always fixed length, so if you used the digits for a code in a one-character card column and suddenly found you needed more than ten values, you had to use alphas or special characters. This messed up a lot of things. Your sort order now depended on ASCII or EBCDIC hardware. People could not read the codes easily. Special characters varied from key punch machine to key punch machine. If you thought ahead and allowed two-card columns with a leading zero, you were sitting pretty.
Finding a Key
So, how do you start to look for a key when you are setting up a database? I recommend these three steps in the order presented below:
1. Look for an industry standard and the trusted external source that maintains and verifies it. Almost every industry standard code has a regular expression and/or a check digit of some kind for validation. Today, you can Google all of these things or go to industry related chat groups and discover industry standards with less than a week of research even if you are totally ignorant of that industry.
You can often download the code from a Web site. Even better, the industry group that set up the standard will maintain it for you, providing updates for small changes and migration paths in the event of radical changes.
I classify industry standards as natural keys because they share the same properties of familiarity, validation and verification that physically derived natural keys do.
2. Look for a natural key in the attributes. Example: (longitude, latitude) makes a good key for a geographical location. A GPS can be used to verify it. There are some special problems with this kind of key. You need to determine a unit of measure and a scale. I could also use a polar coordinate system for a geographical location.
3. If you must design a new identifier, plan it carefully — especially if people will see and use it. You have to be able to verify it in application programs, so you should have a regular expression, other syntax rules or check digits. You have to be able to verify in the reality of the model or with a trusted source that you maintain.
The "auto-increment" crowd thinks their "fake pointer chains" or "fake record numbers" are surrogate keys. They are actually exposed physical locators. To reiterate an earlier point, there is no such thing as a "universal, one-size-fits-all" key. This is a superstitious belief taken from Kabala and other forms of Numerology. Why would anyone think that everything in the Universe has a mystic number divinely reveled by an internal counter in the current release of one vendor's database product? Now, if you were implementing your RDBMS on a Magic Eight Ball, that would be different!
Don't miss this month's Celko Puzzle on Email Address Validation.
Joe Celko is an independent consultant in Austin, Texas, and the author of SQL Puzzles and Answers (2006), Joe Celko's SQL for Smarties: Advanced SQL Programming (2005), and Joe Celko's Trees and Hierarchies in SQL for Smarties (2004).
About the Author
You May Also Like
Maximizing cloud potential: Building and operating an effective Cloud Center of Excellence (CCoE)
September 10, 2024Radical Automation of ITSM
September 19, 2024Unleash the power of the browser to secure any device in minutes
September 24, 2024