Celko on SQL: Natural, Artificial and Surrogate Keys Explained
SQL programming guru Joe Celko offers a classification scheme and advice on using the right keys.
There is no such thing as a "universal, one-size-fits-all" key. Just as no two sets of entities are the same, the attributes that make them unique have to be found in the reality of the data. I invented a classification scheme years ago that seems to be catching on (see table below). You can decide on the kind of key you want to use based on the nature of your particular situation.
1. A natural key is a subset of attributes which occur in the data model and act as a unique identifier. They are seen by the user. You can go to the external reality and verify them. You would also like to have some validation rule, such as look-ups and check digits in a UPC code (See "QUOTE" and reread the parts about validation and verification if you have questions).
2. An artificial key is an extra attribute added to the table which is seen by the user. It does not exist in the external reality, but can be verified for syntax or check digits inside itself. It's up to the DBA to maintain a trusted source for them. For example, a user can assign the open codes in the UPC scheme to products. In the grocery business, this might be where you find bread baked in the store, rather than bread bought from a national bakery chain. The check digits still work the same way to validate the barcodes, but you have to verify them inside your own enterprise.
Another technique is to take a natural key that is made up of many known attributes and put them into a formula to generate a key -- a hashing function. There are several downsides to this approach. The hashing function has to be perfect (that means no collisions or duplicate values from different inputs), and you still have to guarantee the uniqueness of the subset of attributes that make up the natural key. You will sometimes see an artificial key called a "surrogate key," but this is wrong according the definition of a surrogate given by [relational database pioneer] Dr. Edgar F. Codd.
If you have to construct a key yourself, it takes time to design them, to invent a validation rule, set up audit trails, etc. It's not easy work.
3. An "exposed physical locator" is not based on attributes in the data model and is exposed to the user. There is no reasonable way to predict it or verify it, since it usually comes from the physical state of the hardware at the time of data insertion. The system obtains a value thru some physical process in the storage hardware totally unrelated to the logical data model – for example, IDENTITY columns or other proprietary, non-relational auto-numbering devices.
Technically, these are not really keys at all, since they are attributes of the physical storage and are not even part of the logical data model. But they are handy for lazy, non-RDBMS programmers who do not want to research or think! This is the worst way to program in SQL.
4. A surrogate key is system-generated to replace the actual key behind the covers where the user never sees it. It's based on attributes in the table. Example: Teradata hashing algorithms, indexes, pointer chains, ADABASE numbers, etc.
The fact that you can never see it or use it for DELETE and UPDATE or create it for INSERT is vital. If users can get to the surrogate key, they will screw up the data integrity by getting the real keys and these physical locators out of synch. The system must maintain them.
In "ACM Transactions on Database Systems," Dr. Codd wrote that "…database users may cause the system to generate or delete a surrogate, but they have no control over its value, nor is its value ever displayed to them..."
This means that a surrogate ought to act like an index, hash table, bit vector or whatever; created by the user, managed by the system and NEVER seen by a user. That means never used in queries, DRI or anything else that a user does.
Codd also wrote the following:
"There are three difficulties in employing user-controlled keys as permanent surrogates for entities:
1. The actual values of user-controlled keys are determined by users and must therefore be subject to change by them (e.g. if two companies merge, the two employee databases might be combined with the result that some or all of the serial numbers might be changed.).
2. Two relations may have user-controlled keys defined on distinct domains (e.g. one uses social security, while the other uses employee serial numbers) and yet the entities denoted are the same.
3. It may be necessary to carry information about an Entity either before it has been assigned a user-controlled key value or after it has ceased to have one (e.g. and applicant for a job and a retiree)."
These difficulties have the important consequence that an equi-join on common key values may not yield the same result as a join on common entities. A solution is to introduce entity domains that contain system-assigned surrogates.
The Agile ArchiveWhen it comes to managing data, donít look at backup and archiving systems as burdens and cost centers. A well-designed archive can enhance data protection and restores, ease search and e-discovery efforts, and save money by intelligently moving data from expensive primary storage systems.
2014 Analytics, BI, and Information Management SurveyITís tried for years to simplify data analytics and business intelligence efforts. Have visual analysis tools and Hadoop and NoSQL databases helped? Respondents to our 2014 InformationWeek Analytics, Business Intelligence, and Information Management Survey have a mixed outlook.
Join us for a roundup of the top stories on InformationWeek.com for the week of December 14, 2014. Be here for the show and for the incredible Friday Afternoon Conversation that runs beside the program.