Just as consistent columnar data aids compression, you can improve compression optimization by sorting data before loading. comScore uses Syncsort DMExpress software to sort data alphanumerically before it's loaded into Sybase IQ. Where 10 bytes of unsorted data can be compressed to three or four bytes, says Michael Brown, comScore's chief technology officer, pictured above, 10 bytes of sorted data can typically be crunched down to one byte. "That makes a huge difference in the volume of data we have to store," Brown says.
Sorting also can streamline processing. comScore sorts URL data to minimize Web site taxonomy lookups. Instead of loading the 40 URLs for Web site pages in the order they were visited during a session, sorting might reveal that 20 of those pages were on Facebook, 12 were on GMail and the balance were at NYTimes.com. The sorted data would trigger just three site lookups whereas unsorted data might trigger many redundant lookups if the visitor bounced back and forth among just a few sites. "That saves a lot of CPU time and a lot of effort," Brown says. It's possible to sort data with SQL statements, and custom scripts, but sorting is also a common feature in data-integration software from IBM, Informatica, Oracle, SAP, SAS, Syncsort, and others. At truly large scale, Hadoop is an option for sorting and other processing steps.