Deep Insight on Soon-to-be-Public QlikTech
I got to spend a couple of hours on the phone with QlikTech's Hakan Wolge, who wrote 70-80% of the code in QlikView 1.0, and remains in effect QlikTech's chief architect to this day.
QlikTech* finally decided both to become a client and, surely not coincidentally, to give me more technical detail about QlikView than it had when last we talked a couple of years ago. Indeed, I got to spend a couple of hours on the phone not just with Anthony Deighton, but also with QlikTech's Hakan Wolge, who wrote 70-80% of the code in QlikView 1.0, and remains in effect QlikTech's chief architect to this day.
*Or, as it now appears to be called, Qlik Technologies.
Let's start with some quick reminders:
QlikTech makes QlikView, a widely popular business intelligence (BI) tool suite.
QlikView is distinguished by the flexibility of navigation through its user interface.
To support this flexibility, QlikView preloads all data you might want to query into memory.
Let's also dispose of one confusion right up front, namely QlikTech's use of the word associative:
Notwithstanding QlikTech's repeated use of phrases like "QlikView's unique, patented in-memory associative technology," there is nothing "associative" about QlikView's data structures.
Rather, "associative" is a term that can reasonably be used to describe the functionality of QlikView's user interface. In particular, QlikView can "associate" over fields that have the same name, in that it makes it easy for users to join across them.
With that out of the way, let's turn to some highlights of QlikView's underlying technology.
For the most part, QlikView's in-memory data structures are quite simple. In particular:
QlikView data is stored in a straightforward tabular format.
QlikView data is compressed via what QlikTech calls a "symbol table," but I generally call "dictionary" or "token" compression.
QlikView typically gets at its data via scans. There is very little in the way of precomputed aggregates, indexes, and the like. Of course, if the selection happens to be in line with the order in which the records are sorted, you can get great selectivity in a scan.
One advantage of doing token compression is that all the values in a column wind up being the same length. Thus, QlikView holds its data in nice arrays, so the addresses of individual rows can often be easily calculated.
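To make the dictionary/token idea concrete, here is a minimal Python sketch. It is not QlikView's actual code or format; the function names and the sample column are made up for illustration. Each distinct value is stored once in a symbol table, every row holds a fixed-width integer code, and a selection is answered by a plain scan over those codes.

```python
from array import array


def dictionary_encode(values):
    """Return (symbols, codes): distinct values plus one fixed-width code per row."""
    symbols = []        # code -> original value (the "symbol table")
    code_of = {}        # original value -> code
    codes = array("I")  # unsigned ints, all the same width
    for value in values:
        code = code_of.get(value)
        if code is None:
            code = len(symbols)
            code_of[value] = code
            symbols.append(value)
        codes.append(code)
    return symbols, codes


def scan_equal(symbols, codes, wanted):
    """Full scan: row numbers whose decoded value equals `wanted`."""
    try:
        target = symbols.index(wanted)
    except ValueError:
        return []  # value never occurs in the column
    return [row for row, code in enumerate(codes) if code == target]


# Hypothetical sample column.
region = ["EMEA", "APAC", "EMEA", "AMER", "APAC", "EMEA"]
symbols, codes = dictionary_encode(region)

# Because every code has the same width, a row's byte offset is simple arithmetic.
offset_of_row_3 = 3 * codes.itemsize
```

Since the codes are fixed-width, finding row `n` needs no index structure, which is the "easily calculated" addressing described above.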
To get its UI flexibility, QlikView implicitly assumes a star/snowflake schema. That is, there should be exactly one possible join path between any pair of tables. In some cases, this means one will want to rename fields as part of QlikView load scripts. For example:
If two keys are meant to be joined on, you might want to give them the same name.
If two columns have the same name and mean different things (e.g., different kinds of dates), you can give them different names.
You can mark which columns you do or don't want to have "qualified" names - i.e., table-specific modifications that force the names to be unique.
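The renaming rules above can be sketched in a few lines of Python. This is not QlikView load-script syntax; the table names, field names, and both helper functions are hypothetical. The sketch links tables on identically named fields, then shows how qualifying every field except the intended key removes an accidental second link.

```python
from itertools import combinations


def qualify(table_name, fields, keep_unqualified=()):
    """Prefix field names with the table name, except the intended join keys."""
    return [f if f in keep_unqualified else f"{table_name}.{f}" for f in fields]


def shared_fields(tables):
    """Map each pair of tables to the field names they have in common."""
    links = {}
    for (a, fields_a), (b, fields_b) in combinations(tables.items(), 2):
        common = set(fields_a) & set(fields_b)
        if common:
            links[(a, b)] = common
    return links


# Hypothetical schema: the two "Date" fields mean different things.
raw = {
    "Orders": ["OrderID", "CustomerID", "Date"],
    "Shipments": ["ShipmentID", "OrderID", "Date"],
}
before = shared_fields(raw)

# Qualify everything except the key we actually want to join on.
qualified = {
    name: qualify(name, fields, keep_unqualified=("OrderID",))
    for name, fields in raw.items()
}
after = shared_fields(qualified)
```

Here `before` links the two tables on both `OrderID` and `Date`, while `after` links them on `OrderID` alone.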
QlikView is designed for gigabyte-scale databases. (More precisely, it's constrained by how much RAM you can address in a single box, and that's how the numbers currently work out.) In particular:
QlikTech recommends 2-4 gigabytes of compressed data per core. QlikTech says 10X is a good rule of thumb for compression, although it sounded like that's a little (not a lot) on the high side when compared simply to raw data.
QlikTech further recommends RAM amounting to another 10% of data size be set aside for each concurrent user (e.g., for cache). However, Hakan said that's really too pessimistic, and in most cases 5% would suffice.
Bottom line: QlikView "comfortably" handles databases with 10-20 gigabytes of compressed data, at whatever product of record count and record length you like. (E.g., 1 billion relatively narrow records.) That's on the order of 100 gigabytes of raw data.
Indeed, several QlikView customers manage several billion records each.
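The rules of thumb above reduce to simple arithmetic, sketched below in Python. The function and its parameter names are illustrative, and the 20 concurrent users are a made-up example. The sketch also assumes the per-user allowance is a fraction of the compressed, in-memory data size, which the figures above do not state explicitly.

```python
def estimate_sizing(raw_gb, concurrent_users, compression_ratio=10.0,
                    per_user_fraction=0.10, gb_per_core=(2.0, 4.0)):
    """Back-of-envelope sizing from the quoted rules of thumb."""
    compressed_gb = raw_gb / compression_ratio
    # Assumption: the per-user allowance applies to the compressed size.
    user_ram_gb = compressed_gb * per_user_fraction * concurrent_users
    return {
        "compressed_gb": compressed_gb,
        "min_cores": compressed_gb / gb_per_core[1],  # at 4 GB per core
        "max_cores": compressed_gb / gb_per_core[0],  # at 2 GB per core
        "user_ram_gb": user_ram_gb,
        "total_ram_gb": compressed_gb + user_ram_gb,
    }


# 100 GB raw, 20 concurrent users (hypothetical), at the recommended 10%
# and at the 5% figure Hakan suggested usually suffices.
pessimistic = estimate_sizing(raw_gb=100, concurrent_users=20)
typical = estimate_sizing(raw_gb=100, concurrent_users=20, per_user_fraction=0.05)
```

Under these assumptions, 100 GB of raw data compresses to about 10 GB, calls for roughly 2.5 to 5 cores, and needs about 30 GB of RAM in total at the 10% allowance or about 20 GB at 5%.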
The main ingredient of the performance secret sauce in QlikView is that selections are compiled straight into machine code. (QlikTech gave me the impression that this post is the first time that will be publicly revealed.) Notes on that include:
In the old days, QlikTech thought compilation gave a 10X performance benefit vs. interpreted code. However, 5X might be a more up-to-date figure.
It's not just code; part of the compilation is to create temporary lookup tables.
A single calculation can use multiple cores. QlikTech thinks it's done a very solid job of engineering efficient multicore parallelism. (Note: So far as I could tell, Hakan was using "calculation" to refer both to queries and, well, calculations.)
There's a good reason QlikView runs only on Intel-compatible processors. A port would be painful.
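Python cannot show machine-code generation, but it can illustrate one way a temporary lookup table might speed up a selection. The sketch below is purely an assumption about what such a table could look like, not a description of QlikView's internals. It evaluates the selection predicate once per distinct symbol, so the scan over the rows becomes a table lookup per row. The name `compile_selection` and the sample data are hypothetical.

```python
def compile_selection(symbols, predicate):
    """Evaluate `predicate` once per distinct symbol; return a scan function."""
    # Temporary lookup table: one byte per symbol code, 1 if selected.
    lookup = bytearray(1 if predicate(s) else 0 for s in symbols)

    def scan(codes):
        # The per-row work is now a single table lookup, not a string test.
        return [row for row, code in enumerate(codes) if lookup[code]]

    return scan


# Hypothetical dictionary-encoded column.
symbols = ["EMEA", "APAC", "AMER"]
codes = [0, 1, 0, 2, 1, 0]

select_a_regions = compile_selection(symbols, lambda s: s.startswith("A"))
selected_rows = select_a_regions(codes)
```

With three symbols and six rows, the predicate runs three times instead of six; on a column with few distinct values and many rows, the saving grows accordingly.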
In QlikView's world, one set of users accesses one set of applications against one database on one machine. However, different subsets (or copies of the same subset) of the same underlying database(s) can of course be run on different machines.
Naturally, QlikView caches results and tries to re-use them. One smart thing about QlikView's caching algorithm is that it takes into account the cost of generating the calculated results. This has the happy effect that large result sets, which are often the ones most likely to be useful in a subsequent calculation, are the ones most likely to be retained.
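A cost-aware eviction policy of this general kind can be sketched as follows. This is a deliberately simplified illustration, not QlikView's algorithm: the class name, the explicit `cost` argument, and the sample entries are all assumptions, and a real policy would presumably also weigh recency and memory footprint.

```python
class CostAwareCache:
    """Toy cache that evicts whichever result is cheapest to regenerate."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = {}  # key -> (value, cost to regenerate)

    def get(self, key):
        entry = self._entries.get(key)
        return None if entry is None else entry[0]

    def put(self, key, value, cost):
        self._entries[key] = (value, cost)
        # Over capacity: drop the entries that would be cheapest to recompute.
        while len(self._entries) > self.capacity:
            cheapest = min(self._entries, key=lambda k: self._entries[k][1])
            del self._entries[cheapest]


cache = CostAwareCache(capacity=2)
cache.put("small_lookup", "result A", cost=0.01)   # cheap to recompute
cache.put("big_aggregate", "result B", cost=5.0)   # expensive to recompute
cache.put("medium_query", "result C", cost=0.5)    # forces one eviction
```

After the third `put`, the cheap `small_lookup` entry is gone while the expensive `big_aggregate` survives, matching the behavior described above.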
One thing I unfortunately forgot to ask about is loading QlikView data into memory, something that has at times been problematic.
One last thing: QlikTech is going public. That means there is a QlikTech S-1, from which I learned, among other things, that QlikTech now seems to be called Qlik Technologies. Dave Kellogg offers an outstanding overview of the information in QlikTech's filing(s). The points I'd add to Dave's are primarily from the QlikTech balance sheet:
Deferred revenue, which Dave calls out as high in absolute terms, is also growing faster than revenue (or any major component of revenue).
Accounts receivable are also growing faster than revenue or any major component thereof.
One possible explanation is weirdness with international distributors, which is at least potentially consistent with what QlikTech says is a shift in geographical mix.
Another explanation is increasing deal size/complexity, something that is anyway common among enterprise software companies gaining market share, and that is also consistent with what QlikTech says is a growing fraction of revenue coming from existing customers.