Data visualization is undergoing a revolution, making complex data sets easier to understand and helping both experienced and inexperienced analysts form better conclusions and takeaways from those numbers.
A notable side effect of increased capabilities for data visualization is a push toward more complex modes of data collection and processing; if we’re able to understand complex data sets without needing substantial training or experience, we can apply those data processing standards to more areas.
Enter high dimensional analytics
In the era of big data, we’ve been able to collect and store more data points than ever before. Rather than relying on simple bits of information about key demographics and behaviors, we have access to hundreds, and sometimes thousands, of variables related to a given problem or outcome. For example, in medical research, characteristics including genetic predispositions, lifestyle factors, and demographic information may all play a role in whether a patient develops a condition (and how they respond to treatment). Each of these hundreds of variables may interact with any of the other variables, making it impractical to rely on simple correlational analysis of variable pairs or triplets.
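To get a feel for why pairs and triplets stop scaling, here is a minimal Python sketch that counts how many distinct variable groups there are to examine. The function name `interaction_count` and the variable counts are illustrative assumptions, not figures from any particular study.

```python
from math import comb

def interaction_count(n_variables: int, group_size: int) -> int:
    """Number of distinct groups of `group_size` variables among `n_variables`."""
    return comb(n_variables, group_size)

# Compare how the number of pairs and triplets grows with the variable count.
for n in (10, 100, 1000):
    pairs = interaction_count(n, 2)
    triplets = interaction_count(n, 3)
    print(f"{n} variables: {pairs} pairs, {triplets} triplets")
```

With 1,000 variables there are already 499,500 pairs and more than 166 million triplets to check, before considering any larger groupings.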
It's difficult to imagine anything in more than three dimensions, but for computers, it’s relatively easy. In physics and computer science, mathematical models can be used to make calculations in higher dimensions, sometimes hundreds of dimensions, allowing us to crunch the numbers and uncover patterns. There’s only one significant obstacle to making this practical: visualizing the results.
Visualizing high dimensions
The simplest model of data visualization is also the first one most of us are introduced to: the bar graph, in which one set of variables is plotted on the horizontal x-axis, and another is plotted on the vertical y-axis. This is highly effective, but only extends to two dimensions of data.
Researchers have developed multiple techniques to push the limits of what we can visualize, and most of them focus on reducing the number of presentable dimensions, in some way, to three or four. It’s exceedingly difficult for humans to think conceptually in dimensions beyond what we’re familiar with (three spatial dimensions and one time-like dimension), so the solution is to find a way to efficiently translate high-dimensional findings into those dimensions. Sometimes, that means using analytics to filter out the “noise” within the variables, reducing them to only what’s most important. Other times, that means clustering variables together.
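One widely used way to reduce dimensions is principal component analysis (PCA). The text above does not name a specific technique, so treat this as one possible sketch rather than the method any particular researcher uses. It assumes NumPy is available, and the synthetic data (50 variables driven by 3 hidden factors) is invented purely for illustration.

```python
import numpy as np

def pca_project(X, n_components=3):
    """Project the rows of X onto their top principal components.

    Returns the projected points and the fraction of total variance
    that each kept component retains.
    """
    X_centered = X - X.mean(axis=0)
    # SVD of the centered data: the rows of Vt are the principal directions,
    # ordered from most to least variance.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    projected = X_centered @ components.T
    explained = S**2 / np.sum(S**2)
    return projected, explained[:n_components]

# Synthetic example (an assumption): 200 samples, 50 observed variables
# generated from 3 hidden factors plus a small amount of noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 50))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 50))

projected, explained = pca_project(X, n_components=3)
print(projected.shape)   # three coordinates per sample, ready for a 3D plot
print(explained.sum())   # share of the original variance that survives
```

The retained-variance figure is worth reporting alongside any plot: whatever falls short of 1.0 is information the projection has discarded.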
So how do three- and four-dimensional projections work? In three dimensions, you can add a third axis, perpendicular to both x and y, known as the z-axis, to turn your graph into a three-dimensional representation. Virtual systems allow for more in-depth interaction with these projections, especially when you layer in elements of augmented reality, allowing participants to see individual data points in a three-dimensional cross section the way they might see fish in an aquarium. If you use the progression of time to layer in a fourth dimension, you can introduce even more complexity.
As an illustrative practical example, Google developers have used high dimensional analytics and visualization experimentally to “teach” a computer the meaning of language. Rather than giving the system any information about how words relate to each other, researchers “fed” it millions of examples of writing, and the system started mapping relationships in high dimensions to associate different types of words with one another. Researchers then used simplified three-dimensional models to visualize different areas of its findings, realizing it had successfully grouped words of similar meanings. For example, words that describe colors were grouped together, and words that describe numbers were grouped together.
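The idea of "grouping" words can be sketched with cosine similarity between word vectors. The four-dimensional vectors below are hand-made toy values chosen for illustration; they are not output from Google's system or any real model, where vectors typically have hundreds of dimensions.

```python
import math

# Toy word vectors (invented for illustration only).
TOY_VECTORS = {
    "red":   [0.9, 0.1, 0.0, 0.2],
    "blue":  [0.8, 0.2, 0.1, 0.1],
    "seven": [0.1, 0.9, 0.8, 0.0],
    "three": [0.0, 0.8, 0.9, 0.1],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def nearest_word(word, vectors=TOY_VECTORS):
    """Return the other word whose vector points in the most similar direction."""
    others = (w for w in vectors if w != word)
    return max(others, key=lambda w: cosine_similarity(vectors[word], vectors[w]))

for word in TOY_VECTORS:
    print(word, "->", nearest_word(word))
```

In this toy setup the color words end up nearest each other, as do the number words, which is the kind of grouping a reduced-dimension plot is meant to make visible.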
Challenges for high dimensional data visualization
Before you get too excited about being able to “see” how your customers change over time, or how productive your employees are, you should know there are some key limits and challenges for high dimensional data visualization:
- Dimension reduction. As most critical researchers are quick to point out, any visualization of high dimensional data currently requires some form of dimension reduction. Some are more efficient than others, but you’ll always lose some bit of information or data integrity when doing this.
- The curse of dimensionality. We’re so used to thinking in low dimensional space that we tend to neglect problems like the “curse of dimensionality.” Depending on who you ask, the term may refer to one of several different problems; by one definition, data become increasingly sparse as the number of dimensions grows, which lowers the statistical significance of findings.
- Overconfidence. Data visualization always comes with a risk of analyst overconfidence; it’s easy to let your instincts take over and assume the graphical representations are showing you all the right parts of the data, since you’re relying on your own ability to spot patterns. Important outliers are more difficult to detect, and you may become prone to problems like confirmation bias, especially if you’re in control of how those high dimensions are being reduced.
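The sparsity reading of the curse of dimensionality can be made concrete with a little arithmetic. This sketch assumes points spread uniformly over a unit hypercube and asks how many of them you would expect to find inside a sub-cube whose sides are half as long; the function name and the point count are illustrative assumptions.

```python
def expected_neighbors(n_points: int, dims: int, side: float = 0.5) -> float:
    """Expected number of uniformly scattered points inside a sub-cube.

    Assumes points are uniform over a unit hypercube and the neighborhood
    is an axis-aligned cube with the given side length, so it covers a
    fraction side**dims of the total volume.
    """
    return n_points * side ** dims

# The same 1,000 points thin out rapidly as dimensions are added.
for dims in (2, 3, 10, 100):
    print(dims, expected_neighbors(1000, dims))
```

In two dimensions the half-width region holds about 250 of the 1,000 points; in ten dimensions it holds fewer than one on average, so any local pattern is being estimated from almost no data.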
Still, high dimensional data is our greatest asset in learning from data sets with hundreds of variables (or more). Once we learn to visualize it effectively, we’ll be able to intuit conclusions far more easily and naturally.