The NSA And Big Data: What IT Can Learn
Enterprises can put the tools Big Brother uses to analyze our online activities to productive use. Here's how.
InformationWeek Green - July 22, 2013
Whatever your civil liberties stance, the technologies underpinning the National Security Agency's data collection and analysis programs, such as PRISM, spell opportunity for companies looking to connect a different set of dots -- identifying potential customers, spotting fraud or cybercrime in its early stages, or improving products and services.
The pillars of the NSA's architecture are big data systems, particularly a distributed data store called Accumulo, machine learning and natural language processing software, and scale-out cloud hardware (we delve into all three in much more depth in our full report).
>> Big data software: The NSA needed to store, process and analyze incredible amounts of both structured and unstructured data. It modeled its system on Google's BigTable architecture but found BigTable lacked features the agency needed, so it added them. BigTable's design provides only the data store; you still need to process, analyze and draw correlations across large, distributed data sets. For that, the NSA built on the open source Hadoop stack, producing Accumulo -- essentially a big data storage and application platform in a box, with some interesting new wrinkles. Ely Kahn, co-founder of Sqrrl, a startup he and some former NSA developers launched to commercialize Accumulo, says the agency had two main requirements that existing technology didn't meet: better security, via extremely granular access control, and massive scalability. The NSA's major contribution to the architecture is the stronger security, which in Accumulo takes the form of visibility labels attached to individual data elements -- cell-level access control, illustrated in the sketch below. On scalability, Kahn says people using HBase, a popular column-oriented NoSQL database that runs on top of HDFS, find it difficult to scale past a few hundred storage nodes. In contrast, Accumulo instances with thousands of nodes have been in production for years.
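Here's a minimal sketch of what that cell-level model looks like in Accumulo's Java client API. The instance, ZooKeeper host, table and credentials are invented for illustration; the point is that each individual value carries its own visibility expression, evaluated against a reader's authorizations at scan time.

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.ColumnVisibility;

public class VisibilityDemo {
    public static void main(String[] args) throws Exception {
        // Instance, host and credentials are placeholders for illustration.
        Connector conn = new ZooKeeperInstance("demo", "zk1:2181")
                .getConnector("analyst", new PasswordToken("secret"));

        BatchWriter writer = conn.createBatchWriter("records", new BatchWriterConfig());

        // Each cell carries its own visibility expression; only scanners
        // holding matching authorizations ever see the cell.
        Mutation m = new Mutation("record-001");
        m.put("meta", "source", new ColumnVisibility("analyst"),
                new Value("open".getBytes()));
        m.put("meta", "payload", new ColumnVisibility("analyst&ts"),
                new Value("sensitive".getBytes()));
        writer.addMutation(m);
        writer.close();
    }
}
```

Two cells in the same row end up with different effective audiences, which is exactly the granularity HBase and most relational systems don't offer natively.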
>> Machine learning and natural language processing: Another important piece of the NSA's technology arsenal is machine learning -- building adaptive, self-tuning systems that automatically evaluate incoming data to improve performance, update search queries, interpret ambiguous phrases or identify objects in digital images. Much like Google improves search results by recording metadata about past searches, clicked links and Gmail document text, the NSA's systems can apply context to textual analysis to determine whether the phrase "this plot will destroy him" refers to an assassination scheme or a critic's opinion that a lousy novel will ruin the author's reputation.
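The NSA's code for this isn't public, but the underlying statistical idea is standard: let the surrounding words supply the context. A toy sketch of that idea follows -- a naive Bayes bag-of-words classifier with invented training data -- showing how co-occurring words can tip "plot" toward conspiracy or literary criticism. Real systems use far richer features and vastly larger corpora, but the principle is the same.

```java
import java.util.*;

// A toy bag-of-words naive Bayes classifier. Training data is invented;
// this illustrates the general technique, not any NSA system.
public class ContextClassifier {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> labelCounts = new HashMap<>();
    private final Set<String> vocab = new HashSet<>();

    public void train(String label, String text) {
        labelCounts.merge(label, 1, Integer::sum);
        Map<String, Integer> counts =
                wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String w : text.toLowerCase().split("\\W+")) {
            counts.merge(w, 1, Integer::sum);
            vocab.add(w);
        }
    }

    public String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        int totalDocs = labelCounts.values().stream().mapToInt(Integer::intValue).sum();
        for (String label : labelCounts.keySet()) {
            // Log prior for the label, plus log likelihood of each word.
            double score = Math.log((double) labelCounts.get(label) / totalDocs);
            Map<String, Integer> counts = wordCounts.get(label);
            int total = counts.values().stream().mapToInt(Integer::intValue).sum();
            for (String w : text.toLowerCase().split("\\W+")) {
                // Laplace smoothing so unseen words don't zero out a label.
                score += Math.log((counts.getOrDefault(w, 0) + 1.0)
                        / (total + vocab.size()));
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }

    public static void main(String[] args) {
        ContextClassifier c = new ContextClassifier();
        c.train("threat", "the plot will destroy him attack planned target");
        c.train("review", "this novel plot is weak the author writes badly");
        System.out.println(c.classify("critics say the plot will destroy the author"));
    }
}
```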
KODA, one of the NSA's open projects, automatically creates summaries of large sets of textual data using only the text itself. Typical search engines and content management systems use dictionaries or standard text samples to search and categorize data, an approach that breaks down when applied to large volumes of information. KODA instead measures the similarity between passages of text and selects representative sentences as the summary, and it performs well on large data sets in many languages regardless of formatting.
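KODA's actual algorithm isn't published, but the description -- score passages by similarity, keep the most representative sentences -- matches classic extractive summarization. A minimal sketch under simple assumptions (term-frequency vectors, naive sentence splitting):

```java
import java.util.*;

// Minimal extractive summarizer: rank each sentence by cosine similarity
// between its term-frequency vector and the document's overall term
// profile, then keep the top k. Illustrates the general technique only.
public class ExtractiveSummary {
    static Map<String, Integer> termFreq(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) tf.merge(w, 1, Integer::sum);
        return tf;
    }

    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0);
            na += e.getValue() * e.getValue();
        }
        for (int v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static List<String> summarize(String document, int k) {
        Map<String, Integer> docTf = termFreq(document);
        String[] sentences = document.split("(?<=[.!?])\\s+");
        return Arrays.stream(sentences)
                .sorted(Comparator.comparingDouble(
                        (String s) -> cosine(termFreq(s), docTf)).reversed())
                .limit(k)
                .collect(java.util.stream.Collectors.toList());
    }

    public static void main(String[] args) {
        String doc = "Accumulo stores sorted key-value pairs. It adds cell-level "
                + "security to the BigTable design. Iterators run server-side. "
                + "The weather was pleasant that day.";
        System.out.println(summarize(doc, 2));
    }
}
```

A production summarizer would add stemming, stop-word handling and a redundancy penalty so the selected sentences don't all say the same thing, but even this skeleton needs no dictionary or reference corpus -- only the text itself, as the article describes.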
>> Hardware and networks: Just as the NSA re-created the big data analysis features of Google and Facebook in software, so too is it replicating their data storage and processing capabilities in hardware, notably a new cloud-scale data center under construction in Utah: a massive Tier III facility comprising 100,000 square feet of raised floor space in four data halls. Similarly state-of-the-art facilities are readily available to enterprises through a rich and competitive market for colocation space.
Lessons For Businesses
The NSA advances the state of the big data art in a few ways. Kahn says Sqrrl sees three primary customer segments for its NSA-inspired offerings: current Hadoop users, healthcare companies and highly security-conscious organizations; we discuss these use cases in our full report. Within those segments, here are some applications where NSA technology holds promise:
>> Real time versus batch analytics: Accumulo's server-side programming features let many analysis tasks operate continuously, providing real-time access to analytics -- something Hadoop-based systems can't deliver because of Hadoop's batch orientation. A quick word about Accumulo: Though it could displace Hadoop/HDFS while improving scalability, real-time performance and security, there are downsides. First, it's not widely used outside of government. Unless you use Sqrrl (a very young company), you must install it yourself from the open source repository, which demands significant expertise plus ongoing care and feeding. Unlike Hadoop/MapReduce and HDFS/HBase, no public cloud service yet offers Accumulo instances, though Amazon offers a tutorial. For applications, features like security labels and iterators require Accumulo-savvy developers -- a very small, and expensive, group (a minimal iterator sketch follows this item). And finally, don't underestimate the incentive for incumbents like Microsoft and Oracle to squash the technology.
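Iterators are what enable that continuous, server-side processing: small Java classes Accumulo runs on its tablet servers during scans and compactions, so data is filtered or aggregated where it lives rather than shipped raw to the client. A minimal sketch using Accumulo's Filter base class (the "keyword" option name is invented for illustration):

```java
import java.io.IOException;
import java.util.Map;

import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.iterators.Filter;
import org.apache.accumulo.core.iterators.IteratorEnvironment;
import org.apache.accumulo.core.iterators.SortedKeyValueIterator;

// A server-side filter that passes through only entries whose value
// contains a configurable keyword. Because it extends Accumulo's Filter
// base class, it runs on the tablet servers, not in the client.
public class KeywordFilter extends Filter {
    private String keyword = "";

    @Override
    public void init(SortedKeyValueIterator<Key, Value> source,
                     Map<String, String> options,
                     IteratorEnvironment env) throws IOException {
        super.init(source, options, env);
        if (options.containsKey("keyword")) keyword = options.get("keyword");
    }

    @Override
    public boolean accept(Key k, Value v) {
        return new String(v.get()).contains(keyword);
    }
}
```

A client would attach it to a scan with an IteratorSetting -- something like scanner.addScanIterator(new IteratorSetting(10, "kw", KeywordFilter.class)) -- and only matching entries ever cross the network.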
>> Data mining and graph analysis: Perhaps the primary application of the NSA's PRISM program is exposing social connections and communication paths among millions of separate data points. In fact, the NSA used graph analysis earlier this year to showcase Accumulo's speed and scalability, describing a system running a standard graph benchmark with 4.4 trillion nodes and 70 trillion edges; by comparison, Facebook's Graph Search currently scales to hundreds of billions of nodes and trillions of edges. For enterprises, the advantages of doing data mining and graph analysis in a system like Accumulo are twofold: speed, because server-side processing can pre-cache common operations, and security that is much more granular, down to the query level.
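Graph data maps naturally onto Accumulo's sorted key-value model. One common layout -- a sketch under assumed table and credential names, not a prescribed schema -- keeps one row per source vertex and one column qualifier per destination, so a vertex's full adjacency list comes back from a single row scan, with the scanner's authorizations checked against each cell's visibility label:

```java
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Range;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

// Read the outgoing edges of one vertex from an edge table laid out as
// row = source vertex, column qualifier = destination vertex.
// Instance, table and credentials are invented for illustration.
public class NeighborScan {
    public static void main(String[] args) throws Exception {
        Connector conn = new ZooKeeperInstance("demo", "zk1:2181")
                .getConnector("analyst", new PasswordToken("secret"));

        // The authorizations passed here are matched against each cell's
        // visibility label -- the per-query access control noted above.
        Scanner scan = conn.createScanner("edges", new Authorizations("analyst"));
        scan.setRange(new Range("vertex-42")); // all edges out of vertex-42

        for (Entry<Key, Value> e : scan) {
            System.out.println("neighbor: " + e.getKey().getColumnQualifier()
                    + " weight: " + e.getValue());
        }
    }
}
```

Because rows are kept sorted, this lookup stays a cheap range scan no matter how many trillions of edges the table holds, which is what the benchmark numbers above are really demonstrating.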
>> Predictive analytics: Companies want to use vast quantities of data from multiple sources -- sales transactions, customer loyalty cards, economic data, Twitter comments, even the weather -- to model and predict events such as product demand or customer interest in a promotion. In such predictive uses, the more data and the faster the processing, the better. Thus, the scalability and granular access control of the NSA's technology make it a promising platform, the caveats above aside. Kahn also sees interest in predictive analytics for security, since traditional SIEM and security forensics software can't handle the volume, diversity and complexity of data needed to identify new threats and incidents before they do damage.
Finally, there's the Internet of things. Accumulo's scalability and real-time filtering and categorizing features make it a promising data repository for telemetry from millions of physical objects, with applications allowing interactive data mining across varied object and customer types. We expect early IoT adopters, which likely can afford to hire the needed expertise, will find Accumulo an excellent repository and application platform.
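One way such a telemetry repository might be keyed -- a sketch under assumed names, not a recommended schema -- is device ID plus a zero-padded reversed timestamp, so each device's newest readings sort first and a simple prefix scan returns them in order:

```java
import org.apache.accumulo.core.client.BatchWriter;
import org.apache.accumulo.core.client.BatchWriterConfig;
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Mutation;
import org.apache.accumulo.core.data.Value;

// Write one sensor reading into a time-series table. Instance, table
// and credential names are invented for illustration.
public class TelemetryWriter {
    public static void main(String[] args) throws Exception {
        Connector conn = new ZooKeeperInstance("demo", "zk1:2181")
                .getConnector("ingest", new PasswordToken("secret"));
        BatchWriter writer = conn.createBatchWriter("telemetry", new BatchWriterConfig());

        long ts = System.currentTimeMillis();
        // Reversed, zero-padded timestamp: newest entries sort
        // lexicographically first within each device's key range.
        String row = String.format("sensor-007:%019d", Long.MAX_VALUE - ts);

        Mutation m = new Mutation(row);
        m.put("reading", "temperature", new Value("21.4".getBytes()));
        writer.addMutation(m);
        writer.close();
    }
}
```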