Open Source: Key To Big Data Riches?

Big data makes a company smart, but open source makes it rich, a Gartner report contends.

Even though big data is approaching the peak of its hype cycle, there is value to be had from it immediately, according to a new Gartner report.

Big data has high potential for delivering short-term value by identifying customers' behavior or attitudes. Its long-term value, however, rests on the ability and willingness of both vendors and users to open heavily processed, manipulated big data stores to tools other than those designed specifically for high-intensity analytics, according to David Newman, research VP at Gartner.

Big data is an imprecise term describing the amalgamation of many types of data from many sources--often into databases that push the upper boundaries of the hardware and software assigned to manage them.

The leading software framework used to store and process big data--Hadoop--is already open source, as are many of the tools purpose-built to add new functions.

The data sets themselves are a different story, however.

Building a useful base of big data is an extremely complex process. It requires carefully selecting data sources and parsing them to pick only the data that is appropriate in date, content, and context. An even more rigorous series of efforts follows: removing duplicate, corrupt, or inappropriate data, converting the remainder to a common format, and storing the lot with a database manager able to handle the volume, variety, and occasional conflicts among insufficiently processed bits.
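The culling steps described above can be pictured with a minimal Python sketch. It is an illustration only, not a method drawn from the Gartner report or from any vendor: the sample records, field names, accepted date formats, and cutoff date are all hypothetical.

```python
from datetime import date, datetime

# Hypothetical raw records pulled from differently formatted sources.
RAW = [
    {"user": "Alice", "ts": "2013-06-01", "text": "Great product"},
    {"user": "alice", "ts": "2013-06-01", "text": "Great product"},  # duplicate once normalized
    {"user": "bob", "ts": "01/07/2013", "text": "Slow shipping"},    # day/month/year format
    {"user": "carol", "ts": "not-a-date", "text": "???"},            # corrupt timestamp
    {"user": "dave", "ts": "2009-01-15", "text": "Old comment"},     # too old to be relevant
    {"user": "", "ts": "2013-06-02", "text": "No user"},             # missing field
]

# Assumed input formats; a real project would have many more.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def parse_date(value):
    """Return a date for the first format that matches, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    return None


def clean(records, earliest):
    """Drop corrupt, out-of-window, and duplicate records; emit one common format."""
    seen = set()
    cleaned = []
    for rec in records:
        user = rec.get("user", "").strip().lower()
        text = rec.get("text", "").strip()
        day = parse_date(rec.get("ts", ""))
        if not user or not text or day is None:
            continue  # corrupt or incomplete
        if day < earliest:
            continue  # not appropriate in date
        key = (user, day, text)
        if key in seen:
            continue  # duplicate
        seen.add(key)
        cleaned.append({"user": user, "date": day.isoformat(), "text": text})
    return cleaned


result = clean(RAW, earliest=date(2013, 1, 1))
```

Of the six sample records, only two survive, which hints at why the cleaned set is worth far more than the raw feed it came from.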

"There are obvious integration issues when you're taking data from server logs and social networks and other non-standard sources, especially human-generated content," according to Mike Boyarski, director of product marketing for big data tools vendor Jaspersoft. "You have to be able to cull through the data and not create changes based on your cull, and you've got to prove the data are still correct and relevant. You need more than just the ability to collect data cheaply."

Once all that work is done and the big data set has answered the questions at hand, however, both the culled data and the work that went into it go to waste if the data can be used only for that one purpose.

The cost of big data makes the most sense when its architects are able to use publicly available APIs, data conversion utilities, or common data and query formats to pull in additional data, transfer culled and cleaned data to a data broker that can pay for the privilege, or give employees access to the data through existing analytics or business intelligence applications.

"There is a positive relationship between the openness of information goods (for example, code, data, content, and standards) and information services (for example, services that offer information goods, such as the Internet, Wikipedia, OpenStreetMap and GPS) and the size and diversity of the community sharing them," according to the Gartner report. "From the viewpoint of enterprise information architects, this is known as the information-sharing network effect: the business value of a data asset increases the more widely and easily it is shared."

The primary method Gartner analysts recommend for companies wanting to share big data datasets or answers is the open API--a set of programming interfaces based on either an API set made available to customers of an enterprise application vendor, or a set of interfaces developed specifically to open source corporate data projects.
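As a rough picture of what such an interface might look like, the Python sketch below shows the request-handling core of a read-only API over an already cleaned data set. Everything in it is an assumption made for illustration: the `/v1/records` path, the `user` query parameter, and the sample data do not come from Gartner or Jaspersoft.

```python
import json
from urllib.parse import urlparse, parse_qs

# Hypothetical cleaned data set that a company might choose to expose.
DATASET = [
    {"user": "alice", "date": "2013-06-01", "text": "Great product"},
    {"user": "bob", "date": "2013-07-01", "text": "Slow shipping"},
]


def handle_get(url):
    """Map a GET request URL to a (status code, JSON body) pair."""
    parsed = urlparse(url)
    if parsed.path != "/v1/records":  # hypothetical endpoint name
        return 404, json.dumps({"error": "not found"})
    params = parse_qs(parsed.query)
    user = params.get("user", [None])[0]  # optional filter
    rows = [r for r in DATASET if user is None or r["user"] == user]
    return 200, json.dumps({"count": len(rows), "records": rows})


status, body = handle_get("/v1/records?user=bob")
```

Because the response is plain JSON over a documented path, any BI tool, script, or partner system that speaks HTTP could consume it; the function would still need to be wired to a web server and given authentication before real use.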

"The challenge for organizations is to determine how best to use APIs and how an open data strategy should align with business priorities," Newman said.

One additional tip about making money from big data as well as simply "getting smart," as Gartner's report puts it: Big data projects are difficult and expensive, so it makes sense to choose tools based on cost as well as functionality.

"Teams should use low-cost, open-source tools in early pilots to demonstrate the feasibility of big data projects," according to an April big data report, which predicted the best-practice habits established by enterprise architects could be the best investments for any company leaping into big data.

Open source tools tend to be less expensive upfront, have quicker and more ambitious roadmaps, and reflect (much more closely than other indicators) the real needs of the developers who use them, Boyarski said.

They also tend to preserve the value of both data and applications in the long run specifically because they don't trap either one in enterprise applications that rely on proprietary or hard-to-use APIs and data formats to keep users loyal to a single product set, according to Boyarski.

The community of open source users is much larger than Jaspersoft anticipated--more than a quarter million potential customers have downloaded various big data connectors from Jaspersoft's site.

Open source is also more secure and, in the long run, more useful, Boyarski said. The people using the tools are often the ones building new features, reports, search algorithms, or other additions that enhance the value of the original software. Those additions make it possible to move apps, data, and reports out of kludged-together big data analytics frameworks and into brand new, high-function analytics.

"After putting in all that work, the last thing you want is to have something trapped where you can't move it," Boyarski said.
