7 Keys To Building A Successful Big Data Infrastructure
The infrastructure you build for big data, whether you're looking at software or hardware, will have a huge impact on the analysis and action your big data systems will support. Here are 7 factors that can make a big difference when building your big data architecture.
![](https://eu-images.contentstack.com/v3/assets/blt69509c9116440be8/bltb7e983332f41f154/64cb423d28937271cb61d924/Big-Data_Maxiphoto_iStock_84982463_MEDIUM.png?width=700&auto=webp&quality=80&disable=upscale)
Big data is a big part of many enterprise IT operations today. According to an IDC forecast, it will be a $187 billion market by 2019. It's a critical part of the analysis that forms the basis of both machine and human business intelligence and decision-making. Since it's obvious that you can't have any sort of data -- big, little, or exactly right -- without some sort of infrastructure, it's worth taking a look at some of the factors that go into building a successful big data architecture.
I decided to take a look at seven factors that can make a big difference in the effectiveness of your big data infrastructure. Some might seem obvious, while others are a bit more subtle. In practice, all will work together to have a huge impact on the analysis and action your big data systems will support.
It's not that these seven factors are the only things that have an impact on the way your big data infrastructure will work. Making big data work for an enterprise is complex.
There are scores, if not hundreds, of bits and pieces that go into a big data system -- any one of which can end up having a large impact on the work data scientists can do. But these seven deserve your consideration because they underlie so many other pieces and processes.
At this point, it's likely that you're involved with big data, even if you work in a small company. That's part of the power of the infrastructure pieces now available -- many of them are accessible to even the smallest IT operations.
With that accessibility comes the possibility of confusion and frustration for those smaller staffs that might not have data science expertise on board. If you're in that position, this list won't relieve all your confusion, but it might provide a place to start asking some pointed questions of potential service providers and suppliers.
If you're involved in a big data project, I'd love to hear from you regarding the infrastructure choices you've made. What do you think of this list? Is there something you'd swap in, or should the entire list be tossed out and started again from scratch? I'll be hanging out in the comments to see what you have to say.
In casual conversation, big data and Hadoop are often used almost interchangeably. That's unfortunate because big data is much more than Hadoop. At its core, Hadoop is a distributed file system (not a database) that's designed to spread data across hundreds or thousands of processing nodes, along with a framework for processing that data where it sits. It is used in a lot of big data applications because, as a file system, it's great at dealing with data that isn't structured -- that doesn't even look like the data around it. Of course, some big data is structured, and for that you'll want a database. But that's a different item on the list.
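To make "spread data across processing nodes" a little more concrete, here is a toy sketch of splitting a blob of bytes into fixed-size blocks and assigning each block to more than one node. It is an illustration only: the function names, the tiny block size, and the round-robin placement are assumptions made for the example, not how Hadoop actually decides where blocks go.

```python
def place_blocks(data, nodes, block_size=8, replicas=2):
    """Split `data` into blocks and assign each block to `replicas` distinct nodes.

    Hypothetical helper: real distributed file systems use far larger blocks
    and smarter placement than this round-robin scheme.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = []
    for index, offset in enumerate(range(0, len(data), block_size)):
        block = data[offset:offset + block_size]
        # Consecutive nodes, wrapping around, so each copy lands on a different node.
        owners = [nodes[(index + r) % len(nodes)] for r in range(replicas)]
        placement.append((index, block, owners))
    return placement


def read_back(placement):
    """Reassemble the original bytes from the blocks, in block order."""
    ordered = sorted(placement, key=lambda entry: entry[0])
    return b"".join(block for _, block, _ in ordered)


if __name__ == "__main__":
    layout = place_blocks(b"unstructured bytes of any shape", ["node-a", "node-b", "node-c"])
    for index, block, owners in layout:
        print(index, block, owners)
```

Notice that the file system never needs to know what the bytes mean, which is why this approach copes so well with data that has no fixed structure.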
Ahh, the database for the structured part of your big data world. This can get a little confusing, so hang on. If you want to bring some order to your Hadoop data platform, then Hive can be the ticket. It's an infrastructure tool that allows you to do SQL-like things to the very un-SQL Hadoop.
If some part of your data fits easily into a structured form, then Impala is a SQL query engine designed to live within Hadoop -- and it can also make use of the Hive queries and table definitions you might have developed on the journey from Hadoop to SQL. All three of these (Hadoop, Hive, and Impala) are Apache projects, so they're open source. Have fun.
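The idea of doing "SQL-like things" to data that started life as loosely formatted files can be sketched in a few lines. The example below uses Python's built-in sqlite3 module purely as a stand-in; it is not Hive or Impala, and the log format, table name, and helper functions are all invented for illustration.

```python
import sqlite3

# Hypothetical raw lines, the kind of thing that might sit in files on a cluster.
RAW_LINES = [
    "2016-03-01 store=12 sku=A100 amount=19.99",
    "2016-03-01 store=12 sku=B200 amount=5.00",
    "2016-03-02 store=7 sku=A100 amount=19.99",
]


def parse(line):
    """Turn one raw line into a (day, store, sku, amount) row."""
    day, *pairs = line.split()
    fields = dict(pair.split("=", 1) for pair in pairs)
    return day, int(fields["store"]), fields["sku"], float(fields["amount"])


def load(lines):
    """Impose a table structure on the raw lines so they can be queried with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (day TEXT, store INTEGER, sku TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", [parse(l) for l in lines])
    return conn


def revenue_by_sku(conn):
    """An ordinary SQL aggregate over data that began as plain text."""
    cursor = conn.execute(
        "SELECT sku, ROUND(SUM(amount), 2) FROM sales GROUP BY sku ORDER BY sku"
    )
    return cursor.fetchall()


if __name__ == "__main__":
    print(revenue_by_sku(load(RAW_LINES)))
```

The point is the division of labor: a thin layer describes the structure, and everyone downstream gets to write familiar SQL.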
So far, we've been talking about storing and organizing data. But what about when you want to actually do something with the data? That's when you need an analytical and processing engine like Spark. Spark is yet another Apache project, and here it stands in for a bunch of open source and commercial products that will take the data you shoved into your lakes, warehouses, and databases and do something useful with it.
Spark can be used on all kinds of data stored in all sorts of places because of the libraries that give it access to almost anything you can imagine. Once again, it's open source, so you're free to modify it to your heart's content.
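For a rough feel of what a processing engine does with data once it has been stored, here is a toy version of the split-process-combine pattern: cut the records into partitions, compute a partial result on each, then merge the partials. This is plain Python, not Spark's API, and the function names and partition count are assumptions for the example. In a real engine, each partition would be handled by a different worker in parallel.

```python
from collections import Counter
from functools import reduce


def partition(records, n):
    """Deal the records out into n partitions, like cards around a table."""
    return [records[i::n] for i in range(n)]


def count_words(lines):
    """Per-partition step: count the words in one partition's lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def word_count(records, n_partitions=3):
    """Combine step: merge the partial counts from every partition."""
    partials = [count_words(p) for p in partition(records, n_partitions)]
    return reduce(lambda left, right: left + right, partials, Counter())


if __name__ == "__main__":
    print(word_count(["big data", "Big deal", "data data"]))
```

Because the partial counts can be merged in any order, the answer doesn't depend on how many partitions you choose, which is what lets this kind of work scale out across machines.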
Lots and lots of people know how to build SQL databases and write SQL queries. That expertise doesn't have to go to waste when the playing field moves to big data. Presto is an open source SQL query engine that allows data scientists to use SQL queries to interrogate data that lives in everything from Hive to proprietary commercial database management systems. It's used by little companies like Facebook for interactive queries, and that phrase is key. Think of Presto as a tool for doing ad hoc, interactive queries on enormous data sets.
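The appeal of one SQL engine sitting in front of several data stores is easier to see with a small example. The sketch below uses Python's built-in sqlite3 module and its ATTACH DATABASE feature as a stand-in for that idea: two separate databases, one ad hoc query that joins across them. It is not Presto, and the database names, tables, and rows are made up for illustration.

```python
import sqlite3


def build_sources():
    """Create two separate in-memory databases reachable from one connection."""
    conn = sqlite3.connect(":memory:")
    # A second, independent database attached under the hypothetical name "crm".
    conn.execute("ATTACH DATABASE ':memory:' AS crm")
    conn.execute("CREATE TABLE main.clicks (customer_id INTEGER, page TEXT)")
    conn.execute("CREATE TABLE crm.customers (customer_id INTEGER, region TEXT)")
    conn.executemany(
        "INSERT INTO main.clicks VALUES (?, ?)",
        [(1, "home"), (1, "pricing"), (2, "home"), (3, "pricing")],
    )
    conn.executemany(
        "INSERT INTO crm.customers VALUES (?, ?)",
        [(1, "EMEA"), (2, "APAC"), (3, "EMEA")],
    )
    return conn


def clicks_by_region(conn):
    """One ad hoc query that joins tables living in two different databases."""
    sql = """
        SELECT c.region, COUNT(*) AS clicks
        FROM main.clicks AS k
        JOIN crm.customers AS c ON c.customer_id = k.customer_id
        GROUP BY c.region
        ORDER BY c.region
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    print(clicks_by_region(build_sources()))
```

The person writing the query only needs to know SQL and the table names, not where each table physically lives.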
There are some tasks within big data that involve rapidly changing data. Sometimes, this is data that is being added to on a regular basis, and sometimes it's data that is changed through the analysis. In either case, if your data is being written as often as it's being read, then you want that data on-premises and online. If you can afford it, you want it on solid-state storage, too, because that will speed things up considerably -- a non-trivial consideration when you have people on a retail or trading floor tapping their feet while waiting for answers to come back.
When the analysis is taking place on larger, aggregated databases for which you're building big, batch-oriented routines, then the cloud can be perfect. Aggregate and transfer the data to the cloud, run the analysis, and then tear down the instance. It's exactly the sort of elastic demand response the cloud does so well. The operations won't be affected significantly by any latency issues the internet might introduce. When you combine the real-time analysis that takes place on dedicated on-premises systems with deep analytical runs in the cloud, you're getting close to realizing the full potential of a big data infrastructure.
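The rule of thumb in the last two items can be written down as a small decision function. Treat it as a conversation starter rather than a sizing tool: the `Workload` fields, the tier names, and the 0.5 write-ratio threshold are assumptions chosen to mirror "written as often as it's being read," not figures from any vendor or benchmark.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    reads_per_hour: int
    writes_per_hour: int
    interactive: bool  # True when people are waiting on each answer


def choose_tier(workload, write_ratio_threshold=0.5):
    """Suggest where a workload belongs, using the article's rule of thumb.

    Hypothetical heuristic: write-heavy or interactive work stays on-premises;
    read-mostly batch work can go to the cloud.
    """
    total = workload.reads_per_hour + workload.writes_per_hour
    write_ratio = workload.writes_per_hour / total if total else 0.0
    if workload.interactive or write_ratio >= write_ratio_threshold:
        return "on-premises"
    return "cloud-batch"


if __name__ == "__main__":
    for w in [
        Workload("trading-floor ticks", 50_000, 50_000, interactive=True),
        Workload("quarterly aggregate run", 1_000_000, 1_000, interactive=False),
    ]:
        print(w.name, "->", choose_tier(w))
```

Even a crude rule like this is useful when talking to service providers, because it forces the question of which workloads actually have someone waiting on the answer.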
It's one thing to analyze big data. It's quite another to present the analysis in a way that makes sense to most human beings. A picture can help quite a lot with the whole "making sense" thing, and so data visualization should be considered a critical part of your big data infrastructure.
Fortunately, there are a lot of ways to make great images happen, from JavaScript libraries, to commercial visualization packages, to online services. What's the most important point? Pick a handful of these, try them, and let your users try them. You'll find that solid visualization is the best way to make your big data analysis as valuable as possible.
There they are -- seven things you should know and keep in mind as you work with big data in your organization. I'll look forward to hearing your key points -- and seeing your suggestions on the tools, processes, and services that no big data user should have to do without.