So you've done your homework, launched a prototype with Hadoop and found it good. Now the real fun starts.
Bringing a proof-of-concept project into production is only the beginning. Once in production, Hadoop behaves very differently from most other enterprise software. Deploy SAP or Salesforce, for example, and going live typically means a shift into a lower-intensity "maintenance" mode that requires less attention and fewer resources. With Hadoop, in contrast, the first production application tends to generate more work, not less. Trust me: Pressure will soon mount to develop new applications, those applications will require integration with new data sources, and your users will want to run more and more exploratory jobs.
In companies experiencing this kind of "success disaster" with Hadoop, keeping up with demand for expansion and new use cases often requires more effort than getting the initial application into production.
IT managers have many areas to watch to keep a Hadoop initiative successful, but here are five challenges you should address proactively:
1. Keeping your software up to date: Hadoop is a rapidly evolving framework. Unfortunately, updating Hadoop software is challenging, especially on heavily used clusters. As a result, many organizations get stuck on a 3-year-old version and, before they know it, even thinking about an upgrade is a huge effort. Challenging or not, it's worth instituting a program of regular, incremental updates to the Hadoop software stack. To make these updates possible, establish a regular maintenance window for the cluster. Yes, the concept of a maintenance window feels retrograde to many IT organizations, but it's preferable to falling behind the fast-moving Hadoop ecosystem.
2. Scaling your cluster: Going from a half-rack to a full one brings one set of challenges; expanding from one rack to two brings different trials; going from two racks to four ... you get the idea. Each time you grow your cluster, there are new issues. Fortunately, Hadoop scales relatively easily, and it comes with built-in tools for common tasks like rebalancing data across the cluster. Still, the logistics of expanding the physical infrastructure can be thorny. As your cluster grows, new tuning settings are required, and problems that used to be rare, like failed disks, start to occur regularly. Critical Hadoop software services, such as your NameNode and ResourceManager, may need to be upgraded as well. Unfortunately, there's no silver bullet for addressing these problems. The best approach is to get ahead of the curve -- plan for expansion well before it becomes critical. One way to achieve this is to add a bit of capacity every quarter or even every month, on a regular schedule.
3. Getting your security in order: In a successful Hadoop deployment, you'll find more and more users wanting access to the cluster and a corresponding demand for more and more data. You may soon outgrow the simple security and compliance mechanisms that were adequate in the early days and instead be pulled into a world of substantial complexity. Most Hadoop implementations start by using Hadoop's default security mechanisms, which provide no substantive user authentication. This may be OK initially, but over time you'll need to switch to the strong authentication provided by Kerberos. Most organizations wait too long to make this switch and instead bolt on workaround measures that reduce productivity and will eventually need to be thrown away. Instead, make the switch as soon as you can: Move early and "learn as you grow" with Kerberos.
4. Supporting your users: The devil is in the details, and Hadoop has a lot of details. While Hadoop brings unprecedented power to the fingertips of your employees, it's a rather rough system to use, as you might expect from a system with its roots in the Wild West of Silicon Valley hackers. When a job fails, it can be difficult to tell if the problem is in a user's application code or in Hadoop itself. Your developers and data scientists can waste valuable time trying to resolve arcane problems that have been solved already. Consider creating a user support system, backed by a shared knowledge base, that encourages your community of developers, data scientists and Hadoop administrators to help one another get past the rough edges of Hadoop.
5. Keeping tabs on technology: The ecosystem surrounding Hadoop involves more than 15 open source projects, and that ecosystem is evolving rapidly. There's a constant flow of innovation, changes and updates that may impact productivity and ROI. Before deploying any new component, even for a quick evaluation, investigate its track record. Has it stayed current with the latest Hadoop release? Are there sufficient developers committed to the project? You need to be sure that slow-moving components don't prevent you from keeping your core Hadoop software updated.
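The advice in challenge 2 to plan for expansion before it becomes critical comes down to simple arithmetic. Here is a rough sketch of that calculation, not a sizing tool: the function name, the 80 percent utilization ceiling and every figure in it are assumptions made up for illustration, and it models nothing more than steady compound growth in stored data.

```python
def months_until_full(used_tb, capacity_tb, monthly_growth, ceiling=0.80):
    """Count whole months before usage would pass ceiling * capacity.

    Assumes stored data compounds at a fixed monthly rate, which is a
    simplification. All names and defaults here are hypothetical.
    """
    if used_tb <= 0 or capacity_tb <= 0 or monthly_growth <= 0:
        raise ValueError("inputs must be positive")
    limit = capacity_tb * ceiling
    months = 0
    # Step forward one month at a time while the next month still fits.
    while used_tb * (1 + monthly_growth) <= limit:
        used_tb *= 1 + monthly_growth
        months += 1
    return months


# Hypothetical cluster: 100 TB used of 500 TB, growing 10 percent a month.
print(months_until_full(100, 500, 0.10))
```

Run against your own usage figures each month, a number like this shows when the next scheduled capacity addition has to land, well before the cluster is actually full.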
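For challenge 3, one quick way to see where a cluster stands is to read the hadoop.security.authentication property in core-site.xml, which is "simple" by default and "kerberos" once strong authentication is enabled. The sketch below only inspects that one setting; the function name and the sample XML are hypothetical, and a real Kerberos rollout involves far more than flipping this value.

```python
import xml.etree.ElementTree as ET

AUTH_PROPERTY = "hadoop.security.authentication"


def authentication_mode(core_site_xml):
    """Return the configured authentication mode, or 'simple' if unset."""
    root = ET.fromstring(core_site_xml)
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        if name == AUTH_PROPERTY:
            value = (prop.findtext("value") or "").strip().lower()
            return value or "simple"
    # Hadoop falls back to simple authentication when the property is absent.
    return "simple"


# Hypothetical core-site.xml contents, for illustration only.
SAMPLE = """<configuration>
  <property>
    <name>hadoop.security.authentication</name>
    <value>kerberos</value>
  </property>
</configuration>"""

print(authentication_mode(SAMPLE))
```

A check like this could run as part of routine cluster audits, so that a cluster still on simple authentication is flagged early rather than after workarounds have piled up.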
Hadoop in its postproduction phase can be challenging. Its promiscuous nature gives it a powerful ability to tie disparate systems together and handle all kinds of data -- and that tends to make it a hub of activity for data scientists, software developers and system administrators. Paying attention to these five challenges will take you a good way toward reaping the benefits of that ability.
Tell us your tips and tricks for keeping Hadoop scaled, secure and up to date.