News
11/21/2006
08:00 PM

New Technology Seeks To Let Startups Build Their Own Googles

Open source search projects such as Hadoop, Lucene, and Nutch, combined with affordable, on-demand computing through Amazon Web Services, are putting scalable search infrastructure within the reach of most startups.

One of the first questions online startups typically face these days from potential investors is "Why couldn't Google build this?" Entrepreneurs are beginning to respond, "Why couldn't we build Google?"

The slow but steady maturation of open source search projects like Hadoop, Lucene, and Nutch, combined with the availability of affordable, on-demand computing through Amazon Web Services, suggests that scalable search infrastructure is well within the reach of most startups.

Hadoop is a framework for running applications on clusters of commodity hardware. It reproduces the functions of the distributed Google File System and of MapReduce, Google's programming model for processing large data sets. Lucene is a Java-based search and indexing library. Nutch expands on Lucene by adding Web crawling and additional search capabilities.
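To make the division of labor concrete, here is a minimal single-machine sketch in Python of the MapReduce pattern applied to the kind of job a search engine runs: building an inverted index that maps each term to the documents containing it. This is not Hadoop's or Lucene's actual API. The function names (`map_phase`, `shuffle`, `reduce_phase`, `build_index`) and the sample documents are hypothetical, chosen only for illustration, and a real cluster would spread each phase across many machines.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Emit one (term, doc_id) pair per word, as a mapper would."""
    for word in text.lower().split():
        yield word, doc_id


def shuffle(pairs):
    """Group emitted values by key, the step a framework performs
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(term, doc_ids):
    """Collapse the grouped values into a sorted, de-duplicated posting list."""
    return term, sorted(set(doc_ids))


def build_index(docs):
    """Run map, shuffle, and reduce in sequence on one machine (illustrative only)."""
    pairs = (pair for doc_id, text in docs.items()
             for pair in map_phase(doc_id, text))
    return dict(reduce_phase(term, ids) for term, ids in shuffle(pairs).items())


# Hypothetical sample documents, keyed by document ID.
docs = {
    1: "open source search",
    2: "search for source code",
}
index = build_index(docs)
```

Looking up `index["search"]` returns the IDs of both sample documents, which is roughly the lookup a search index answers at query time. Because each mapper sees only one document and each reducer only one term, the work can in principle be split across many inexpensive machines.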

These open source search projects already are in use at companies and organizations such as Krugle, Powerset, Wikipedia, and Zimbra.

Krugle, a search engine for programmers that helps users find code and technical information online, is built on Nutch and Lucene. "It would have been impossible for us to create the capability that we have and go live in the speed that we did without Nutch and Lucene," says Krugle CEO Steve Larsen. "They were extremely important to us being able to solve the technical problems that we did in a short amount of time."

Access to the code also was important, says CTO Ken Krugler, "so we had the flexibility for the things that we needed for a vertical solution. The commercial solutions are much more restrictive. It's harder to tweak it and form it to what you need."

Krugle maintains about 100 servers at a colocation facility. Krugler says Amazon's Elastic Compute Cloud looks promising, but he sees it more for companies that are just getting started. The cloud, also referred to as EC2, is simply virtual processing power that can be paid for as needed.

"It scales better than doing a co-host setup," says Krugler, though he still considers it too new to rely on. "Technically it ought to scale, but you just don't know."

Search startup Powerset is using EC2 to power its forthcoming natural language search site, apparently without any such reservations.
