If you want to get your hands on an open source version of some of Google's core technologies, maybe you should ask Yahoo.
Yahoo has emerged as a major sponsor of Hadoop, an open source project that aims to replicate Google's techniques for storing and processing large amounts of data distributed across hundreds or thousands of commodity PCs (see Baseline's report: How Google Works). Last year, Hadoop project founder Doug Cutting became a Yahoo employee, and at July's Oscon open source conference he and Yahoo's director of grid computing Eric Baldeschwieler detailed how they are applying the technology.
Cutting, formerly of Excite and Xerox PARC, has founded or co-founded a series of projects related to creating an open source platform for search under the banner of the Apache Software Foundation. His work on Lucene (a Java software library for Web indexing and search) and Nutch (a search engine application that builds on Lucene) led to Hadoop, which started as a Nutch sub-project aimed at efficiently spreading the workload for compiling a search index across multiple computers. Since he doesn't work in a Yahoo office, Cutting says his employment is really more like being paid a salary to work full-time on his Apache projects and help Yahoo work efficiently with the open source community. On the other hand, he does work with Yahoo to get the most out of the technology.
The basic technique Hadoop uses is part of what has allowed Google to manage the massive data processing challenges associated with indexing the Web—and do it economically. Google has not released source code for its Google File System or the associated distributed computing environment, known as MapReduce. But what Google has done is publish academic papers on the computer science behind both—presumably knowing full well that competitors and open source programmers would be likely to create their own implementations.
In addition to giving a presentation on Hadoop at Oscon, Cutting participated in a panel discussion on new system programming and architecture techniques moderated by O'Reilly Media CEO Tim O'Reilly. While Cutting declined to speculate on Yahoo's motives for backing the project, O'Reilly called it an example of open source being "the natural ally of the number two player" in a market and a way of leveling the playing field.
In a follow-up blog post, O'Reilly wrote that Yahoo evidently wanted to make this a "coming out party" showcasing its backing of the project. "In fact, I even had a call from David Filo to make sure I knew that the support is coming from the top," he wrote. (While his co-founder Jerry Yang is better known as the public face of Yahoo, Filo is the geekier of the two and has always played a strong behind-the-scenes role in the company's technology decisions.) O'Reilly thinks Yahoo is trying to give itself "geek cred" by reaching out to the open source community with projects like Hadoop and its Yahoo Hack Day events.
Of course, Google has its own outreach programs, which it uses to cement its reputation as a technology leader and boost recruiting. One reason Google gives for not releasing source code for things like its distributed file system is that the software is too deeply intertwined with other components of its operational systems and can't be easily separated out. That's the story Google representatives repeated at Oscon when an audience member asked why they hadn't open sourced more of the software they use to manage their data centers.
For his part, Cutting downplays the idea that Yahoo is using the Hadoop project as some sort of competitive weapon. "While we do compete, we don't compete over this stuff," he says.
Google hasn't explicitly encouraged the development of Hadoop or provided clues about how to produce a MapReduce system, Cutting says. On the other hand, he notes, there's actually a Google-sponsored course at the University of Washington that uses Hadoop to give students hands-on experience using MapReduce for distributed computing.
The Hadoop style of distributed computing is mostly good for batch-oriented analysis of unstructured data (such as compiling an index of the Web), rather than interactive applications (providing an immediate answer to a query), Cutting says. However, yet another Lucene project spin-off, HBase, is trying to replicate Google's BigTable, another technology Google has described publicly. BigTable is a database management system for structured and semi-structured information that builds on the Google File System and MapReduce and uses that structure to provide more interactive answers to queries across very large data sets.
But the HBase effort is still in an early, pre-alpha stage of development, and most of what you can do with Hadoop is inherently batch oriented—aimed at shrinking the time required to perform an analysis from days to hours, but not for delivering an answer within seconds. "If you need to make changes and see them in real time, Hadoop is not the answer," Cutting says. "What it's really great for is just munging through tons of data."
Hadoop includes a version of the distributed file system originally created for Nutch along with a version of MapReduce, both written in Java. As in Google's MapReduce, the Hadoop version automates the division of compute-intensive tasks into smaller sub-tasks that are assigned to individual computers in a cluster. Each computation is broken into two stages: the "Map" function, which produces an intermediate set of results, and the "Reduce" function, usually devoted to sorting and aggregating that data to produce a final result. In the context of compiling a search index, the Map phase would involve thousands of computers each assigned the task of indexing a subset of the Web crawl data, and the Reduce phase would be sorting and merging those results into the final index.
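To make the two stages concrete, here is a minimal single-process sketch of the search index example. It is written in Python for brevity and does not use Hadoop's Java API; the function names (`map_page`, `reduce_postings`, `run_mapreduce`) and the sample pages are illustrative assumptions, and a real cluster would run the map and reduce calls on many machines rather than in one loop.

```python
from collections import defaultdict


def map_page(url, text):
    """Map: emit one (word, url) pair for each distinct word on a page."""
    for word in set(text.lower().split()):
        yield word, url


def reduce_postings(word, urls):
    """Reduce: merge and sort every URL that was emitted for one word."""
    return word, sorted(set(urls))


def run_mapreduce(pages, mapper, reducer):
    """Stand-in for the framework: run the map, group by key, run the reduce."""
    grouped = defaultdict(list)
    for url, text in pages.items():        # on a cluster, each page subset goes to a different machine
        for key, value in mapper(url, text):
            grouped[key].append(value)     # collect intermediate values by key
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


# Hypothetical crawl data: two tiny "pages".
pages = {
    "a.example": "open source search",
    "b.example": "open source grid computing",
}
index = run_mapreduce(pages, map_page, reduce_postings)
```

Here `index["open"]` lists both pages, while `index["grid"]` lists only `b.example`. The map and reduce functions never see the whole data set, which is what lets a framework spread them across machines.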
"It's a very simple programming metaphor, where people can catch on quickly and start using it," Cutting says. "Your first program can be something that can be expressed on a page and does something useful." Those with a Unix background may find the MapReduce technique to be a little bit like using "pipes," a technique for chaining programs together by having the output from one program fed as input into the next.
How Hadoop Works
The Hadoop runtime environment takes into account the fact that when computing jobs are spread across hundreds or thousands of relatively cheap computers, some of those computers are likely to fail in mid-task. So one of the main things Hadoop tries to automate is the process for detecting and correcting for those failures.
A master server within the grid of computers tracks the handoffs of tasks from one computer to another and reassigns tasks, if necessary, when any one of those computers locks up or fails. The same task can also be assigned to multiple computers, with the one that finishes first contributing to the final result (while the computations produced by the laggards get thrown away). This technique turns out to be a good match for massive data analysis challenges like producing an index of the entire Web.
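The two behaviors described above, reassigning a task when a machine fails and running the same task on several machines at once, can be sketched roughly as follows. This is a toy simulation in Python, not Hadoop's actual scheduler: the "workers" are plain functions, and every name here (`WorkerFailed`, `run_with_reassignment`, `run_speculatively`, the sample nodes) is a hypothetical stand-in.

```python
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


class WorkerFailed(Exception):
    """Raised by a simulated worker that dies in mid-task."""


def run_with_reassignment(tasks, workers):
    """Toy master: give each task to a worker; on failure, try the next worker."""
    results = {}
    for i, (task_id, payload) in enumerate(tasks.items()):
        for offset in range(len(workers)):
            worker = workers[(i + offset) % len(workers)]
            try:
                results[task_id] = worker(payload)
                break                      # task done, move on to the next one
            except WorkerFailed:
                continue                   # reassign the task to another worker
        else:
            raise RuntimeError(f"every worker failed on task {task_id!r}")
    return results


def run_speculatively(payload, workers):
    """Run one task on all workers at once and keep whichever answer arrives first."""
    pool = ThreadPoolExecutor(max_workers=len(workers))
    futures = [pool.submit(worker, payload) for worker in workers]
    done, _ = wait(futures, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False)              # do not wait for laggards; their output is discarded
    return next(iter(done)).result()


def healthy_node(payload):
    return sum(payload)


def broken_node(payload):
    raise WorkerFailed("node locked up")


def fast_node(payload):
    time.sleep(0.01)
    return "fast", sum(payload)


def slow_node(payload):
    time.sleep(0.3)
    return "slow", sum(payload)


if __name__ == "__main__":
    tasks = {"a": [1, 2], "b": [3, 4], "c": [5, 6]}
    print(run_with_reassignment(tasks, [healthy_node, broken_node]))
    print(run_speculatively([1, 2, 3], [slow_node, fast_node]))
```

In the first call, task `b` initially lands on the broken node and is handed to the healthy one instead. In the second, the result from `fast_node` is used and the slower copy of the work is simply ignored.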
So far, at least, this style of distributed computing is not as central to Yahoo's day-to-day operations as it is said to be at Google. For example, Hadoop has not been integrated into the process for indexing the Web crawl data that feeds the Yahoo search engine—although "that would be the idea" in the long run, Cutting says.
However, Yahoo is analyzing that same Web crawl data and other log files with Hadoop for other purposes, such as market research and product planning.
Hadoop comes into play for ad-hoc analysis of data: answering a question that wasn't necessarily anticipated when the data gathering system was designed. For example, instead of looking for keywords and links, a market researcher might want to comb through the Web crawl data to see how many sites include a Flickr "badge," the snippet of code used to display thumbnails of recent images posted to the photo sharing service.
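An ad-hoc question like the badge count fits the same Map and Reduce shape. The sketch below is one hedged way to express it in Python, run in a single process. The marker string `flickr_badge` is a made-up placeholder (the article does not show the real badge markup), and the function names and sample crawl records are likewise assumptions for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit

# Hypothetical marker; the actual badge snippet may look quite different.
BADGE_MARKER = "flickr_badge"


def map_badge(url, html):
    """Map: emit (site, 1) for each crawled page that contains the marker."""
    if BADGE_MARKER in html.lower():
        yield urlsplit(url).hostname, 1


def reduce_count(site, ones):
    """Reduce: total the badge-bearing pages found on one site."""
    return site, sum(ones)


def count_badge_sites(crawl):
    """Return (number of sites with a badge, badge pages per site)."""
    grouped = defaultdict(list)
    for url, html in crawl:
        for site, one in map_badge(url, html):
            grouped[site].append(one)
    per_site = dict(reduce_count(site, ones) for site, ones in grouped.items())
    return len(per_site), per_site


if __name__ == "__main__":
    # Made-up crawl records: (url, page source).
    crawl = [
        ("http://photos.example/home", '<div class="flickr_badge"></div>'),
        ("http://photos.example/about", '<div class="flickr_badge"></div>'),
        ("http://news.example/", "<p>no badge here</p>"),
    ]
    print(count_badge_sites(crawl))
```

The point of the sketch is that the researcher only writes the two small functions; the existing crawl data does not need to be reorganized to answer the new question.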
Since its first experiments with 20-node clusters, Yahoo has gone on to test the system with as many as 2,000 computers working in tandem. Overall, Yahoo has about 10,000 computers running Hadoop, and the largest cluster in production use is 1,600 machines.
"We're confident at this point that we can get fairly linear scaling to several thousand nodes," Baldeschwieler says. "We ran about 10,000 jobs last week. Now, a good number of those come from a small group of people who run a job every minute. But we do have several hundred users."
Although Yahoo had previously created its own systems for distributing work across a grid of computers for specific applications, Hadoop has given Yahoo a generally useful framework for this type of computing, Baldeschwieler says. And while there is nothing simple about running these large grids, Hadoop helps simplify some of the hardest problems.
By itself, Hadoop does nothing to enhance Yahoo's reputation as a technology innovator, since by definition this project is focused on replicating techniques pioneered at Google. But Cutting says that's beside the point. "What open source tends to be most useful for is giving us commodity systems, as opposed to special sauce systems," he says. "And besides, I'm sure we're doing it differently."