Monday, November 10, 2008

Yahoo transforms data mining with open-source Hadoop

BEHIND Yahoo's push to open up web search and advertising is software powerful enough to sort through the entire US Library of Congress in less than half a minute.

The software, called Hadoop, is part of Yahoo's massive computing grid and is transforming the way Yahoo and corporate giants such as IBM extract meaning from enormous streams of data.
Universities are also using the code - an open-source version of software Google relies on for daily operation - to train a new generation of computer scientists and engineers.

"It makes it possible to actually take advantage of all the computers we have hooked together," Yahoo search and advertising sciences vice-president Larry Heck says.

Hadoop improves the relevance of ads Yahoo shows on the internet by analysing the company's endless flow of data - now more than 10TB daily - on the fly. As users click from Yahoo Mail to Yahoo Search to Yahoo Finance and back again, Hadoop helps figure out what ad, if any, is likely to catch someone's attention.

The key lies in mining insights from mind-boggling amounts of data. If a woman repeatedly reads reviews of sports utility vehicles, then clicks on automotive classifieds and then orders a book on helping a child adjust to kindergarten, she might be in the market for a family-size car, according to a Yahoo sales presentation.

As part of the push for more openness, Yahoo will be using the technology to boost ad sales not only on its own websites but on sites owned by the 796 members of a newspaper consortium working with the search giant to sell more ads at better prices.

"In some ways, perhaps it is even more targeted than search advertising," says Leon Levitt, digital media vice-president for Cox Newspapers, a consortium member.

For Yahoo, an innovative approach to internet advertising is a major accomplishment. When Yahoo launched its Hadoop project in January 2006, it was selling search advertising for half of what Google charged and watching its share of internet searches dwindle.

Hadoop was first put to work building Yahoo's web index - the biggest computing problem inside Yahoo. Since then, a team of engineers has tuned the software, and researchers inside and outside of Yahoo have begun using it to experiment with giant data sets.

"All of a sudden, instead of waiting overnight, people could get the results of their experiments in a minute," says Doug Cutting, a work-at-home dad who hacked out the first version of Hadoop in his home in Sonoma County, California, as part of an open-source search project.
