Open Source Programming: Open Source 9 big data technologies

Monday, June 25, 2012

Open Source 9 big data technologies

Big Data is booming these days, as more and more companies realize the benefit of storing data and leveraging it for useful insights. At the forefront of this Big Data revolution is open source technology, since the majority of Big Data companies prefer it over closed source technology. Here are nine open source Big Data technologies that you should keep an eye on:

Apache Hadoop
Apache Hadoop was originally created by Doug Cutting to support his work on Nutch, an open source Web search engine. Hadoop is basically a MapReduce facility and a distributed file system merged together, and was initially designed to meet Nutch’s multimachine processing requirements. The basic principle behind Hadoop is that it splits big data into pieces and distributes them over a series of nodes running on commodity hardware.
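To make the split-and-distribute idea concrete, here is a minimal single-process sketch of the MapReduce pattern in plain Python. It is not Hadoop code and uses no Hadoop API: the function names (split_into_chunks, map_phase, shuffle, reduce_phase, word_count) and the sample lines are made up for illustration, and the "nodes" are just lists in memory.

```python
from collections import defaultdict


def split_into_chunks(records, num_nodes):
    """Deal records round-robin into one chunk per simulated node."""
    chunks = [[] for _ in range(num_nodes)]
    for i, record in enumerate(records):
        chunks[i % num_nodes].append(record)
    return chunks


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in the chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def shuffle(mapped_chunks):
    """Group all emitted values by key across every node's output."""
    grouped = defaultdict(list)
    for pairs in mapped_chunks:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(lines, num_nodes=3):
    chunks = split_into_chunks(lines, num_nodes)
    # On a real cluster each chunk would be mapped on a different machine.
    mapped = [map_phase(chunk) for chunk in chunks]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    sample = [
        "big data is big",
        "open source big data",
        "data on commodity hardware",
    ]
    print(word_count(sample))
```

Each map call only sees its own chunk, so the work can be spread across machines. Only the shuffle step needs to bring together values that share a key.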

R
Designed by Ross Ihaka and Robert Gentleman at the University of Auckland, NZ in 1993, R is an open source programming language that has become a de facto standard for statistical analysis of large data sets, as it was designed specifically with statistical computing and visualization in mind.

Cascading
Cascading is an open source abstraction layer for Hadoop that serves as an alternative to writing MapReduce jobs directly. Cascading allows data processing workflows to be written in any JVM-based language, with the goal of concealing the inherent complexity of MapReduce. This makes tasks such as log file analysis, bioinformatics, and machine learning easier for people who don’t need or don’t want to bother with the nitty-gritty of MapReduce jobs.

Scribe
Developed by social media giant Facebook and released in 2008, Scribe was designed to aggregate log data streamed in real time from a large number of servers. Its original purpose was to handle Facebook’s own scaling problems. So far, Scribe has been successful and currently handles tens of billions of messages a day.

ElasticSearch
ElasticSearch is an open source search server developed by Shay Banon and based on Apache Lucene. ElasticSearch’s main selling point is that it works with little or no special configuration and scales out across nodes while still supporting near real-time search and multitenancy. It is currently used by a number of high-profile companies, notably Mozilla and StumbleUpon.

Apache HBase
Designed to run on top of the Hadoop Distributed File System, Apache HBase is an open source, non-relational, column-oriented distributed database modeled after Google’s Bigtable. HBase’s most notable user is Facebook, which adopted the platform in 2010 for use in its messaging service.
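The Bigtable-style data model can be pictured as a sparse, sorted map from row key to column to timestamped values. The toy class below sketches that idea in plain Python. It is not the HBase client API: TinyWideColumnTable, its methods, and the "family:qualifier" strings are hypothetical names chosen for illustration, and everything lives in memory on one machine.

```python
from collections import defaultdict


class TinyWideColumnTable:
    """Toy in-memory model of a wide-column table (illustration only)."""

    def __init__(self, column_families):
        # Column families are fixed up front; qualifiers inside them are free-form.
        self.column_families = set(column_families)
        # row key -> "family:qualifier" -> list of (timestamp, value)
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row_key, column, value, timestamp):
        family = column.split(":", 1)[0]
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][column].append((timestamp, value))

    def get(self, row_key, column):
        """Return the newest value stored in a cell, or None if it is empty."""
        versions = self.rows.get(row_key, {}).get(column)
        if not versions:
            return None
        return max(versions, key=lambda version: version[0])[1]

    def scan(self, start_row, stop_row):
        """Yield row keys in sorted order within [start_row, stop_row)."""
        for key in sorted(self.rows):
            if start_row <= key < stop_row:
                yield key


if __name__ == "__main__":
    table = TinyWideColumnTable(["msg"])
    table.put("user1", "msg:subject", "hello", timestamp=1)
    table.put("user1", "msg:subject", "hello again", timestamp=2)
    table.put("user2", "msg:body", "hi there", timestamp=1)
    print(table.get("user1", "msg:subject"))
    print(list(table.scan("user1", "user2")))
```

Rows only store the columns they actually use, which is why this kind of layout suits sparse data such as per-user message metadata.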

Apache Cassandra
Another one of Facebook’s aces, Apache Cassandra was originally developed as a NoSQL data store to power the social network’s Inbox Search feature. Facebook has since abandoned Cassandra in favor of HBase, but it is still used by a number of high-profile companies such as Netflix, particularly as a back-end database for its streaming service. Cassandra is currently available under the Apache License 2.0.

MongoDB
MongoDB is a popular open source NoSQL data store that keeps structured data as JSON-like documents with dynamic schemas, in a format called BSON (Binary JSON). Created by 10gen, a company started by DoubleClick veterans, MongoDB is currently used by several large enterprises such as Craigslist, Disney Interactive Media Group, Etsy, The New York Times, and MTV Networks.
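A dynamic schema simply means that documents in the same collection do not have to share the same fields. The sketch below illustrates that with plain Python dictionaries and the standard json module. It does not use MongoDB or BSON: the collection contents and the find helper are hypothetical, and the helper only matches top-level fields by equality.

```python
import json

# Three documents in one "collection", each with a different shape.
collection = [
    {"_id": 1, "name": "Ada", "email": "ada@example.com"},
    {"_id": 2, "name": "Grace", "languages": ["COBOL", "FORTRAN"]},
    {"_id": 3, "name": "Linus", "email": "linus@example.com",
     "address": {"city": "Portland"}},
]


def find(docs, query):
    """Return documents whose top-level fields equal every key/value in query."""
    return [
        doc for doc in docs
        if all(doc.get(key) == value for key, value in query.items())
    ]


if __name__ == "__main__":
    # Documents serialize to JSON text without any shared table definition.
    print(json.dumps(collection[1]))
    print(find(collection, {"name": "Grace"}))
```

Adding a new field to one document requires no migration of the others, which is the main practical appeal of this model.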

Apache CouchDB
Yet another open source NoSQL database, CouchDB uses a blend of JSON, JavaScript, MapReduce, and HTTP to store and query data. The platform was originally created in 2005 by former IBM developer Damien Katz as a storage system for a large-scale object database. One of CouchDB’s more popular users is the British Broadcasting Corporation, which uses it for its dynamic content platforms.
