This might be a helpful post for those who are interested in the latest developments in Big Data technologies, with Hadoop front and center among other tools and utilities.
For the first time in my cloud experience, I can honestly say that Big Data and Hadoop are the killer apps for cloud computing. The choice between doing it yourself and getting it through a Hadoop distro comes with a clear cost-benefit analysis, something I have rarely found at such a granular level, especially if you are looking to get training or run a quick proof of concept. My curiosity drove me to test-drive the five main players who had accessible technologies for eval purposes:
- Cloudera (a Yahoo! Spin off),
- Microsoft’s HDInsight and
- The final distro is a DIY one, which I also experimented with, built on CentOS with the latest Hadoop release straight from the Apache Hadoop site.
There are major differences between 0.1.1 and 0.2.2 of Hadoop. There are also major differences in supporting tools based on your vendor of choice. Although this article is not a bake-off or a showdown, I will cover the good, the bad and the ugly. This might be split into a series of posts, and it will give the reader the fruits of my labor, whether you are an IT Manager, an Architect, a Developer, a Data Analyst or a Statistician.
What is Big Data?
I won’t spend much time explaining obvious things you can Google or YouTube. In very simple terms, Big Data is what your SQL Server cannot handle no matter how much disk space, CPU and RAM you throw at it, and even if it could, it would be a very costly proposition. The key instead is scaling out rather than scaling up: you add more commodity machines instead of beefing up your SQL box with heaps of memory, cores, etc.
Why Scaling Out Versus Scaling Up?
Databases were designed for sequential reads and writes, not for random reads and writes. The reason is that they were heavily shaped by the hardware architecture of storage devices, which relied on needles and spindles spinning fast enough to fetch the data we need. However, with the advent of SSDs, this no longer applies (Uncle Bob has an excellent talk on this if you are interested). With SSD support now ubiquitous, the cost of random reads and writes is no longer a factor in fetching data, and the performance hits of the past are gone!
Big Data comes in different flavors: structured, unstructured, dirty (as in missing elements) and volatile, and it comes at you like water from a firehose. Think of Twitter, Facebook, or the thousands of Kindles which log every click, page flip, etc., as examples where data is constantly updated on a massive scale.
These characteristics are called the 4 V’s: Volume, Velocity, Veracity and Value of data. Your job is to create order out of the data chaos, and you do so with a number of predefined algorithms, plus new ones.
So, how do you get a handle on such massive amounts of data? Enter MapReduce. MapReduce is an algorithm, simple or complex depending on the job, whose primary function is “mapping” the input (words, for example) into (Key, Value) pairs, sorting and shuffling that output by key, and then aggregating the values for each key in the Reduce routine. Confused? I thought so!
The best way to explain it is through an example: Word Count. Word Count is the “Hello World” example that Hadoop newbies run in order to test their installation, get an overview of how Hadoop works, and see how Hadoop breaks the problem down into small batch jobs which are submitted, executed, and written out.
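To make the (Key, Value) idea concrete, here is a minimal sketch of Word Count in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (map_words, shuffle, reduce_counts) and the sample lines are my own, made up for illustration. It only mimics the three stages on a single machine.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle/sort step: group all values by key, with keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))

def reduce_counts(key, values):
    """Reduce step: aggregate all the values collected for one key."""
    return key, sum(values)

# Hypothetical input, standing in for lines read from a file.
lines = ["to be or not to be", "that is the question"]

pairs = [pair for line in lines for pair in map_words(line)]
grouped = shuffle(pairs)
word_counts = dict(reduce_counts(key, values) for key, values in grouped.items())

print(word_counts)
```

On a real cluster the map and reduce calls would run on different machines and the shuffle would move data across the network, but the shape of the data at each stage is the same.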
An Even Simpler Illustration of MapReduce
Imagine that you have a black box. You feed the box a corpus of documents, say one of Shakespeare’s masterpieces. The Map portion takes the document and splits it into manageable chunks, say 1,000 small pieces which together make up the entire book. It then counts the words in each chunk and outputs a file with the words and their counts for each “chunk”.
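If you want to peek inside the box, the Map portion could be sketched roughly like this in plain Python. The helper names (split_into_chunks, map_chunk), the chunk size and the sample sentence are assumptions of mine for illustration, not anything Hadoop defines.

```python
from collections import Counter

def split_into_chunks(words, chunk_size):
    """Cut the list of words into consecutive pieces of at most chunk_size words."""
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

def map_chunk(chunk):
    """Count the words inside a single chunk, independently of all other chunks."""
    return Counter(chunk)

# Hypothetical stand-in for the book; a real run would read a large file.
text = "the quick brown fox jumps over the lazy dog the end"
words = text.split()

chunks = split_into_chunks(words, chunk_size=4)
partial_counts = [map_chunk(chunk) for chunk in chunks]

for number, counts in enumerate(partial_counts):
    print(number, dict(counts))
```

Each chunk is counted without looking at any other chunk, which is what lets the work be spread over many cheap machines.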
Enter a Reduce
A Reducer takes the output of the Mapper and further aggregates the word counts from each chunk, repeating the process until we have just one document which holds the count of each word. Voila! We get a count for every word written in the book.
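The "repeat until one document is left" part can be sketched as a pairwise merge. This is a single-machine illustration in plain Python, not how Hadoop schedules reducers; the function name reduce_pairwise and the sample per-chunk counts are hypothetical.

```python
from collections import Counter

def reduce_pairwise(partials):
    """Merge partial counts two at a time until a single result remains."""
    partials = list(partials)
    if not partials:
        return Counter()
    while len(partials) > 1:
        merged = []
        for i in range(0, len(partials), 2):
            pair = partials[i:i + 2]  # one or two Counters
            merged.append(sum(pair, Counter()))  # adding Counters adds their counts
        partials = merged
    return partials[0]

# Hypothetical per-chunk outputs of the Map portion.
partials = [
    Counter({"the": 1, "fox": 1}),
    Counter({"the": 1, "dog": 1}),
    Counter({"the": 1}),
]

total = reduce_pairwise(partials)
print(dict(total))
```

Because adding counts gives the same answer in any order, the merges within a round do not depend on each other and could run in parallel.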
What Sort of Skills Are Needed to Get Started?
Big Data is a team sport. You have to have multiple skills on the team. Rachel Schutt and Cathy O’Neil, in their great book Doing Data Science: Straight Talk from the Frontline, list multiple skills such as:
- Computer Science
- Statistics and Statistical Modeling
- Data Integration (I added this one since in most of my experiences you had to pull data from other sources including relational databases)
- Machine Learning
- Data Visualization
In Part II of the post, I will delve deeper into each distro, discuss the tools and the pros and cons of each, and finally look at where the future of this amazing technology is headed.