In Part I, I gave an overview of Big Data: what it is, the (ideal) skills you need to get started (a lot!), and some of the key players. In this post, I will cover some of the most important technologies, some examples of Big Data applications, and the architecture which underlies the Hadoop/Big Data framework.
A Historical Background
This paragraph should have been in the first post. However, it is worth including for those who, like me, want to know how “things came to be” as opposed to “they are just here, so deal with it!!”
It all started at Google, while its engineers were searching for a way to index the web. In a very simplistic view, they aimed to put together (Key, Value) pairs of words and the URLs of the sites containing those words. It sounds simple on paper; however, once you consider the sheer number of sites, the complexity that comes with that volume of data begins to crystallize.
Soon after Google successfully designed the two main components behind such an effort, the Google File System (GFS) and the MapReduce framework, and probably once it had moved past them so that they were no longer a trade secret, it published papers on the technology used to index the internet! The uniqueness of Google’s problem lies in the fact that the solution required horizontal scalability (i.e. many nodes doing smaller tasks) as opposed to vertical scale-up (i.e. more and more beefed-up servers with more memory, etc.).
Once Google let the proverbial cat out of the bag, two guys who were working on a similar problem, Doug Cutting and Mike Cafarella, had a massive a-ha moment and started developing Hadoop (in 2005, according to Wikipedia), with Yahoo! soon backing the effort. They began with the Hadoop Distributed File System (HDFS) and the MapReduce framework, an implementation of what resembled the Google approach.
So, what are the Core Hadoop Components?
There are two main components at the core of Hadoop. Although there are many tools available, their purpose is to facilitate interaction with those two components.
Hadoop Distributed File System (HDFS)
HDFS is not a file system like FAT or NTFS. It’s a layer which sits atop the traditional file system and handles the loading, distribution and replication of files. What you need to know about HDFS is that when you want to add a node to your Hadoop cluster, you have to install HDFS on it so it can participate in the cluster operations. Its bookkeeping role is loosely like that of the File Allocation Table (FAT): it keeps track of where each piece of a file lives. HDFS splits big files into chunks (blocks) and distributes them across the nodes in a duplicate fashion, i.e. each chunk of data is sent to 2-3 nodes at a time. Why? Because Hadoop was designed for failure! Sounds strange? If any one node dies, another copy of its data is still available elsewhere in the cluster. I may squeeze in a paragraph or so about functional programming later, since the Map function closely resembles the functional programming paradigm.
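To make the splitting and duplication idea a bit more concrete, here is a minimal sketch in plain Java. It does not use any Hadoop API. The class name HdfsBlockSketch, the node names, the 128 MB block size and the round-robin placement are all my own assumptions for illustration; real HDFS uses a smarter placement policy than this.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a toy model of block splitting and replication, not Hadoop code.
public class HdfsBlockSketch {

    // Split a file of fileSize bytes into blocks of at most blockSize bytes
    // and return the size of each block in order.
    public static List<Long> splitIntoBlocks(long fileSize, long blockSize) {
        List<Long> blocks = new ArrayList<>();
        for (long offset = 0; offset < fileSize; offset += blockSize) {
            blocks.add(Math.min(blockSize, fileSize - offset));
        }
        return blocks;
    }

    // Assign every block to `replication` distinct nodes using simple round-robin
    // (an assumption made for this sketch, not the real HDFS placement policy).
    public static Map<Integer, List<String>> placeReplicas(
            int blockCount, List<String> nodes, int replication) {
        if (replication > nodes.size()) {
            throw new IllegalArgumentException("not enough nodes for the requested replication");
        }
        Map<Integer, List<String>> placement = new LinkedHashMap<>();
        for (int block = 0; block < blockCount; block++) {
            List<String> holders = new ArrayList<>();
            for (int r = 0; r < replication; r++) {
                holders.add(nodes.get((block + r) % nodes.size()));
            }
            placement.put(block, holders);
        }
        return placement;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A hypothetical 300 MB file and an assumed 128 MB block size.
        List<Long> blocks = splitIntoBlocks(300 * mb, 128 * mb);
        List<String> nodes = List.of("node-a", "node-b", "node-c", "node-d");
        Map<Integer, List<String>> placement = placeReplicas(blocks.size(), nodes, 3);
        placement.forEach((block, holders) ->
                System.out.println("block " + block + " (" + blocks.get(block) / mb + " MB) -> " + holders));
    }
}
```

With three copies of every block, losing any single node in this toy cluster still leaves two copies of each block it held, which is the whole point of "designed for failure".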
Primary and Secondary Name Nodes
As with every file system, you need to know where stuff is located. HDFS implements this by designating a Hadoop server as the primary Name Node, which keeps an inventory of where data is located, such that if a node decides to take a vacation, other nodes holding copies of its data can step in and replace it.
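The inventory idea can be sketched as a simple lookup table from chunk to nodes. This is a toy model in plain Java, not how the real Name Node is implemented. The class name NameNodeSketch, its method names and the block and node identifiers are hypothetical, chosen only for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only: a toy inventory of which nodes hold which block.
public class NameNodeSketch {

    // block id -> nodes currently known to hold a copy of that block
    private final Map<String, Set<String>> blockLocations = new HashMap<>();

    // Record that a node holds a copy of a block.
    public void registerReplica(String blockId, String node) {
        blockLocations.computeIfAbsent(blockId, id -> new TreeSet<>()).add(node);
    }

    // A node "took a vacation": forget every copy it was holding.
    public void markNodeDead(String node) {
        for (Set<String> holders : blockLocations.values()) {
            holders.remove(node);
        }
    }

    // Which nodes can still serve this block?
    public Set<String> locate(String blockId) {
        return blockLocations.getOrDefault(blockId, Collections.emptySet());
    }

    // Blocks that now have fewer copies than the target and need re-copying.
    public List<String> underReplicated(int targetCopies) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : blockLocations.entrySet()) {
            if (entry.getValue().size() < targetCopies) {
                result.add(entry.getKey());
            }
        }
        Collections.sort(result);
        return result;
    }

    public static void main(String[] args) {
        NameNodeSketch nameNode = new NameNodeSketch();
        for (String node : List.of("node-a", "node-b", "node-c")) {
            nameNode.registerReplica("blk_1", node);
        }
        nameNode.markNodeDead("node-a");
        System.out.println("blk_1 is still on " + nameNode.locate("blk_1"));
        System.out.println("needs re-copying: " + nameNode.underReplicated(3));
    }
}
```

The data is still reachable through the surviving copies, and the inventory tells the cluster which chunks need to be copied again to get back to full strength.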
MapReduce
MapReduce is the essence of a Hadoop system. It is the business logic that translates statistical models or an execution plan into what we set out to do: answer questions. The logic is usually encapsulated in a Java application with, you guessed it, a Map and a Reduce function specifying what data types to expect, what to do with outliers and missing data, and so forth.
The classic example is a Word Count Java application. In a nutshell, the Mapper tokenizes a small chunk of the data and emits each word with a count. The Reducer takes the output of the Map step, which is many (Key, Value) pairs of the form (word, count), and aggregates the counts per word. Part of the app configures the job and submits it to Hadoop, which is something you can find out on your own.
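Since the flow is easier to follow in code, here is a minimal sketch of the Word Count logic in plain Java. To keep it runnable on its own it deliberately does not use the Hadoop API, so there are no Mapper or Reducer classes and no job configuration. The class name WordCountSketch and the explicit shuffle step are my own assumptions for illustration; in a real cluster the framework does the grouping between Map and Reduce for you.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Illustrative only: the Map -> group -> Reduce flow simulated in one process.
public class WordCountSketch {

    // Map step: tokenize one chunk of text and emit a (word, 1) pair per token.
    public static List<Map.Entry<String, Integer>> map(String chunk) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        StringTokenizer tokens = new StringTokenizer(chunk.toLowerCase());
        while (tokens.hasMoreTokens()) {
            pairs.add(new SimpleEntry<>(tokens.nextToken(), 1));
        }
        return pairs;
    }

    // Shuffle step: group all emitted counts by word (done by the framework in Hadoop).
    public static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: aggregate the counts emitted for a single word.
    public static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Run the whole flow over a list of chunks, as if each chunk went to a different Mapper.
    public static Map<String, Integer> wordCount(List<String> chunks) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String chunk : chunks) {
            emitted.addAll(map(chunk));
        }
        Map<String, Integer> totals = new TreeMap<>();
        shuffle(emitted).forEach((word, counts) -> totals.put(word, reduce(word, counts)));
        return totals;
    }

    public static void main(String[] args) {
        List<String> chunks = List.of("the quick brown fox", "the lazy dog", "the fox");
        // prints {brown=1, dog=1, fox=2, lazy=1, quick=1, the=3}
        System.out.println(wordCount(chunks));
    }
}
```

Each call to map only ever sees its own chunk, which is what lets Hadoop run many Mappers in parallel on different nodes before the Reducers combine the results.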
Obviously, businesses do not build such massive systems without ample justification of how it will help them solve current problems (bottom and top line) and plan solutions for the mid- to long-term issues they face. They aim to use the massive parallel compute power at their disposal to answer questions about customer or system behavior using large amounts of structured, semi-structured and unstructured data. Think of a cable or mobile phone company asking itself why people are not renewing their contracts (the churn rate), or a large agricultural co-op deciding how much of a certain fruit or vegetable to grow next season (based on data other than the futures contract prices from the exchange), or a local police force trying to understand why the crime rate is high at a certain location, and so forth.
So, Now What? What’s the Big Deal?
So now we have this great framework being used by Internet giants such as Facebook and so on. How do people like you and me, who work for consulting companies, position Big Data and its benefits? The answer comes from one of my favorite researchers/professors at MIT, whose work I studied with zeal while in Economics school. Erik Brynjolfsson studied the performance of companies that excel at data-driven decision-making and compared it with the performance of other firms. And guess what the outcome was? He and his colleagues found that productivity levels were as much as 6 percent higher at such firms than at companies that did not emphasize using data to make decisions, which did not surprise me at all.
So as a business executive, entrepreneur, consulting firm, or software provider, there is a huge cost to doing nothing! Case closed!! I would really like to hear your thoughts on what you would like to see in Part III. Right now, I am planning to highlight some of the technologies which make the use of Hadoop easier, especially for Data Analysts; however, I am open to ideas. Hope you enjoyed the post.