How the Azure Analytics (R)evolution will Alter the Hard Truth about Data Science

If you are in technical sales or trying to break into Big Data or Data Science, this post is a must-read for you. It is based on real-life examples, research and analysis from a practitioner’s standpoint.

There is a sad but realistic truth about Data Science (DS henceforth) and Big Data (BD). There is an unfortunate dichotomy in this space: those who are on the inside track (the Insiders), and those who are trying to get in (the Outsiders). In the world of the “Insiders” there exists a persistent message, which has garnered support among employers, that amounts to shutting out the Outsiders by raising the entry barrier. How? By raising the number of skills required. I came up with a list of must-have skills for a BD or DS role based on my own analytics exercise on hundreds of DS/BD job descriptions I gathered through web scraping. So guess how many skills are required? Twenty-two! Setting the bar at twenty-two skills, and counting, is ridiculously unrealistic!

 Enter Microsoft!!

The soon-to-be DS/BD game changer happened quietly, with not much coverage in the tech media world. This was Microsoft’s acquisition of Revolution Analytics. So, what does the Microsoft acquisition have to do with Outsiders wanting to get into Data Science? A lot!! Let me explain.

The one fundamental skill which is a must-have in DS is Statistics! And what does the DS community use as a Statistics programming language? R! Here is where the story gets interesting.

A branch of DS is Machine Learning or Predictive Analysis. This is the true value-add for any BD initiative: to be able to make a data-driven decision about future strategy.

To do this in R, the very popular Analytics language, you need to go through training and become an expert before you can even get close to solving problems. R is archaic. It is a reincarnation of the languages S and S-Plus, a lineage that is about 40 years old.

The great news is that Microsoft just acquired Revolution Analytics (hence the post title). Revolution R (a product of Revolution Analytics) is known for enabling parallel processing of R resulting in massive performance gains.

Today, R is supported in Azure Machine Learning in its current format. So why does this matter? The reason is very simple. There is talk that Microsoft will probably take R, modernize it, make it a first-class language, add it to Visual Studio with IntelliSense support, and allow everyone to develop Azure ML solutions in Visual Studio, ready to be published directly to Azure. The net effect:

  • It will be nothing short of a coup for the Outsiders in the DS/BD field, who can now develop solutions without paying the cost of the massive R learning curve.
  • Azure ML already provides a lot of the R functionality out-of-the-box today, so it’s a natural extension of the existing functionality.
  • Azure ML makes it very simple to share code, models, and common problem solutions, thus:
      • I will get my solutions to market 10X faster
      • I can profit from my work if I choose to publish it in the Azure Gallery
  • It will change the market dynamics, as it will ease the short supply of BD/DS talent.
  • It will unleash the genius of statisticians and business users with minimal programming experience, who can conduct their own experiments.
  • Coupled with HDInsight, which has access to the Analytics APIs, this is the most efficient BD solution on the market today.


What’s more: keep an eye on this space. As I mentioned, if you are in Microsoft technical sales, this will be your ticket to moving large enterprises onto Azure with a strategy rather than just a plain tactical need!! If you are evaluating DS technology today, you probably know very little about Azure ML and the functionality it provides. I would highly encourage you to evaluate Azure ML before heading to Cloudera, MapR or any other Hadoop vendor. You will not be disappointed!


About Me:

I am a freelance consultant with over 24 years of experience in IT, strategy and Economics. I specialize in Cloud, Data Architecture and DS, Machine Learning, corporate strategy and provide architectural consulting, training, technical research, data-driven decision making solutions based on economics-based statistical methods grounded in scientific frameworks.

The 22 Skills List of Data Science:

  1. R Programming
  2. Getting and Cleaning Data
  3. Exploratory Data Analysis
  4. Reproducible Research
  5. Statistical Inference
  6. Regression Models
  7. Practical Machine Learning
  8. Developing Data Products
  9. Data Visualization
  10. DBA
  11. Hadoop (including the Azure HDInsight technology stack)
  12. Orchestrate data workflows
  13. Data ingestion/curation using Pig, Hive, Sqoop or other Hadoop tools
  14. Hadoop cluster configuration using Hadoop big data architecture
  15. High-level design using Business Analysis and Microsoft Azure Platform Knowledge
  16. Blob Storage API Knowledge
  17. Metadata management tools
  18. Model client data
  19. Mapping
  20. Data profiling – Information analyzer/Excel preferred
  21. Decide how data is going to be used to make decisions, and
  22. Knowledge of both tools and methods from statistics, machine learning, and software engineering, as well as being human and showing persistence

The Silent Ascension of Microsoft Azure Cloud

If you are developing, architecting or migrating on-prem apps to Azure, you really need to read this article, unless you already follow Microsoft tech news!

Over the past few months, Microsoft has released more than 60 features, enhancements, and services for its flagship Azure cloud offering, and it continues to do so. Recently, a client asked me to deliver a workshop on Azure App Service. The client is in the midst of the cloud-ization of their applications, due mostly to the aging hardware and costly VPN network gear common in IT shops today. From the sound of it, the name App Service gives the impression that it’s just another Azure service, in other words a SaaS offering. That is not even close. Furthermore, all websites hosted in Azure will be, or already are, converted into an App Service, so you really need to know more about it.

So, what is NOT an App Service?

Just to clear up any misunderstanding, here is a partial list of what is NOT an App Service: A stand-alone website. A Software-as-a-Service offering. A custom Virtual Machine you can RDP into to manage. A bunch of APIs for your App. An IIS-management portal for your App.

What IS an App Service?

An App Service is a Platform-as-a-Service, and a very sophisticated one in terms of abstracting the complexities of hardware and OS for the developer while allowing for easy-to-set performance variables. An App Service has three main components:

  1. A Service Plan
  2. A Resource Group, and
  3. An App

A Service Plan: This used to be called the Website Hosting Plan. A Service Plan is comprised of the Compute Tier, which is either Basic, Standard or Premium. The tiers come pre-configured; you just need to specify the instance count. This allows you to scale according to the load you believe your app will handle. Auto-scale is ready for you to configure. The key here, and the most important takeaway, is that you can have Service Plans for Production, Test, Dev, etc., and can move the app in and out of these Service Plans!

A Resource Group: This is where you will have all of your resources. If you need a database for example, you will create one and assign it to the resource group your app is in. Remember that you have to have your resources in the same Geo so you don’t incur extra charges. You need to have your RG ready before creating the App.

The App: This is the application, which could be one or all of the following: a Web, Mobile, API, or Logic App. Apps can be moved in and out of Resource Groups but have to be assigned to a Service Plan. Several apps can co-exist in the same plan and share the scale you set at the start.

I hope you enjoyed the article and please feel free to email me if you have any questions.

 About Me:

I have over 22 years of experience in IT and a recent Master’s Degree in Business Economics. My involvement in cloud computing dates back to 2005, when I was a PM for BizTalk Server and the group at the time was planning and strategizing about how and what functionality to cloud-enable. The first incarnation of such functionality was the Service Bus offering. That was 10 years ago. Today’s BizTalk Services are at the core of developing BizTalk applications which span multiple geographies and take full advantage of the APIs made available for consumption by the other Azure resources.

The Big Data Posts: The Last Word….and The Fun Ones!!

Sorry, but I couldn’t let go =). Seriously though, I have some more information which I felt I had to share with my friends at LinkedIn. Information which is more like “Lessons Learned” on what NOT to do and what NOT to expect with Hadoop and Big Data from Day 1 forward!

Hello World!!

What is Hadoop’s equivalent of “Hello World”? The answer is Word Count. Why so? Because Word Count (I showed a code shot of a MapReduce Word Count in Part II) is supported by every vendor. In the Hello World tutorial, you download a book from Project Gutenberg or any copyright-expired book from Google Books (Cloudera comes with Shakespeare’s Romeo and Juliet in .TXT format), then you “load” the file into HDFS, and Hadoop will spit out each word in the book and its count. One of my favorites is an inverted index example which mimics a search engine: the results give you each word and either the names of the books it’s in or the count of the books it is in. NOTE: The OS file system is completely different from the Hadoop File System, HDFS.
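
To make the inverted index idea concrete, here is a minimal sketch in plain Python (it is not Hadoop code). The function name `build_inverted_index`, the two sample “books” and their one-line texts are made up for illustration; a real job would read the files from HDFS and run as Map and Reduce tasks.

```python
import re
from collections import defaultdict


def build_inverted_index(books):
    """Map each word to the set of book titles that contain it.

    `books` is a dict of {title: text}; the sample data below is hypothetical.
    """
    index = defaultdict(set)
    for title, text in books.items():
        # Lower-case the text and pull out runs of letters and apostrophes.
        for word in re.findall(r"[a-z']+", text.lower()):
            index[word].add(title)
    return index


books = {
    "Romeo and Juliet": "But soft, what light through yonder window breaks?",
    "Hamlet": "To be, or not to be, that is the question. What a piece of work is a man!",
}

index = build_inverted_index(books)
print(sorted(index["what"]))   # names of the books containing "what"
print(len(index["light"]))     # count of the books containing "light"
```

The same dictionary answers both flavors of the question: the set gives you the book names, and its length gives you the count.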

Now onto the Fun Stuff

In Part III, I answered the question: “What do developers do after shipping ANY software product?”  The answer was: “They develop the interfaces and tools to make it usable!!” Why? Because Version 1 is almost always about entering the market. All the free education training resources? They are debug tools, workarounds, and persuasion to try and keep those developers around.

I once heard Bill Gates say “We [Microsoft] excel at abstracting complexity. Take any task in Excel for example, go back 2-3 years, and compare the time difference it takes to achieve the same results!” The statement has stuck with me to this day because it is very important. It’s about Buy-versus-Build, in-house IT specialization decisions, IT investment choices, and so on. We continue to debate all of these arguments to this date.

And Hadoop is no different and will not be!! There is a race going on right now over who can provide a better abstraction or a more specialized-purpose utility layer on top of Hadoop MapReduce, unstructured data, and HDFS, all through utilities such as Hive, HBase, Pig, Flume, Sqoop, Ambari, Kafka, Mahout, YARN, Oozie, Storm, ZooKeeper, and Phoenix, just to mention a few. You can find details or summaries of each at Http:// Now, on to the reasons for such progress, which is ironic when “progress” means making something work with ease, or work, period. The reasons for the Progress:

Reason # 1 – Math, Statistics, and Statistical Modeling

Software providers know that the lay programmer, who probably graduated with an Art History or Psychology major (and such graduates permeated the software industry long ago), cannot write a Logistic Regression Model or build a model from scratch to save his life. The result: build him/her a utility. Remember VB?! The OO community was on the brink of rioting at 1 Microsoft Way, Redmond, WA over such a product, which had nothing to do with OO (well, a little bit); however, it did take off. When I asked my professor in college about it (he was a brainiac, one of the originals on IBM’s Deep Blue chess program), he answered: “My grandmother can write code, but she cannot optimize it, so stick to the optimization part!!”

Reason # 2 – MapReduce is just plain difficult and alienates many data analysts with invaluable domain knowledge

This is why companies are already using technologies like Pig, Hive, Impala, and others: software whose purpose is to spare the programmer from having to translate the math solution into a Java MapReduce job! The problem: good luck debugging the generated code. On top of that comes more dependence (vendor lock-in) on the major players through support contracts, so bye-bye to the free open source software you touted to your boss, hello maintenance contracts, and time to revamp the resume! JK.. (I did not mention products such as Datameer, RedPoint or LigaData, and the list is very long.)

Reason # 3 – It’s so Buggy, You Have to Start in the Cloud

This is a no-brainer. Believe me. Don’t even think about building an on-premise 100 node cluster for testing out the need for Big Data at your company. You might extend from the cloud to on-premise but your strategy should be “Cloud-it-and-forget-it” to start with.

Honestly, Big Data is the first application I have seen where your mode of execution should be Cloud-First and everything else later. So, if someone asks you about your cloud strategy, now you can pretend that you actually have one!! Why am I pushing this here? It’s because I can provision a Hortonworks HDInsight on Hadoop

Finally, software choices: I mentioned the vendors in Part III, but I will do it again so I won’t lose you, the Reader:

    1. Microsoft’s HDInsight:
      • The Cloud version is bliss!! It takes a few minutes to create the Azure Storage Service (free eBook here) and a few more to create the cluster, with Hortonworks Hadoop running in a Linux VM which you interact with by using PowerShell. You NEED to learn PowerShell, especially with the release of Windows 10, where Microsoft is steering administrators away from CMD.exe and toward PowerShell! Back to HDInsight: With Cloud Storage as a Service, you can spin up a Hortonworks image, where HDFS is actually Azure Blob storage, run your scenario, move the results back to the Storage you created earlier, and spin the Hortonworks cluster down. Total Cost: very little, because all will work, and if it doesn’t you have someone (Microsoft) to yell at. Cloudera offers similar functionality (discussed later). One more KA feature: There is an Emulator provided by Microsoft which you can try locally before shipping your work off to Azure. IF you can install it. Seriously, let me know if it installs for you!!!
      • If you try the Hortonworks Data Platform Sandbox (VM) installation, you get an error in a log file in a directory that does not exist, so you don’t actually know what happened. But wait, there is more: You can install the Hortonworks software in a VM running Windows, which will run a lot of JVM processes, one for each feature, with APIs made available to Windows and to the utilities supplied by Microsoft, one of which is Data Visualization, a very important aspect of any Hadoop installation. How is this done? Just add in PowerPivot, PowerQuery, or any SQL BI utility, install the Hive ODBC driver, and you can connect to the data directly and run some awesome visualizations long after the Hadoop cluster is gone.
      • This is how Hortonworks and Microsoft bring back the alienated Data Analysts crowd. Come on: How many Analysts do you know who do not know Excel?!! Others do this too; however, their tools are not Office!!

    2. Cloudera: Provides four flavors: straight-up cloud, KVM, VMware, and VirtualBox. It will install smoothly (at least the VirtualBox image that I checked out did), and the samples will work *sometimes*! As for the ones which don’t work, TAKE my advice and leave them alone. You can debug all you want, but there are so many moving parts that unless you luck out, you will get nowhere. Keep. On. Trucking. Since I have a KA Windows Server, I was able to convert the VirtualBox image to Hyper-V and it worked fine. I was actually able to do the same for all vendors.

    3. Hortonworks: Also provides the same flavors as Cloudera; however, as I mentioned earlier, go with the cloud option if you can, though it’s Azure, for the Microsoft-haters out there!! Azure gives you a one-month free trial. I used the Hyper-V image because I have a KA Windows server. The ugly thing is that you spin up the image and connect to it via a browser or SSH. Not the prettiest UI. I also tried to get the Windows Data Platform to install and work. After many, and I mean many, hours of trying, nothing worked. Cryptic error messages, bad version control (HDP-2.0.6, 2.2.0, and so many more)! I have to point out that there is a Hortonworks for Azure VM Template which you “Quick Create” as mentioned above, an HDInsight Hadoop on Linux (cloud and Sandbox, with SSH or browser access), and a Hortonworks Data Platform MSI package for Windows which will install on a Windows Server, but the prerequisites are too much and if something breaks, you have nowhere to look first. This Windows distro can also be installed directly on a Windows Server VM and is available from Azure.

    4. MapR: NOTHING worked for me. Period. I tried debugging but felt like there was no critical mass out there to answer questions. I ran out of time. I just moved on…. I do need to mention that MapR has recently joined forces with Amazon and now offers a Hadoop VM service on Amazon, and they offer “decent” virtual academy lessons. Unfortunately, the free videos are very limited.

    5. Amazon EMR: It costs money to try! Unlike Microsoft and the other vendors, who provide you with a “Sandbox”, Amazon does not, and with my super low budget, I did not try it. I will try it soon enough though. Amazon is the pioneer of Cloud Services and a very innovative company which wants to rent out the massive storage and compute it has plenty of. It’s platform independent and offers everything possible. EMR and the entire Hadoop functionality are available as a set of APIs. Developers Rule!!

    6. Finally, the last option is a DIY one, which I also experimented with, straight from the Apache Hadoop site and built on CentOS with the latest Hadoop distro. It was a nightmare. If anything breaks, good luck debugging it or finding the source of the problem. Remember, this is open source and no one is obligated to document anything if they don’t feel like it that day!! You are at the mercy of history (i.e. someone had the problem, solved it and shared it), or of someone in the community who will jump in to help you.

For any project plan you come up with for a Big Data project, add a resource, name him Google or Bing based on the flavor of your implementation and your access to support, and assign him tasks such as Research, Books, and Professor, because he will save your life.

The Last Word:

Remember the graph I drew up with the required skills for a Data Scientist? Well, I made a few modifications and thought I’d share it with you, my friends on LinkedIn, so if you are starting out, you know which are the important areas to start with first (usually the hard ones).

Big Data Skills





Bash Badawi has over 24 years of software development experience and is currently actively looking for a home to contribute his knowledge. Mr. Badawi can be reached on his LinkedIn profile or over email. He welcomes all feedback, questions, and requests for consultation, as he is an avid reader and learner.

Big Data: Vendor Review, Prerequisites and How to Crack it Quickly – Part III

Well, sadly, we have come to the end of a topic which I have tremendously enjoyed (unless of course I change my mind and can think of something compelling for a Part IV). In Part I and Part II, I discussed what Big Data is and what you need to know, among other things. In Part III, I will start the post with a question: What do software engineers/communities/companies do after they ship software? (I say engineers/communities/companies since the lines are so blurred now that we don’t know who is who! You know: open source, services-only, countries backing open source initiatives, etc.) I will provide the answer at the end of the post for suspenseful effect.

The purpose of this post is to tell you what else is out there, the ugly stuff to expect, and how to get started:

  1. Expectation: Don’t expect a lot!! Meaning: DON’T EXPECT THINGS TO WORK right out-of-the-box. You will run into every conceivable problem you can think of, whether it is in a built-in feature or an add-on.
  2. How to get started (it’s very hard, you’ll see what I mean): Cloudera was probably the least problematic and some features actually worked (after some long hours of debugging). Hortonworks was my number 2. The cool thing about Hortonworks is that they offer a Linux and a Windows version. (I think Cloudera is about to offer the same.) The Hortonworks Windows version, called Azure HDInsight, uses Blob storage on Azure instead of HDFS (pretty smart of Microsoft, to keep Windows relevant in the space). Here is a link to the Hortonworks Architecture image below for both stacks.


Why is the Microsoft Platform so important? IMHO, because data visualization is an important part of the overall Hadoop skillset. Presenting the information might make or break the project. So why is Microsoft Office so important here? Because it has some powerful data visualization tools (Power Pivot, Power Query, Excel) which will ensure the buy-in and inclusion of the business users and, most importantly, the data analysts, who like to not just view data but also run queries over Hive ODBC.

Additionally, and this is a huge plus for the Microsoft-Hortonworks partnership, there is the tight integration of Microsoft System Center with Hortonworks’s Hadoop as per the image below (think of hundreds of nodes), link here. Other vendors offer a form of monitoring; however, the UI is nothing like System Center.


There are other players in the field I did not mention such as:


[Image: the other Big Data vendors in the field]

As you can see, the marketplace is way overpopulated, which means it is ripe for acquisitions/consolidation. I bet you that if I republish this post a year from now, the picture above will show just a handful of market agents.


Figure 1: Gartner Research Hype Cycle for Big Data, 2014.

Additionally, based on the Gartner Hype Cycle, I do believe we are somewhere between the Peak and Climbing the Slope. It depends on which sector you examine (e.g. telco is ahead of transportation).

You might ask, how and where do I start? I need to acquire those skills fast to cash-in on this lucrative trend! My suggestion is to start with major Big Data vendors I mentioned in Part I of the series. They all have Sandboxes, or Cloud images for training purposes.

Therefore, start with the Virtual Machines: Cloudera and the Cloudera Sandbox, the Hortonworks on-premise Sandbox, Hortonworks on Azure HDInsight (the cloud startup is a snap to get up and running), Amazon EMR (Elastic MapReduce), MapR, or Do-it-Yourself (DIY) for the brave-hearted (Udemy and Coursera have good video courses on how to get started from scratch. Good luck, mate!).

I have had major pains in all of them. They were so buggy that I deleted and started over for each one of them.

I recommend that you start with a Cloud-based Sandbox in the order mentioned above, then try the On-prem. Now, you can build your own, from scratch, but unless you are a Linux-Java-I-can-solve-everything guy, DON’T! You will be wasting lots of time doing something someone else already did, i.e. built you a nice VirtualBox, VMware or Hyper-V VM and debugged it all. That doesn’t mean you won’t have to debug things yourself. Believe me, you will run into a problem somewhere and have to debug the “ready” image or cloud. If you can recreate the cluster on the cloud, just do it.

The answer to the question: What do software engineers/communities/companies do after they ship software? They make it work: they start on minor releases for whatever did not make the cut, add the tools they used to debug the build itself, and add tools to make it easier for the user to manage and use the system. Tools such as: Hive, Pig, Impala, Sqoop, Cloudera Manager, Mahout, Flume, Storm, Spark, ZooKeeper, Ambari, HBase, Splunk, and some other tools I have yet to come in contact with.


Bash Badawi

Bash Badawi has over 24 years of technology experience, ranging from an internship at NASA to providing IT advisory services in over 30 countries around the globe in Europe, Africa, Asia and the Americas, advising governments to adopt an Open Data Policy. Mr. Badawi can be reached on his LinkedIn profile or over email. I welcome all feedback as I am an avid reader and learner.

Big Data: Vendor Review, Prerequisites and How to Crack it Quickly – Part II

In Part I, I gave an overview of Big Data, what it is, and what (ideal) skills (a lot!) you need to get started, and named some of the key players. In this post, I will cover some of the most important technologies, some examples of Big Data applications, and the architecture which underlies the Hadoop/Big Data Framework.

A Historical Background

This paragraph should have been in the first post. However, it is worth mentioning for those who are like myself, people who want to know how “things came to be” as opposed to “they are just here so deal with it!!”

It all started at Google, with the search for a way to index the web. In a very simplistic view, the engineers at Google aimed to put together (Key, Value) pairs of words and the URLs of the sites containing those words. It sounds simple on paper; however, once you consider the sheer volume of sites, the complexity caused by the volume of data begins to crystallize.

Soon after Google successfully designed the two main components behind such an effort, the Google File System and the MapReduce framework, and had probably moved past them so that they were no longer a trade secret, they published papers on the technology used to index the internet! The uniqueness of Google’s problem lies in the fact that the solution required horizontal scalability (i.e. many nodes doing smaller tasks) as opposed to vertical scale-up (i.e. more and more beefed-up servers with more memory, etc.).

Once Google let the proverbial cat out of the bag, two engineers who were working on a similar problem, Doug Cutting and Mike Cafarella, had a massive a-ha moment and developed Hadoop (in 2005, per Wikipedia), starting with the Hadoop Distributed File System (HDFS) and the MapReduce framework, an implementation which resembled the Google approach. Yahoo! soon hired Cutting and backed the project.

So, what are the Core Hadoop Components?

There are two main components at the core of Hadoop. Although there are many tools available, their purpose is to facilitate the interaction with those two components.

Hadoop Distributed File System (HDFS)

HDFS is not a file system like FAT or NTFS. It’s a layer which sits atop the traditional file system and handles the loading, distribution and replication of files. What you need to know about HDFS is that when you want to add a node to your Hadoop cluster, you have to load HDFS on it so it can participate in the cluster operations. Like the File Allocation Table (FAT), it keeps track of where the pieces of each file live. HDFS splits up big files into chunks and distributes them to the Nodes in a duplicate fashion, i.e. each chunk of data is sent to 2-3 nodes at a time (the default is three copies). Why? Because Hadoop was designed for failure! Sounds strange? I will squeeze in a paragraph or so about Functional Programming, where the Map function closely resembles the functional programming paradigm.
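
Here is a rough sketch, in plain Python, of the “split it up and send each chunk to several nodes” idea. It is a toy under loud assumptions: the function names, node names and tiny block size are made up, and the round-robin placement is my simplification (real HDFS placement is rack-aware and uses far larger blocks).

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: list, replication: int = 3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Round-robin is an assumption made for this sketch, not how HDFS decides.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


blocks = split_into_blocks(b"x" * 1000, block_size=256)   # toy sizes
placement = place_replicas(len(blocks), ["node1", "node2", "node3", "node4"])
for block_id, holders in placement.items():
    print(block_id, holders)
```

Because every block lives on three different nodes, losing any single node never loses data, which is what “designed for failure” buys you.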

Primary and Secondary Name Nodes

As with every file system, you need to know where stuff is located. HDFS implements this by designating a Hadoop server as the primary Name Node, which keeps an inventory of where data is located such that if a node decides to take a vacation, other nodes holding copies of its data will step in and replace it.
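
The inventory idea can be pictured with a toy class like the one below. Everything in it (the class `NameNode`, its methods, the block and node names) is hypothetical and only illustrates the bookkeeping; it is not the actual Hadoop implementation or API.

```python
class NameNode:
    """Toy inventory of which data nodes hold a copy of which block."""

    def __init__(self):
        self.block_locations = {}   # block id -> set of node names

    def register(self, block_id, node):
        """Record that `node` holds a copy of `block_id`."""
        self.block_locations.setdefault(block_id, set()).add(node)

    def locate(self, block_id):
        """Return the nodes that currently hold a copy of the block."""
        return sorted(self.block_locations.get(block_id, set()))

    def node_failed(self, node):
        """Forget a dead node; return the blocks that lost a copy."""
        affected = []
        for block_id, holders in self.block_locations.items():
            if node in holders:
                holders.discard(node)
                affected.append(block_id)
        return affected


name_node = NameNode()
sample = {"blk_1": ["node1", "node2", "node3"],
          "blk_2": ["node2", "node3", "node4"]}
for block_id, holders in sample.items():
    for node in holders:
        name_node.register(block_id, node)

lost = name_node.node_failed("node1")        # node1 "takes a vacation"
print(lost, name_node.locate("blk_1"))
```

After the failure, the inventory still points readers at the surviving copies, and the list of affected blocks tells the cluster which ones need a fresh copy made elsewhere.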


MapReduce is the essence of a Hadoop system. It is the business logic which translates statistical models or an execution plan into what we set out to do: answer questions. The logic is usually encapsulated in a Java application with, you guessed it, a Map and a Reduce function specifying what data types to expect, what to do with outliers and missing data, and so forth.

Here is a snapshot of what a Word Count Java application would look like. In a nutshell, the Mapper tokenizes the text and counts each word for a small chunk of the data. The Reducer takes the output of the Map, which is many (Key, Value) pairs in the form (word, count), and aggregates the counts. A part of the App is to configure the job and submit it to Hadoop, which is something you can find out on your own.
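
For readers who just want the shape of the logic, here is a minimal Word Count sketch in plain Python rather than Java. It is not the Hadoop API: the function names `mapper` and `reducer` and the two sample chunks are assumptions for illustration, and the job configuration and submission are left out entirely.

```python
from collections import defaultdict


def mapper(chunk):
    """Tokenize one chunk of text and emit a (word, 1) pair per word."""
    for word in chunk.lower().split():
        yield (word.strip(".,;:!?\"'"), 1)


def reducer(pairs):
    """Aggregate the (word, count) pairs coming out of all the mappers."""
    totals = defaultdict(int)
    for word, count in pairs:
        if word:                      # skip tokens that were only punctuation
            totals[word] += count
    return dict(totals)


chunks = ["O Romeo, Romeo! wherefore art thou Romeo?",
          "Deny thy father and refuse thy name;"]

# In Hadoop each chunk would go to a different mapper; here we simply loop.
mapped = [pair for chunk in chunks for pair in mapper(chunk)]
counts = reducer(mapped)
print(counts["romeo"], counts["thy"])
```

The point of the split is that each mapper only ever sees its own small chunk, so the mappers can run on different nodes without talking to each other.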

Business Use

Obviously, businesses do not build such massive systems without ample justification of how they will help solve current problems (bottom and top line) and plan solutions for the mid- to long-term issues they face. They aim to use the massive parallel compute power at their disposal to answer questions about customer or system behavior using large amounts of structured, semi-structured and unstructured data. Questions like a cable or mobile phone company asking itself why people are not renewing their contracts (called the churn rate), or a large agricultural co-op deciding on the amount of a certain fruit or vegetable to grow for the next season (based on data other than the future contract prices from the stock exchange), or a local police force trying to understand why the crime rate is high at a certain location, and so forth.

So, Now What? What’s the Big Deal?

So now we have this great framework being used by Internet giants such as Facebook and so on. How do people like you and me, who work for consulting companies, position Big Data and its benefits? The answer comes from one of my favorite researchers/professors at MIT, whose work I studied with zeal while in Economics school. Erik Brynjolfsson studied the performance of companies that excel at data-driven decision-making and compared it with the performance of other firms. And guess what the outcome was? The study found that productivity levels were as much as 6 percent higher at such firms than at companies that did not emphasize using data to make decisions, which did not surprise me at all.

So as a business executive, entrepreneur, consulting firm, or software provider, there is a huge cost to doing nothing! Case closed!! I would really like to hear your thoughts on what you would like to see in Part III. Right now, I am planning to highlight some of the technologies which make the use of Hadoop easier, especially for Data Analysts; however, I am open to ideas. Hope you enjoyed the post.


Bash Badawi

Big Data: Vendor Review, Prerequisites and How to Crack it Quickly – Part I

This might be a helpful post for those who are interested in the latest developments in Big Data technologies, Hadoop being front and center among other tools and utilities.

For the first time in my Cloud experience, I can honestly say that Big Data and Hadoop are the killer apps for cloud computing. A clear delineation between doing it yourself and getting it via a Hadoop distro, with a clear cost-benefit analysis, is something I have rarely found at such a granular level, especially if you are looking to get training or do a quick proof-of-concept. My curiosity drove me to test-drive the five main players who had accessible technologies for eval purposes:

  1. Cloudera,
  2. Hortonworks (a Yahoo! spin-off),
  3. MapR,
  4. Microsoft’s HDInsight, and
  5. The final distro is a DIY one which I also experimented with, straight from the Apache Hadoop site, built on CentOS and the latest Hadoop distro.

There are major differences between 0.1.1 and 0.2.2 of Hadoop. There are also major differences in supporting tools based on your vendor of choice. Although this article is not a bake-off or a showdown, I will cover the good, the bad and the ugly. This might be split into a series of posts, and it will give you the fruits of my labor whether you are an IT Manager, an Architect, a Developer, a Data Analyst or a Statistician.

What is Big Data?

I won’t spend much time explaining obvious things you can Google or YouTube. In very simple terms, Big Data is what your SQL Server cannot handle no matter how much disk space, CPU and RAM you throw at it, and even if it could, it would be a very costly proposition. The KEY instead is scaling out versus scaling up: you scale out using commodity hardware instead of beefing up your SQL box with heaps of memory, cores, etc.

Why Scaling Out Versus Scaling Up?

Databases were designed for sequential reads and writes, not for random reads and writes. The reason is that they were heavily constrained by the hardware architecture of storage devices, which relied on needles and spindles spinning fast enough to fetch the data we need. However, with the advent of SSD drives, this no longer applies (Uncle Bob has an excellent talk on this if you are interested). With support for SSD drives now ubiquitous, the cost of random reads and writes is no longer a factor in fetching data, and neither are the performance hits of the past!

Data Formats

Big Data comes in different flavors: structured, unstructured, dirty (as in missing elements) and volatile, and it comes at you like water from a firehose. Think of Twitter or Facebook or the thousands of Kindles which log every click, page flip, etc., as examples where data is constantly updated on a massive scale.

These characteristics are called the 4 V’s: Volume, Velocity, Veracity and Value of Data. Your job is to create order out of the data chaos, and you do so with a number of predefined algorithms, and new ones.

So, how do you get a handle on such massive amounts of data? Enter MapReduce. MapReduce is an algorithm, simple or complex depending on the job, whose primary function is to “map” input such as words into (Key, Value) pairs, sort and shuffle the output based on the key, and aggregate the values in the Reduce routine. Confused? I thought so!

The best way to explain it is through an example: Word Count. Word Count is the “Hello World” example that Hadoop newbies have to do in order to test their installation, get an overview of how Hadoop works, and see how Hadoop breaks down the problem into small batch jobs which are submitted, executed, and outputted.
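To make the three stages concrete, here is a minimal sketch of Word Count in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase) and the two sample lines are my own assumptions, chosen only to mirror the Map, sort-and-shuffle, and Reduce steps described above. Everything runs in a single process, whereas Hadoop would spread the same steps across many machines.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Group the emitted values by key and return them in sorted key order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(grouped):
    """Sum the list of values collected for each key."""
    return {key: sum(values) for key, values in grouped.items()}

# Hypothetical input: two short lines standing in for a large corpus.
lines = ["to be or not to be", "that is the question"]

pairs = [pair for line in lines for pair in map_phase(line)]
word_counts = reduce_phase(shuffle_phase(pairs))
print(word_counts)
```

The point of the sketch is the shape of the data: the mapper knows nothing about totals, the shuffle only groups by key, and the reducer only sums. That separation is what lets a framework run each stage in parallel.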

An Even Simpler illustration of MapReduce

Imagine that you have a black box. You feed the box a corpus of documents, say one of Shakespeare’s masterpieces. The Map portion takes the document and splits it into manageable chunks, say 1,000 small pieces which make up the entire book. It then counts the words per chunk and outputs files with the words and their counts for each “chunk”.

Enter a Reduce

A Reducer takes the output of the Mapper and further aggregates the word counts from each chunk, repeating the process until we have just one document which holds the count of each word. Voilà! We get a count for every word written in the book.
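The black-box picture can be sketched in a few lines of Python as well. This is only an illustration under my own assumptions: the helper names (split_into_chunks, count_chunk, merge_counts), the chunk size of 4 words, and the sample sentence are all hypothetical, and a real Hadoop job would split files into blocks rather than word lists.

```python
from collections import Counter
from functools import reduce

def split_into_chunks(words, chunk_size):
    """Cut the full word list into consecutive chunks of at most chunk_size words."""
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]

def count_chunk(chunk):
    """Map step: count the words inside a single chunk."""
    return Counter(chunk)

def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right

# Hypothetical stand-in for the book.
text = "the quick brown fox jumps over the lazy dog the end"
words = text.split()

chunks = split_into_chunks(words, 4)
partial_counts = [count_chunk(chunk) for chunk in chunks]

# Keep merging partial counts until only one result is left.
total_counts = reduce(merge_counts, partial_counts, Counter())
print(total_counts.most_common(3))
```

Because merging two partial counts gives another partial count, the merge can be applied in any order, which is why the reduce step can be repeated until a single document remains.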

What Sort of Skills are needed to Get Started?

Big Data is a team sport. You have to have multiple skills on the team. Rachel Schutt and Cathy O’Neil, in their great book Doing Data Science: Straight Talk from the Frontline, list multiple skills such as:

  1. Computer Science
  2. Math
  3. Statistics and Statistical Modeling
  4. Data Integration (I added this one since in most of my experiences you had to pull data from other sources including relational databases)
  5. Machine Learning
  6. Data Visualization


In Part II of the post, I will delve deeper into each distro and discuss the tools, pros and cons of each and finally where the future of this amazing technology is headed.

How Applied Software Economics Can Solve Technology Providers’ Problems

In an earlier post, I talked about the role of Economics in the Software Industry, or lack thereof, and how I embarked on a journey to pursue a Master’s Degree in Business Economics to find out whether or not economics can be applied to software technology to solve some of the problems ailing the industry.

I pursued my economics degree in earnest to investigate whether or not there are potential benefits/explanations in applying an interdisciplinary approach, and whether the application of certain economic concepts could positively impact the software industry. The software industry is riddled with many infamous stories of epic failures (!!), monopolistic behavior, and collusion (see this amazing post on a secret non-poaching pact among Silicon Valley’s titans of technology).

None of the companies I worked for employed anyone with an Economics Degree (even if they did, that was not their primary job), just simply Finance! Almost the entire middle management layer consisted of financial planners with a massive preoccupation with the 30/60/90 day budget/revenue planning/forecast, revenue attainment, etc., cycle which occurs four times a year, and they were busy making calls to folks in the field asking whether they would close the deal or not! You know the type, and you almost always feel that while you are doing actual valuable work, they are just counting the beans you are bringing in!!

Anyway, to offset this massive drain of corporate resources by the “Finance” layers, groups started sprouting up within the boundaries of the corporation, sometimes through a recommendation from an outside firm (a Management Consulting firm), with a primary focus on Corporate Strategy, or mid to long-term planning, or sometimes the highly misnamed R&D!! The cost of not doing so proved fatal to many companies who failed to “plan” to compete, or simply neglected to have a compete strategy, as they were blindsided by smaller, more nimble technology startups which overtook them. I bet you can name ten of those tech companies right now that no longer exist. You know, the software darlings of the times with the meteoric rise and speed-of-light fall.

There have been many books, research papers, etc., which study the failure of businesses. I had to read through quite a few in my recent University days (in Economics it’s called “Creative destruction” and many other terms related to Darwin’s theory of evolution). As an example, Apple and Microsoft are a great case study. At some point Apple’s “closed” business model almost cost it its own existence. However, when the PC world became the Wild Wild West of cheap components, buggy software, and so on, Apple’s business model forged ahead with a simple advantage: “more stability and security”.

The economic principles behind both camps, Apple and Microsoft, were at opposite ends of the spectrum. I am talking hardware-wise. Apple was a “locked, proprietary” and “closed” ecosystem, whereas MSFT was “Open”.

Enough theory and back to reality. If you are a business owner, executive, or IT Manager, I hate to break the bad news to you: There is a cost to doing nothing! Please allow me to explain: You built a product or a service, now you are ready to sell it, how do you price it? Do you bundle it with another product? Do you follow the herd of free then premium? Build it, they will come, and then figure out a successful revenue model?! How about lock-in? Can you assure your customers they are not going to be locked in for the rest of the life of your software? The list is very long and I have yet to scratch the surface, and what’s scarier is going at it alone without the aid of any theory which could be applied to address such issues.

One example of applied economics used to address project failure is an approach that treats software functionality completion as stock options, where the maturity date is the completion date for that particular functionality (I borrowed this one from Barry Boehm). Here, the completion of, say, a report accounts for X number of options agreed upon prior to the project start date. Missed the date? Your options are underwater and it’s time to focus on not missing the next cycle.

Ideas such as this one, grounded in economic theory, can greatly improve performance, motivate people, and offer extremely creative ways to solve chronic problems (how many times have you heard about projects missing deadlines, with cost overruns, etc.?). Just food for thought!