
How the Azure Analytics (R)evolution will Alter the Hard Truth about Data Science

If you are in technical sales or trying to break into Big Data or Data Science, this post is definitely a must-read for you. It is based on real-life examples, research, and analysis from a practitioner's standpoint.

There is a sad but realistic truth about Data Science (DS henceforth) and Big Data (BD). There is an unfortunate dichotomy in this space: those who are on the inside track (the Insiders), and those who are trying to get in (the Outsiders). In the world of the Insiders there is a persistent message, one that has garnered support amongst employers, which amounts to taunting the Outsiders and raising the entry barrier. How? By raising the number of skills required. I came up with a list of must-have skills for a BD or DS role based on my own analytics exercise on hundreds of DS/BD job descriptions I gathered through web scraping. So guess how many skills are required? Twenty-two! Setting the bar at twenty-two skills, and counting, is ridiculously unrealistic!

 Enter Microsoft!!

The would-be DS/BD game changer happened quietly, with not much coverage in the tech media world: Microsoft's acquisition of Revolution Analytics. So, what does the Microsoft acquisition have to do with Outsiders wanting to get into Data Science? A lot!! Let me explain.

The one fundamental skill which is a must-have in DS is Statistics! And what does the DS community use as a statistics programming language? R! Here is where the story gets interesting.

A branch of DS is Machine Learning, or Predictive Analytics. This is the true value-add for any BD initiative: being able to make data-driven decisions about future strategy.

To do this in R, the very popular analytics language, you need to go through extensive training and become an expert before you can even attempt to get close to solving real problems. R feels archaic. It is a reincarnation of the S and S-PLUS languages, and its roots are roughly forty years old.

The great news is that Microsoft just acquired Revolution Analytics (hence the post title). Revolution R (a product of Revolution Analytics) is known for enabling parallel processing in R, resulting in massive performance gains.

Today, R is supported in Azure Machine Learning in its current form. So why does this matter? The reason is very simple. There is talk that Microsoft will probably take R, modernize it, make it a first-class language, add it to Visual Studio with IntelliSense support, and allow everyone to develop Azure ML solutions in Visual Studio, ready to be published directly to Azure. The net effect:

  • It will be nothing short of a coup for the Outsiders in the DS/BD field, who will be able to develop solutions without paying the massive R learning-curve cost.
  • Azure ML already provides a lot of the R functionality out-of-the-box today, so this is a natural extension of the existing functionality.
  • Azure ML makes it very simple to share code, models, and solutions to common problems, thus:
  • I will get my solutions to market roughly 10x faster.
  • I can profit from my work if I choose to publish it in the Azure Gallery.
  • It will change the market dynamics by easing the short supply of BD/DS talent.
  • It will unleash the genius of statisticians and business users with minimal programming experience, letting them conduct their own experiments.
  • Coupled with HDInsight, which has access to the analytics APIs, this is the most efficient BD solution on the market today.

 

What's more: keep an eye on this space. As I mentioned, if you are in Microsoft technical sales, this will be your ticket to moving large enterprises onto Azure with a strategy rather than just a plain tactical need!! If you are evaluating DS technology today, you probably know very little about Azure ML and the functionality it provides. I would highly encourage you to evaluate Azure ML before heading to Cloudera, MapR, or any other Hadoop vendor. You will not be disappointed!

 

About Me:

I am a freelance consultant with over 24 years of experience in IT, strategy, and economics. I specialize in cloud, data architecture, DS, Machine Learning, and corporate strategy, and I provide architectural consulting, training, technical research, and data-driven decision-making solutions based on economics-grounded statistical methods and scientific frameworks.

The 22-Skill List for Data Science:

  1. R Programming
  2. Getting and Cleaning Data
  3. Exploratory Data Analysis
  4. Reproducible Research
  5. Statistical Inference
  6. Regression Models
  7. Practical Machine Learning
  8. Developing Data Products
  9. Data Visualization
  10. DBA
  11. Hadoop (including the Azure HDInsight technology stack)
  12. Orchestrate data workflows
  13. Data ingestion/curation using Pig, Hive, Sqoop or other Hadoop tools
  14. Hadoop cluster configuration using Hadoop big data architecture
  15. High-level design using Business Analysis, Microsoft Azure Platform Knowledge, Blob Storage API Knowledge
  16. Blob Storage API Knowledge
  17. Metadata management tools
  18. Model client data
  19. Mapping
  20. Data profiling – Information analyzer/Excel preferred
  21. Decide how data is going to be used to make decisions, and
  22. Knowledge of both tools and methods from statistics, machine learning, and software engineering, as well as being human and showing persistence

The Big Data Posts: The Last Word….and The Fun Ones!!

Sorry, but I couldn't let go =). Seriously though, I have some more information I felt I had to share with my friends on LinkedIn: information that reads more like "lessons learned" on what NOT to do and what NOT to expect with Hadoop and Big Data from Day 1 onward!

Hello World!!

What is Hadoop's equivalent of "Hello World"? The answer is Word Count. Why so? Because Word Count (of which I showed a code shot of a MapReduce implementation in Part II) is supported by every vendor. In the Hello World tutorial, you download a book from Project Gutenberg or any copyright-expired book from Google Books (Cloudera comes with Shakespeare's Romeo and Juliet in .TXT format), then you "load" the file into HDFS, and Hadoop spits out each word in the book and its count. One of my favorites is an inverted index example, which mimics a search engine: the results give you each word and the names of the books it appears in, or a count of those books; see the sketch below. NOTE: The OS file system is completely different from the Hadoop File System, HDFS.
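To give a flavor of that inverted index, here is a minimal, hypothetical Mapper sketch using the standard Hadoop MapReduce Java API (the class name is mine, not from any vendor tutorial); a matching Reducer would simply list, or count, the distinct book names it receives for each word:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Hypothetical inverted-index Mapper: for every word in a line of text it emits
    // (word, bookName), so the Reducer can collect or count the books per word.
    public class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Name of the book (input file) this split of text came from
            String book = ((FileSplit) context.getInputSplit()).getPath().getName();
            for (String word : value.toString().toLowerCase().split("\\W+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word), new Text(book));
                }
            }
        }
    }

Run against a folder of Gutenberg .TXT files in HDFS, the output pairs each word with the books it appears in.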

Now onto the Fun Stuff

In Part III, I answered the question: "What do developers do after shipping ANY software product?" The answer was: "They develop the interfaces and tools to make it usable!!" Why? Because Version 1 is almost always about entering the market. All the free education and training resources? They are debug tools, workarounds, and persuasion meant to keep those developers on board.

I once heard Bill Gates say, "We [Microsoft] excel at abstracting complexity. Take any task in Excel, for example, go back 2-3 years, and compare the time difference it takes to achieve the same results!" The statement has stuck with me to this day because it is very important. It's about buy-versus-build, in-house IT specialization decisions, IT investment choices, and so on. We continue to debate all of these arguments to this day.

And Hadoop is no different, nor will it be!! There is a race going on right now over who can provide a better abstraction or special-purpose utility layer on top of Hadoop MapReduce, unstructured data, and HDFS, through utilities such as Hive, HBase, Pig, Flume, Sqoop, Ambari, Kafka, Mahout, YARN, Oozie, Storm, ZooKeeper, and Phoenix, just to mention a few. You can find details or summaries of each at http://apache.org. Now, on to the reasons for such progress, which is ironic when the "progress" is about making something work with ease, or work, period. The reasons for the progress:

Reason # 1 – Math, Statistics, and Statistical Modeling

Software providers know how difficult things are for the lay programmer, who probably graduated with an Art History or Psychology major (such graduates permeated the software industry long ago) and cannot write a Logistic Regression model, or build a model from scratch, to save his or her life. The result: build him or her a utility. Remember VB?! The OO community was on the brink of rioting at 1 Microsoft Way, Redmond, WA over a product which had almost nothing to do with OO (well, a little bit); nevertheless, it took off. When I asked my professor in college about it (he was a brainiac, one of the originals on IBM's Deep Blue chess program), he answered: my grandmother can write code, but she cannot optimize it, so stick to the optimization part!!

Reason # 2 – MapReduce is just plain difficult and alienates many data analysts with invaluable domain knowledge

This is why companies are already using technologies like Pig, Hive, Impala, and others, whose purpose is to spare the programmer from having to translate the math solution into a Java MapReduce job! The problem: good luck debugging the generated code, on top of the growing dependence (vendor lock-in) on the major players through support contracts. So bye-bye, free open-source software you touted to your boss; hello, maintenance contracts; and time to revamp the resume! JK... (I did not mention products such as Datameer, RedPoint, or LigaData, and the list is very long.)

Reason # 3 – It’s so Buggy, You Have to Start in the Cloud

This is a no-brainer. Believe me. Don't even think about building an on-premises 100-node cluster to test out the need for Big Data at your company. You might extend from the cloud to on-premises later, but your strategy should be "cloud-it-and-forget-it" to start with.

Honestly, Big Data is the first application I have seen where your mode of execution should be cloud-first and everything else later. So, if someone asks you about your cloud strategy, now you can pretend that you actually have one!! Why am I pushing here? Because I can provision a Hortonworks-based HDInsight Hadoop cluster in a matter of minutes, as described below.

Finally, software choices: I mentioned the vendors in Part III, but I will list them again here so I won't lose you, the reader.

    1. Microsoft’s HDInsight:
      • The Cloud version is bliss!! It takes a few minutes to create the Azure Storage service (free eBook here) and a few more to create the cluster, with Hortonworks Hadoop running in a Linux VM which you interact with using PowerShell. You NEED to learn PowerShell anyway; with Windows 10 and beyond, PowerShell, not the old CMD shell, is where Microsoft is investing! Back to HDInsight: with cloud Storage as a Service, you can spin up a Hortonworks image, where HDFS is actually Azure Blob storage, run your scenario, keep the results in the Storage account you created earlier, and spin the Hortonworks cluster down. Total cost: very little, because everything will work, and if it doesn't, you have someone (Microsoft) to yell at. Cloudera offers similar functionality (discussed later). One more KA feature: there is an Emulator provided by Microsoft which you can try locally before shipping your work off to Azure. IF you can install it. Seriously, let me know if it installs for you!!!
      • If you try the Hortonworks Data Platform Sandbox (VM) installation, you get an error in a log file in a directory that does not exist, so you don't actually know what happened. But wait, there is more: you can install the Hortonworks software in a VM running Windows, which will run a lot of JVM processes for each feature, with APIs made available to Windows and to the utilities supplied by Microsoft. One of those is data visualization, a very important aspect of any Hadoop installation. How is this done? Just add PowerPivot, PowerQuery, or any SQL BI utility, install the Hive ODBC driver, and you can connect to the data directly and run some awesome visualizations long after the Hadoop cluster is gone.
      • This is how Hortonworks and Microsoft bring back the alienated data analyst crowd. Come on: how many analysts do you know who do not know Excel?!! Other vendors do this too; however, their tools are not Office!!

2. Cloudera: Provides four flavors: straight-up cloud, KVM, VMware, and VirtualBox. It will install smoothly (VirtualBox is the flavor I checked out), and the samples will work *sometimes*! As for the ones which don't work, take my advice and leave them alone. You can debug all you want, but there are so many moving parts that unless you luck out, you will get nowhere. Keep. On. Trucking. Since I have a KA Windows Server, I was able to convert the VirtualBox image to Hyper-V and it worked fine. I was actually able to do the same for all vendors.

3. Hortonworks: Also provides the same flavors as Cloudera; however, as I mentioned earlier, go with the cloud option if you can, even if that means Azure for the Microsoft-haters out there!! Azure gives you a one-month free trial. I used the Hyper-V image because I have a KA Windows server. The ugly thing is that you spin up the image and connect to it via a browser or SSH; not the prettiest UI. I also tried to get the Windows Data Platform to install and work. After many, and I mean many, hours of trying, nothing worked: cryptic error messages, bad version control (HDP-2.0.6, 2.2.0, 2.4.2.2, and so many more)! I have to point out that there is a Hortonworks for Azure VM template which you can "Quick Create" as mentioned above, an HDInsight Hadoop-on-Linux option (cloud and Sandbox, with SSH or browser access), and a Hortonworks Data Platform MSI package for Windows which will install on a Windows Server, but the prerequisites are too much and, if something breaks, you have nowhere to look first. This Windows distro can also be installed directly on a Windows Server VM and is available from Azure.

4. MapR: NOTHING worked for me. Period. I tried debugging but felt like there was no critical mass out there to answer questions. I ran out of time and just moved on... I do need to mention that MapR has recently joined forces with Amazon and now offers a Hadoop VM service on Amazon, along with "decent" virtual academy lessons. Unfortunately, the free videos are very limited.

5. Amazon EMR: It costs money to try! Unlike Microsoft and the other vendors, who provide you with a "Sandbox," Amazon does not, and with my super-low budget I did not try it. I will try it soon enough, though. Amazon is the pioneer of cloud services and a very innovative company that wants to rent out the massive storage and compute it has plenty of. It's platform independent and offers everything possible. EMR and the entire Hadoop functionality are available as a set of APIs. Developers rule!!

6. Finally, there is the DIY option, which I also experimented with straight from the Apache Hadoop site, built on CentOS and the latest Hadoop distro. It was a nightmare. If anything breaks, good luck debugging it or finding the source of the problem. Remember, this is open source, and no one is obligated to document anything if they don't feel like it that day!! You are at the mercy of history (i.e., someone had the problem, solved it, and shared it), or of someone in the community who will jump in to help you.

For any Big Data project plan you come up with, add a resource, name him Google or Bing depending on the flavor of your implementation and your access to support, and assign him tasks such as research, books, and professor, because he will save your life.

The Last Word:

Remember the graph I drew up with the required skills for a Data Scientist? Well, I made a few modifications and thought I'd share it with you, my friends on LinkedIn, so if you are starting out, you know which areas are the important ones to start with first (usually the hard ones).

[Image: Big Data skills graph]

 

Cheers,

Bash

 

Bash Badawi has over 24 years of software development experience and is currently actively looking for a home where he can contribute his knowledge. Mr. Badawi can be reached on his LinkedIn profile or over email at techonomist@hotmail.com. He welcomes all feedback, questions, and requests for consultations, as he is an avid reader and learner.

Big Data: Vendor Review, Prerequisites and How to Crack it Quickly – Part III

Well, sadly, we have come to the end of a topic I have tremendously enjoyed (unless, of course, I change my mind and can think of something compelling for a Part 4). In Part I and Part II, I discussed what Big Data is and what you need to know, among other things. I will start Part III with a question: what do software engineers/communities/companies do after they ship software? I say engineers/communities/companies since the lines are so blurred now that we don't know who is who (you know: open source, services-only, countries backing open-source initiatives, etc.). I will provide the answer at the end of the post for a suspenseful effect.

In this post, I will tell you what else is out there, the ugly stuff to expect, and how to get started:

  1. Expectation: Don't expect a lot!! Meaning: DON'T EXPECT THINGS TO WORK right out of the box. You will run into every conceivable problem you can think of, whether with a built-in feature or an add-on.
  2. How to get started (it's very hard, you'll see what I mean): Cloudera was probably the least problematic, and some features actually worked (after some long hours of debugging). Hortonworks was my number 2. The cool thing about Hortonworks is that they offer both a Linux and a Windows version (I think Cloudera is about to offer the same). The Hortonworks-based Azure offering, called HDInsight, uses Blob storage on Azure instead of HDFS (pretty smart of Microsoft to keep Windows relevant in the space). Here is a link to the Hortonworks architecture image below for both stacks.

[Image: Hortonworks cross-platform architecture (Linux and Windows stacks)]

Why is the Microsoft platform so important? IMHO, because data visualization is an important part of the overall Hadoop skill set; presenting the information can make or break the project. And why is Microsoft Office so important here? Because it has some powerful data visualization tools (Power Pivot, Power Query, Excel) which will ensure the buy-in and inclusion of business users and, most importantly, of the data analysts who like to not just view data but also run queries over Hive ODBC.

Additionally, there is a huge plus to the Microsoft-Hortonworks partnership: the tight integration of Microsoft System Center with Hortonworks's Hadoop, as per the image below (think of hundreds of nodes); link here. Other vendors offer a form of monitoring; however, the UI is nothing like System Center.

[Image: Hadoop cluster monitoring in Microsoft System Center]

There are other players in the field I did not mention such as:

[Image: Hadoop usage and vendor landscape]

Image by way of http://thirups.blogspot.com/2013/04/beyond-buzz-big-data-and-apache-hadoop.html and http://2.bp.blogspot.com/ as well.

As you can see, the marketplace is way overpopulated, which means it is ripe for acquisitions and consolidation. I bet you that if I republish this post a year from now, the picture above will show just a handful of market players.

[Image: Gartner Hype Cycle for Big Data, 2014]

Figure 1: Gartner Research Hype Cycle for Big Data, 2014, by way of http://www.Clickinsight.ca. Link is here.

Additionally, based on the Gartner Hype Cycle, I do believe we are somewhere between the Peak of Inflated Expectations and climbing the Slope of Enlightenment. It depends on which sector you examine (e.g., telco is ahead of transportation).

You might ask, how and where do I start? I need to acquire those skills fast to cash-in on this lucrative trend! My suggestion is to start with major Big Data vendors I mentioned in Part I of the series. They all have Sandboxes, or Cloud images for training purposes.

Therefore, start with the virtual machines: Cloudera and the Cloudera Sandbox, the Hortonworks on-premises Sandbox, Hortonworks-based Azure HDInsight (the cloud startup is a snap to get up and running), Amazon EMR (Elastic MapReduce), MapR, or Do-It-Yourself (DIY) for the brave-hearted (Udemy and Coursera have good video courses on how to get started from scratch; good luck, mate!).

I have had major pains in all of them. They were so buggy that I deleted and started over for each one of them.

I recommend that you start with a cloud-based Sandbox in the order mentioned above, then try the on-prem ones. Now, you can build your own from scratch, but unless you are a Linux-Java-I-can-solve-everything guy, DON'T! You will be wasting lots of time doing something someone else already did, i.e., built you a nice VirtualBox, VMware, or Hyper-V VM and debugged it all. That doesn't mean you won't have to debug things yourself. Believe me, you will run into a problem somewhere and have to debug the "ready" image or cloud. If you can recreate the cluster on the cloud, just do it.

The answer to the question "What do software engineers/communities/companies do after they ship software?": they make it work. They start shipping the minor releases that did not make the cut, add the tools they used to debug the build itself, and add tools to make it easier for the user to manage and use the system. Tools such as Hive, Pig, Impala, Sqoop, Cloudera Manager, Mahout, Flume, Storm, Spark, ZooKeeper, Ambari, HBase, Splunk, and some other tools I have yet to come into contact with.

Cheers,

Bash Badawi

Bash Badawi has over 24 years of technology experience, ranging from an internship at NASA to providing IT advisory services in over 30 countries across Europe, Africa, Asia, and the Americas, advising governments on adopting an Open Data policy. Mr. Badawi can be reached on his LinkedIn profile or over email at techonomist@hotmail.com. He welcomes all feedback, as he is an avid reader and learner.

Big Data: Vendor Review, Prerequisites and How to Crack it Quickly – Part II

In Part I, I gave an overview of Big Data, what it is, and the (ideal) skills (a lot!) you need to get started, and I named some of the key players. In this post, I will cover some of the most important technologies, some example Big Data applications, and the architecture which underlies the Hadoop/Big Data framework.

A Historical Background

This paragraph should have been in the first post. However, it is worth mentioning for those who are like myself, people who want to know how “things came to be” as opposed to “they are just here so deal with it!!”

It all started at Google, in the search for a way to index the web. In a very simplistic view, the engineers at Google aimed to put together (Key, Value) pairs of words and the URLs of the sites containing those words. It sounds simple on paper; however, once you consider the sheer volume of sites, the complexity caused by the volume of data begins to crystallize.

Soon after Google successfully designed the two main components behind such an effort, the Google File System and the MapReduce framework, and had moved far enough past them that they were probably no longer a trade secret, they published papers on the technology used to index the internet! The uniqueness of Google's problem lies in the fact that the solution required horizontal scalability (i.e., many nodes doing smaller tasks) as opposed to vertical scale-up (i.e., more and more beefed-up servers with more memory, etc.).

Once Google let the proverbial cat out of the bag, two guys working on a similar problem (Doug Cutting and Mike Cafarella, in 2005, per Wikipedia), whose work was soon backed by Yahoo!, had a massive a-ha moment and developed Hadoop, starting with the Hadoop Distributed File System (HDFS) and the MapReduce framework, and began implementing what resembled the Google approach.

So, what are the Core Hadoop Components?

There are two main components at the core of Hadoop. Although there are many tools available, their purpose is to facilitate interaction with those two components.

Hadoop Distributed File System (HDFS)

HDFS is not a file system like FAT or NTFS. It's a layer which sits atop the traditional file system and handles the loading, distribution, and replication of files. What you need to know about HDFS is that when you want to add a node to your Hadoop cluster, you have to install HDFS on it so it can participate in cluster operations. In that sense it plays a role similar to the File Allocation Table (FAT). HDFS splits up big files and distributes the chunks to the nodes in a duplicated fashion, i.e., a chunk of data is sent to 2-3 nodes at a time. Why? Because Hadoop was designed for failure! Sounds strange? I may squeeze in a paragraph or so about functional programming later, since the Map function closely resembles the functional programming paradigm.
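To make the "load the file into HDFS" step from the Word Count tutorial concrete, here is a minimal sketch using Hadoop's Java FileSystem API; the class name and file paths are placeholders I made up, and the command-line equivalent is the hdfs dfs -put command:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Copies a local book into HDFS, where the framework splits it into blocks
    // and replicates them across data nodes. Paths are for illustration only.
    public class LoadBookIntoHdfs {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up fs.defaultFS from core-site.xml
            FileSystem fs = FileSystem.get(conf);
            fs.copyFromLocalFile(new Path("/tmp/romeo-and-juliet.txt"),
                                 new Path("/user/hadoop/input/romeo-and-juliet.txt"));
            fs.close();
        }
    }

Once the file is in HDFS, the replication described above happens automatically; your MapReduce job then simply points at the HDFS path.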

Primary and Secondary Name Nodes

As with every file system, you need to know where stuff is located. HDFS implements this by designating one Hadoop server as the primary Name Node, which keeps an inventory of where the data blocks are located, such that if a data node decides to take a vacation, the nodes holding the replicas step in and replace it. (The secondary Name Node, despite its name, mainly checkpoints that inventory rather than acting as a hot standby.)

MapReduce

MapReduce is the essence of a Hadoop system. It is the business logic that translates a statistical model or execution plan into what we set out to do: answer questions. The logic is usually encapsulated in a Java application with, you guessed it, a Map and a Reduce function specifying what data types to expect, what to do with outliers and missing data, and so forth.

Here is a snapshot of what a Word Count Java application would look like. In a nutshell, the Mapper tokenizes each word and counts it for a small chunk of the data. The Reducer takes the output of the Map, which is many (Key, Value) pairs in the form (word, count), and aggregates the counts. Part of the app is configuring the job and submitting it to Hadoop, which is something you can find out on your own.
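Since the original screenshot does not reproduce well here, below is a minimal sketch along the lines of the classic Hadoop WordCount example (the job-configuration driver is omitted, as noted above):

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Mapper: tokenizes its chunk of text and emits (word, 1) for every token
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private final static IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reducer: receives (word, [1, 1, ...]) and sums the counts for each word
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                context.write(key, new IntWritable(sum));
            }
        }
    }

The Mapper emits (word, 1) for every token in its chunk; the Reducer receives each word together with all of its 1s and sums them into the final count.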

Business Use

Obviously, businesses do not build such massive systems without ample justification of how they will help solve current problems (bottom and top line) and plan solutions for the mid- to long-term issues they face. They aim to use the massive parallel compute power at their disposal to answer questions about customer or system behavior using large amounts of structured, semi-structured, and unstructured data. Questions like: a cable or mobile phone company asking itself why people are not renewing their contracts (the churn rate); a large agricultural co-op deciding how much of a certain fruit or vegetable to grow next season based on data other than futures prices from the exchange; or a local police force trying to understand why the crime rate is high at a certain location; and so forth.

So, Now What? What’s the Big Deal?

So now we have this great framework being used by Internet giants such as Facebook, while people like you and me, who work for consulting companies, have to position Big Data and its benefits. So what is the big deal? The answer comes from one of my favorite researchers/professors at MIT, whose work I studied with zeal while in economics school. Erik Brynjolfsson studied the performance of companies that excel at data-driven decision-making and compared it with the performance of other firms. And guess what the outcome was? He and his colleagues found that productivity levels were as much as 6 percent higher at such firms than at companies that did not emphasize using data to make decisions, which did not surprise me at all.

So, as a business executive, entrepreneur, consulting firm, or software provider, there is a huge cost to doing nothing! Case closed!! I would really like to hear your thoughts on what you would like to see in Part III. Right now, I am planning to highlight some of the technologies which make the use of Hadoop easier, especially for data analysts; however, I am open to ideas. I hope you enjoyed the post.

Cheers,

Bash Badawi