
How the Azure Analytics (R)evolution will Alter the Hard Truth about Data Science

If you are in technical sales or trying to break into Big Data or Data Science, this post is a must-read for you. It is based on real-life examples, research, and analysis from a practitioner's standpoint.

There is a sad but realistic truth about Data Science (DS henceforth) and Big Data (BD). There is an unfortunate dichotomy in this space: those who are on the inside track (the Insiders), and those who are trying to get in (the Outsiders). In the world of the Insiders there exists a persistent message, one which has garnered support among employers, that amounts to keeping the Outsiders out by raising the entry barrier. How? By raising the number of skills required. I came up with a list of must-have skills for a BD or DS role based on my own analytics exercise on hundreds of DS/BD job descriptions gathered by web scraping. So guess how many skills are required? Twenty-two! Setting the bar at twenty-two skills, and counting, is ridiculously unrealistic!

 Enter Microsoft!!

The would-be DS/BD game changer happened quietly, with not much coverage in the tech media world: Microsoft's acquisition of Revolution Analytics. So, what does the Microsoft acquisition have to do with Outsiders wanting to get into Data Science? A lot!! Let me explain.

The one fundamental skill which is a must-have in DS is Statistics! And what does the DS community use as a statistical programming language? R! Here is where the story gets interesting.

A branch of DS is Machine Learning, or Predictive Analytics. This is the true value-add for any BD initiative: being able to make data-driven decisions about future strategy.

To do this in R, the very popular analytics language, you need to go through training and become an expert before you can even attempt to get close to solving problems. R is archaic. It is a reincarnation of languages called S and S-PLUS, and it is roughly 40 years old.

The great news is that Microsoft just acquired Revolution Analytics (hence the post title). Revolution R (a product of Revolution Analytics) is known for enabling parallel processing in R, resulting in massive performance gains.

Today, R is supported in Azure Machine Learning in its current form. So why does this matter? The reason is very simple. There is talk that Microsoft will probably take R, modernize it, make it a first-class language, add it to Visual Studio with IntelliSense support, and allow everyone to develop Azure ML solutions in Visual Studio, ready to be published directly to Azure. The net effect:

  • It will be nothing short of a coup for the Outsiders in the DS/BD field, who will be able to develop solutions without paying the massive R learning-curve cost.
  • Azure ML already provides a lot of the R functionality out-of-the-box today, so this is a natural extension of the existing functionality.
  • Azure ML makes it very simple to share code, models, and solutions to common problems, thus:
  • I will get my solutions to market roughly 10x faster.
  • I can profit from my work if I choose to publish it in the Azure Gallery.
  • It will change the market dynamics, as it will ease the short supply of BD/DS talent.
  • It will unleash the genius of statisticians and business users with minimal programming experience to conduct their own experiments.
  • Coupled with HDInsight, which has access to the Analytics APIs, this is the more efficient BD solution on the market today.

 

What’s more: keep an eye on this space. As I mentioned, if you are in Microsoft technical sales, this will be your ticket to moving large enterprises onto Azure with a strategy rather than just a plain tactical need!! If you are evaluating DS technology today, you probably know very little about Azure ML and the functionality it provides. I would highly encourage you to evaluate Azure ML before heading to Cloudera, MapR, or any other Hadoop vendor. You will not be disappointed!

 

About Me:

I am a freelance consultant with over 24 years of experience in IT, strategy, and economics. I specialize in Cloud, Data Architecture, DS, Machine Learning, and corporate strategy, and I provide architectural consulting, training, technical research, and data-driven decision-making solutions based on economics-based statistical methods grounded in scientific frameworks.

The 22-Skill List for Data Science:

  1. R Programming
  2. Getting and Cleaning Data
  3. Exploratory Data Analysis
  4. Reproducible Research
  5. Statistical Inference
  6. Regression Models
  7. Practical Machine Learning
  8. Developing Data Products
  9. Data Visualization
  10. DBA
  11. Hadoop (including the Azure HDInsight technology stack)
  12. Orchestrate data workflows
  13. Data ingestion/curation using Pig, Hive, Sqoop or other Hadoop tools
  14. Hadoop cluster configuration using Hadoop big data architecture
  15. High-level design using Business Analysis, Microsoft Azure Platform Knowledge, Blob Storage API Knowledge
  16. Blob Storage API Knowledge
  17. Metadata management tools
  18. Model client data
  19. Mapping
  20. Data profiling – Information analyzer/Excel preferred
  21. Decide how data is going to be used to make decisions, and
  22. Knowledge of both tools and methods from statistics, machine learning, and software engineering, as well as being human and showing persistence

The Big Data Posts: The Last Word….and The Fun Ones!!

Sorry, but I couldn’t let go =). Seriously though, I have some more information I feel I have to share with my friends on LinkedIn. Information that reads more like “lessons learned” on what NOT to do and what NOT to expect with Hadoop and Big Data from Day 1 forward!

Hello World!!

What is Hadoop’s equivalent of “Hello World”? The answer is Word Count. Why so? Because Word Count (I showed a code shot of a MapReduce Word Count in Part II) is supported by every vendor. In the Hello World tutorial, you download a book from Project Gutenberg or any copyright-expired book from Google Books (Cloudera comes with Shakespeare’s Romeo and Juliet in .TXT format), then you “load” the file into HDFS, and Hadoop will spit out each word in the book and its count. One of my favorites is an inverted index example which mimics a search engine; the results give you each word and the names of the books it appears in, or the count of the books it appears in. NOTE: The OS file system is completely different from the Hadoop file system, HDFS.
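For the curious, here is a minimal sketch of what the Mapper and Reducer for that inverted-index exercise might look like in Java. This is not the exact tutorial code, and the class names are my own, but it captures the emit-and-collect idea of (word, book) pairs:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

  // Mapper: for every token in the chunk, emit (word, bookFileName).
  public static class IndexMapper extends Mapper<Object, Text, Text, Text> {
    private final Text word = new Text();
    private final Text book = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // The name of the .TXT file (the book) this input split came from.
      String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
      book.set(fileName);
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken().toLowerCase());
        context.write(word, book);
      }
    }
  }

  // Reducer: collect the distinct books each word appears in.
  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      Set<String> books = new HashSet<>();
      for (Text b : values) {
        books.add(b.toString());
      }
      // Output: the word, how many books it appears in, and which ones.
      context.write(key, new Text(books.size() + " book(s): " + String.join(", ", books)));
    }
  }
}
```

The job driver (input/output paths, submitting to the cluster) is the same boilerplate as in the Word Count example from Part II.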

Now onto the Fun Stuff

In Part III, I answered the question: “What do developers do after shipping ANY software product?” The answer was: “They develop the interfaces and tools to make it usable!!” Why? Because Version 1 is almost always about entering the market. All the free education and training resources? They are debugging aids, workarounds, and persuasion to try to keep those developers around.

I once heard Bill Gates say, “We [Microsoft] excel at abstracting complexity. Take any task in Excel, for example, go back 2-3 years, and compare the time difference it takes to achieve the same results!” The statement has stuck with me to this day because it is very important. It’s about buy-versus-build, in-house IT specialization decisions, IT investment choices, and so on. We continue to debate all of these arguments to this day.

And Hadoop is no different and will not be!! There is a race going on right now over who can provide a better abstraction or a more special-purpose utility layer on top of Hadoop MapReduce, unstructured data, and HDFS, through utilities such as Hive, HBase, Pig, Flume, Sqoop, Ambari, Kafka, Mahout, YARN, Oozie, Storm, ZooKeeper, and Phoenix, just to mention a few. You can find details or summaries of each at http://apache.org. Now, for the reasons behind all this progress, which is ironic when the “progress” is needed just to make something work with ease, or work, period. The reasons for the progress:

Reason # 1 – Math, Statistics, and Statistical Modeling

Software providers know how difficult it is for the lay programmer, who probably graduated with an Art History or Psychology major (such folks permeated the software industry long ago) and cannot write a Logistic Regression model, or build a model from scratch, to save his or her life. The result: build him or her a utility. Remember VB?! The OO community was on the brink of rioting at 1 Microsoft Way, Redmond, WA over a product which had little to do with OO (well, a little bit); however, it did take off. When I asked my professor in college about it (he was a brainiac, one of the originals on IBM’s Deep Blue chess program), he answered: my grandmother can write code, but she cannot optimize it, so stick to the optimization part!!

Reason # 2 – MapReduce is just plain difficult and alienates many data analysts with invaluable domain knowledge

This is why companies are already using technologies like Pig, Hive, Impala, and other software whose purpose is to eliminate the programmer having to translate the math solution into a Java MapReduce job! The problem: good luck debugging the generated code, in addition to more dependence (vendor lock-in) on the major players for support contracts. So bye-bye free open-source software you touted to your boss, hello maintenance contracts, and time to revamp the resume! JK... (I did not mention products such as Datameer or RedPoint or LigaData; the list is very long.)

Reason # 3 – It’s so Buggy, You Have to Start in the Cloud

This is a no-brainer. Believe me. Don’t even think about building an on-premises 100-node cluster just to test out the need for Big Data at your company. You might later extend from the cloud to on-premises, but your strategy should be “Cloud-it-and-forget-it” to start with.

Honestly, Big Data is the first application I have seen where your mode of execution should be Cloud-First and everything else later. So, if someone asks you about your cloud strategy, now you can pretend that you actually have one!! Why am I pushing here? Because I can provision a Hortonworks-based HDInsight Hadoop cluster in minutes, as described below.

Finally, software choices: I mentioned the vendors in Part III, but I will list them again so I won’t lose you, the Reader.

    1. Microsoft’s HDInsight:
      • However, the Cloud version is bliss!! A few minutes to create the Azure Storage Service (free eBook here), and a few more to create the cluster, with Hortonworks Hadoop running in a Linux VM which you interact with using PowerShell. You NEED to learn PowerShell anyway; with the release of Windows 10, it is clearly taking over from CMD! Back to HDInsight: with Cloud Storage as a Service, you can spin up a Hortonworks image, where HDFS is actually Azure Blob storage, run your scenario, move the results back to the Storage account you created earlier, and spin the Hortonworks cluster down. Total cost: very little, because it all just works, and if it doesn’t, you have someone (Microsoft) to yell at. Cloudera offers similar functionality (discussed later). One more KA feature: there is an Emulator provided by Microsoft which you can try locally before shipping your work off to Azure. IF you can install it. Seriously, let me know if it installs for you!!!
      • If you try the Hortonworks Data Platform Sandbox installation (a VM), you get an error in a log file in a directory that does not exist, so you don’t actually know what happened. But wait, there is more: you can install the Hortonworks software in a VM running Windows, which will run a lot of JVM processes for each feature, with APIs made available to Windows and utilities supplied by Microsoft, one of which is Data Visualization, a very important aspect of any Hadoop installation. How is this done? Just add PowerPivot, Power Query, or any SQL BI utility, install the Hive ODBC driver, and you can connect to the data directly and run some awesome visualizations long after the Hadoop cluster is gone.
      • This is how Hortonworks and Microsoft bring back the alienated Data Analyst crowd. Come on: how many Analysts do you know who do not know Excel?!! Other vendors do this too; however, their tools are not Office!!

2. Cloudera: Provides four flavors: straight-up cloud, KVM, VMware, and VirtualBox. It will install smoothly (the VirtualBox one is what I checked out), and the samples will work *sometimes*! For the ones which don’t work, take my advice and leave them alone. You can debug all you want, but there are so many moving parts that unless you luck out, keep. on. trucking. Since I have a KA Windows Server, I was able to convert the VirtualBox image to Hyper-V and it worked fine. I was actually able to do the same for all the vendors.

3. Hortonworks: Also provides the same flavors as Cloudera; however, as I mentioned earlier, go with the cloud option if you can, even though it’s Azure, for the Microsoft-haters out there!! Azure gives you a one-month free trial. I used the Hyper-V image because I have a KA Windows Server. The ugly thing is that you spin up the image and connect to it via a browser or SSH. Not the prettiest UI. I also tried to get the Windows Data Platform to install and work. After many, and I mean many, hours of trying, nothing worked. Cryptic error messages, bad version control: HDP-2.0.6, 2.2.0, 2.4.2.2, and so many more! I have to point out that there is a Hortonworks for Azure VM template which you can “Quick Create” as mentioned above, an HDInsight Hadoop on Linux option (cloud and Sandbox, with SSH or browser access), and a Hortonworks Data Platform MSI package for Windows which will install on a Windows Server, but the prerequisites are too much, and if something breaks, you have nowhere to look first. The Windows distro can also be installed directly on a Windows Server VM and is available from Azure.

4. MapR: NOTHING worked for me. Period. I tried debugging but felt like there was no critical mass out there to answer questions. I ran out of time and just moved on.... I do need to mention that MapR has recently joined forces with Amazon and now offers a Hadoop VM service on Amazon, and they offer “decent” virtual academy lessons. Unfortunately, the free videos are very limited.

5. Amazon EMR: It costs money to try! Unlike Microsoft and the other vendors, who provide you with a “Sandbox”, Amazon does not, and with my super low budget, I did not try it. I will try it soon enough, though. Amazon is the pioneer of Cloud Services and a very innovative company that wants to rent out the massive data and compute capacity it has plenty of. It’s platform independent and offers everything possible. EMR and the entire Hadoop functionality are available as a set of APIs. Developers rule!!

6. Finally, there is the DIY option, which I also experimented with, straight from the Apache Hadoop site, built on CentOS and the latest Hadoop distro. It was a nightmare. If anything breaks, good luck debugging it or finding the source of the problem. Remember, this is open source, and no one is obligated to document anything if they don’t feel like it that day!! You are at the mercy of history (i.e. someone had the problem, solved it, and shared it) or of someone in the community who will jump in to help you.

For any Big Data project plan you come up with, add a resource and name him Google or Bing, depending on the flavor of your implementation and your access to support, and assign him tasks such as research, books, and professor, because he will save your life.

The Last Word:

Remember the graph I drew up with the required skills for a Data Scientist? Well, I made a few modifications and thought I’d share it with you, my friends on LinkedIn, so that if you are starting out, you know which areas are the important ones to start with first (usually the hard ones).

Big Data Skills

 

Cheers,

Bash

 

Bash Badawi has over 24 years of software development experience and is currently actively looking for a home where he can contribute his knowledge. Mr. Badawi can be reached on his LinkedIn profile or over email at techonomist@hotmail.com. He welcomes all feedback, questions, and requests for consultation, as he is an avid reader and learner.

Big Data: Vendor Review, Prerequisites and How to Crack it Quickly – Part II

In Part I, I gave an overview of Big Data, what it is, and what the (ideal) skills are (a lot!) that you need to get started, and I named some of the key players. In this post, I will cover some of the most important technologies, some example Big Data applications, and the architecture which underlies the Hadoop/Big Data framework.

A Historical Background

This paragraph should have been in the first post. However, it is worth mentioning for those who are like myself, people who want to know how “things came to be” as opposed to “they are just here so deal with it!!”

It all started at Google, while searching for a way to index the web. In a very simplistic view, the engineers at Google aimed to put together (Key, Value) pairs of words and the URLs of the sites containing those words. It sounds simple on paper; however, once you consider the sheer volume of sites, the complexity caused by the volume of data begins to crystallize.

Soon after Google successfully designed the two main components behind such an effort, the Google File System and the MapReduce framework, and had moved past them so that they were probably no longer a trade secret, it published papers on the technology used to index the internet! The uniqueness of Google’s problem lies in the fact that the solution required horizontal scaling out (i.e. many nodes doing smaller tasks) as opposed to vertical scaling up (i.e. more and more beefed-up servers with more memory, etc.).

Once Google let the proverbial cat out of the bag, two guys at Yahoo! who were working on a similar problem (Doug Cutting and Mike Cafarella, in 2005, per Wikipedia) had a massive a-ha moment and developed Hadoop, starting with the Hadoop Distributed File System (HDFS) and the MapReduce framework, and began implementing what resembled the Google approach.

So, what are the Core Hadoop Components?

There are two main components at the core of Hadoop. Although there are many tools available, their purpose is to facilitate interaction with those two components.

Hadoop Distributed File System (HDFS)

HDFS is not a file system like FAT or NTFS. It’s a layer which sits atop the traditional file system and handles loading, distributing, and replicating files. What you need to know about HDFS is that when you want to add a node to your Hadoop cluster, you have to install HDFS on it so it can participate in cluster operations. Like the File Allocation Table (FAT), it keeps track of where data lives. HDFS splits up big files and distributes the chunks to the nodes in a replicated fashion, i.e. a chunk of data is sent to 2-3 nodes at a time. Why? Because Hadoop was designed for failure! Sounds strange? I will squeeze in a paragraph or so about Functional Programming later, since the Map function closely resembles the functional programming paradigm.
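To make the “loading” part concrete, here is a minimal sketch using the standard Hadoop FileSystem Java API; the Name Node address, replication factor, and file paths are illustrative assumptions, not values from any particular cluster:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LoadIntoHdfs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Point the client at the cluster's primary Name Node (hypothetical host/port).
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
    // Each block of the file will be copied to this many data nodes.
    conf.set("dfs.replication", "3");

    try (FileSystem fs = FileSystem.get(conf)) {
      // Copy a local book into HDFS; HDFS splits it into blocks and
      // replicates each block across the cluster so a failed node is not fatal.
      fs.copyFromLocalFile(new Path("/tmp/romeo-and-juliet.txt"),
                           new Path("/user/hadoop/books/romeo-and-juliet.txt"));
    }
  }
}
```

The hdfs dfs -put command-line tool does the same thing; the point is that the file lands in HDFS, not in the OS file system.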

Primary and Secondary Name Nodes

As with every file system, you need to know where stuff is located. HDFS implements this by designating one Hadoop server as the primary Name Node, which keeps an inventory of where data is located, so that if a node decides to take a vacation, other nodes can step in and replace it. A secondary Name Node periodically checkpoints that metadata so the primary can recover faster.

MapReduce

MapReduce is the essence of a Hadoop system. It is the business logic, translating a statistical model or execution plan into what we set out to do: answer questions. The logic is usually encapsulated in a Java application with, you guessed it, a Map and a Reduce function specifying what data types to expect, what to do with outliers and missing data, and so forth.

Below is a sketch of what a Word Count Java application looks like. In a nutshell, the Mapper tokenizes each word and counts it for a small chunk of the data. The Reducer takes the output of the Map, which is many (Key, Value) pairs of the form (word, count), and aggregates the counts. Part of the app configures the job and submits it to Hadoop, which is something you can find out on your own.
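Here is a minimal sketch along the lines of the classic Hadoop Word Count tutorial (class names follow the usual tutorial shape; input and output paths come from the command line):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: for each chunk of the book, emit (word, 1) for every token.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: aggregate the counts emitted for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // Configure the job and submit it to Hadoop.
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. the book loaded into HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

You would package this as a JAR and hand it to the cluster with the hadoop jar command, pointing the input path at the book you loaded into HDFS.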

Business Use

Obviously, businesses do not build such massive systems without ample justification of how they will help solve current problems (bottom line and top line) and plan solutions for the mid- to long-term issues they face. They aim to use the massive parallel compute power at their disposal to answer questions about customer or system behavior using large amounts of structured, semi-structured, and unstructured data. Questions like: a cable or mobile phone company asking itself why people are not renewing their contracts (the churn rate); a large agricultural co-op deciding on the amount of a certain fruit or vegetable to grow next season based on data other than futures prices from the exchange; or a local police force trying to understand why the crime rate is high at a certain location; and so forth.

So, Now What? What’s the Big Deal?

So now we have this great framework being used by Internet giants such as Facebook, and people like you and me who work for consulting companies have to position Big Data and its benefits. The answer comes from one of my favorite researchers/professors at MIT, whose work I studied with zeal while in economics school. Erik Brynjolfsson studied the performance of companies that excel at data-driven decision-making and compared it with the performance of other firms. And guess what the outcome was? He found that productivity levels were as much as 6 percent higher at such firms than at companies that did not emphasize using data to make decisions, which did not surprise me at all.

So as a business executive, entrepreneur, consulting firm, or software provider, there is a huge cost to doing nothing! Case closed!! I would really like to hear your thoughts on what you would like to see in Part III. Right now, I am planning to highlight some of the technologies which make the use of Hadoop easier, especially for Data Analysts; however, I am open to ideas. Hope you enjoyed the post.

Cheers,

Bash Badawi

Open source software, Microsoft Strategy and the dynamics of multi-sided platforms.

You cannot beat FREE!… The Free Software movement shook the foundation of the software industry and posed a serious challenge to the very essence of capitalism in this industry. It forced vendors to offer “basic” versions of their software as an open-source, free license entry point alternative.

Microsoft is one of those vendors. Two severely under-advertised Microsoft programs for students and start-ups to get full-version software are DreamSpark and BizSpark. They work like this: if you are a student and your school has an agreement with Microsoft, you can get a lot of free Microsoft software, ranging from Visual Studio Pro to Windows Server 2012 R2 and SQL Server 2012, operating systems, and a load of other goodies to get you on your way to developing software for the traditional desktop, the cloud, and the app stores (both Windows and Mobile).

The process is very simple. Just go to the dreamspark.com site, check whether your institution has a participation agreement in place, and you are set to go. BizSpark works in a similar fashion; however, it is geared towards start-ups under a preset revenue limit and with a limited number of developers. The cost is a few hundred dollars. This is part of Microsoft’s strategy to influence the incubation phase of a business in an attempt to steer it clear of the LAMP stack.

This phenomenon is not new. Most people are unaware that lowering the market entry cost is a well-known strategy for any software vendor running a multi-sided platform business model. It is very well studied and researched in the gaming industry, where the entry cost for game developers is heavily subsidized by the gaming/software platform company, or sometimes the reverse is true. Of course, the subsidy is supported indirectly by the customer, who always ends up being the party that pays for the cost of the subsidy given to the game/software developers.

Depending on what side of the platform you are on, having a strategy is very important, especially if you can pool your industry peers together to garner some buying/negotiating power. Sometimes, though, being the customer is the most disadvantageous position in the multi-sided platform game. Gaming companies have been lowering entry costs steadily for the past few years, and if you are a product manager, there are quite a few lessons to be applied from that industry, as well as from the open-source/free-license movement, to move your wares… TBC

Software and the human mind: Is there an Evolutionary Gap?

I did an experiment a while ago where I asked people: what is software? It was a lot of fun, and I recommend that you try it. I found that if you are a programmer, you think it’s code; if you are an architect, you are thinking abstraction layers, connectors, models. Customers, however, think cost: will it do what they want without disrupting their business? And then I asked my mother. Her answer was: “software is what makes the computer whirr and allows me to check out my friends on Facebook, Skype, and watch my favorite soaps” (which I found amazing insofar as it was the answer closest to reality!).

Software is like someone telling you the shop you are looking for is about half a mile’s walking distance away. I find these directions to be the funniest, since you have to have walked to that shop many times and measured the distance before you can be even close to accurate about where the shop is (or used Google Maps!). But even with maps, for the majority of us, our human minds cannot really visualize a mile, or a mile and a half, or two. We are simply not equipped to conceptualize distance, and it is the same with software.

It’s difficult for me to believe that the human mind has fully evolved to understand Software Engineering the way it understands, say, Civil Engineering. I believe this is where most of the problems affecting our industry come from.

In general, no one’s working definition really matches the formal one stating that a program is a “set of instructions performed on data to transform it into a desirable output”. And this is where and why we get into trouble. Say you are estimating a project: each person on the project/product team will have a different context associated with their definition of software. OK, they might be close, but there will always be subtle yet sometimes detrimental differences.

So what’s the best approach to solving this problem? I can think of a few: use a framework/methodology as a contract to maintain a shared perception. Also, you can keep a core team and use it frequently, as repetition can cause perceptions to converge and help overcome our evolutionary deficiencies. This approach will result in consistency, but not necessarily creativity. For that, you have to mix up the team and go through the pains I started this post with!!

Do you agree? It’d be great to hear about your workaround for our yet-to-evolve human brain.