
The Big Data Posts: The Last Word….and The Fun Ones!!

Sorry, but I couldn’t let go =). Seriously though, I have some more information that I felt I had to share with my friends at LinkedIn. It is more like “Lessons Learned” on what NOT to do and what NOT to expect with Hadoop and Big Data from Day 1 forward!

Hello World!!

What is Hadoop’s equivalent of “Hello World”? The answer is Word Count. Why so? Because Word Count (I showed a code shot of a MapReduce Word Count in Part II) is supported by every vendor. In the Hello World tutorial, you download a book from Project Gutenberg or any copyright-expired book from Google Books (Cloudera comes with Shakespeare’s Romeo and Juliet in .TXT format), then you “load” the file into HDFS, and Hadoop spits out each word in the book and its count. One of my favorites is an inverted index example, which mimics a search engine: the results give you each word and the names of the books it appears in, or the count of the books it appears in. NOTE: The OS file system is completely different from the Hadoop File System, HDFS, which is why the file has to be loaded into HDFS first.
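To make the idea concrete, here is a minimal sketch of Word Count and the inverted index in plain Python. It does not touch Hadoop or HDFS at all; it only mimics the map, shuffle, and reduce steps in memory. The function names (`tokenize`, `reduce_by_key`, `word_count`, `inverted_index`) and the sample book titles are my own assumptions for illustration, not anything from a vendor tutorial.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into runs of letters (a crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def reduce_by_key(pairs, combine):
    """Shuffle + reduce: group the values by key, then combine each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: combine(values) for key, values in groups.items()}


def word_count(text):
    """Map every word to (word, 1), then sum the ones for each word."""
    pairs = [(word, 1) for word in tokenize(text)]
    return reduce_by_key(pairs, sum)


def inverted_index(books):
    """books is a dict of title -> text; returns word -> sorted list of titles."""
    pairs = [
        (word, title)
        for title, text in books.items()
        for word in tokenize(text)
    ]
    return reduce_by_key(pairs, lambda titles: sorted(set(titles)))


if __name__ == "__main__":
    # Hypothetical two-"book" library, just to exercise the functions.
    books = {
        "Book A": "Romeo loves Juliet",
        "Book B": "Juliet loves the night",
    }
    print(word_count("to be or not to be"))
    print(inverted_index(books)["juliet"])
```

In a real cluster the map and reduce steps run on different machines and the shuffle moves data over the network, but the shape of the computation is the same: emit key/value pairs, group by key, combine each group.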

Now onto the Fun Stuff

In Part III, I answered the question: “What do developers do after shipping ANY software product?” The answer was: “They develop the interfaces and tools to make it usable!!” Why? Because Version 1 is almost always about entering the market. All the free education and training resources? They are debug tools, workarounds, and persuasion to try to keep those developers around.

I once heard Bill Gates say, “We [Microsoft] excel at abstracting complexity. Take any task in Excel for example, go back 2-3 years, and compare the time difference it takes to achieve the same results!” The statement has stuck with me to this day because it is very important. It’s about Buy-versus-Build, in-house IT specialization decisions, IT investment choices, and so on. We continue to debate all of these arguments to this day.

And Hadoop is no different and will not be!! There is a race going on right now over who can provide a better abstraction or a more specialized-purpose utility layer on top of Hadoop MapReduce, unstructured data, and HDFS, all through utilities such as Hive, HBase, Pig, Flume, Sqoop, Ambari, Kafka, Mahout, YARN, Oozie, Storm, ZooKeeper, and Phoenix, just to mention a few. You can find details or summaries of each at http://apache.org. Now, on to the reasons for such progress, which is ironic when “progress” means making something work with ease, or work, period. The reasons for the progress:

Reason # 1 – Math, Statistics, and Statistical Modeling

Software providers know how difficult it is for the lay programmer, who probably graduated with an Art History or Psychology major (such majors permeated the software industry long ago), to write a Logistic Regression model; he cannot build a model from scratch to save his life. The result: build him/her a utility. Remember VB?! The OO community was on the brink of rioting at 1 Microsoft Way, Redmond, WA over a product which had nothing to do with OO (well, a little bit); however, it did take off. When I asked my professor in college about it (he was a brainiac, one of the originals on IBM’s Deep Blue chess program), he answered: “My grandmother can write code, but she cannot optimize it, so stick to the optimization part!!”

Reason # 2 – MapReduce is just plain difficult and alienates many data analysts with invaluable domain knowledge

This is where companies are already using technologies like Pig, Hive, Impala, and others: software whose purpose is to spare the programmer from having to translate the math solution into a Java MapReduce job! The problem: good luck debugging the generated code, on top of the added dependence (vendor lock-in) on the major players through support contracts. So bye-bye to the free open source software you touted to your boss, hello maintenance contracts, and time to revamp the resume! JK.. (I did not mention products such as Datameer, RedPoint, or LigaData, and the list is very long.)

Reason # 3 – It’s so Buggy, You Have to Start in the Cloud

This is a no-brainer. Believe me. Don’t even think about building an on-premise 100-node cluster just to test out the need for Big Data at your company. You might extend from the cloud to on-premise later, but your strategy should be “Cloud-it-and-forget-it” to start with.

Honestly, Big Data is the first application I have seen where your mode of execution should be Cloud-First and everything else later. So, if someone asks you about your cloud strategy, now you can pretend that you actually have one!! Why am I pushing this? It’s because I can provision a Hortonworks HDInsight on Hadoop

Finally, software choices: I mentioned the vendors in Part III, but I will do it again here so I won’t lose you, the Reader:

    1. Microsoft’s HDInsight:
      • The Cloud version is bliss!! It takes a few minutes to create the Azure Storage Service (free eBook here) and a few more to create the cluster, with Hortonworks Hadoop running in a Linux VM which you interact with by using PowerShell. You NEED to learn PowerShell, especially with the release of Windows 10, where Microsoft keeps steering users from the old command prompt (cmd.exe) toward PowerShell! Back to HDInsight: With Cloud Storage as a Service, you can spin up a Hortonworks image, where HDFS is actually Azure Blob storage, run your scenario, move the results back to the Storage you created earlier, and spin the Hortonworks cluster down. Total cost: very little, because it will all work, and if it doesn’t, you have someone (Microsoft) to yell at. Cloudera offers similar functionality (discussed later). One more KA feature: There is an Emulator provided by Microsoft which you can try locally before shipping your work off to Azure. IF you can install it. Seriously, let me know if it installs for you!!!
      • If you try the Hortonworks Data Platform Sandbox (VM) installation, you get an error pointing to a log file in a directory that does not exist, so you don’t actually know what happened. But wait, there is more: You can install the Hortonworks software in a VM running Windows, which will run a lot of JVM processes, one for each feature, with APIs made available to Windows and to the utilities supplied by Microsoft. One of those is Data Visualization, a very important aspect of any Hadoop installation. How is this done? Just add in PowerPivot, PowerQuery, or any SQL BI utility, install the Hive ODBC driver, and you can connect to the data directly and run some awesome visualizations long after the Hadoop cluster is gone.
      • This is how Hortonworks and Microsoft bring back the alienated Data Analysts crowd. Come on: How many analysts do you know who do not know Excel?!! Other vendors do this too; however, their tools are not Office!!

2. Cloudera: Provides four flavors: straight-up cloud, KVM, VMware, and VirtualBox. It will install smoothly (VirtualBox is the one I checked out), and the samples will work *sometimes*! As for the ones which don’t work, TAKE my advice and leave them alone. You can debug all you want, but there are so many moving parts that, unless you luck out, you will get nowhere. Keep. On. Trucking. Since I have a KA Windows Server, I was able to convert the VirtualBox image to Hyper-V and it worked fine. I was actually able to do the same for all vendors.

3. Hortonworks: Also provides the same flavors as Cloudera. However, as I mentioned earlier, go with the cloud option if you can, but be warned that it’s Azure, for the Microsoft-haters out there!! Azure gives you a one-month free trial. I used the Hyper-V image because I have a KA Windows server. The ugly thing is that you spin up the image and then connect to it via a browser or SSH. Not the prettiest UI. I also tried to get the Windows Data Platform to install and work. After many, and I mean many, hours of trying, nothing worked. Cryptic error messages and bad version control (HDP-2.0.6, 2.2.0, 2.4.2.2, and so many more)! I have to point out that there are three options here: a Hortonworks for Azure VM Template, which you “Quick Create” as mentioned above; an HDInsight Hadoop on Linux (cloud and Sandbox, with SSH or browser access); and a Hortonworks Data Platform MSI package for Windows, which will install on a Windows Server, but the prerequisites are too much, and if something breaks, you have nowhere to look first. This Windows distro can also be installed directly on a Windows Server VM and is available from Azure.

4. MapR: NOTHING worked for me. Period. I tried debugging but felt like there was no critical mass out there to answer questions. I ran out of time. I just moved on…. I do need to mention that MapR has recently joined forces with Amazon and now offers a Hadoop VM service on Amazon, and they offer “decent” virtual academy lessons. Unfortunately, the free videos are very limited.

5. Amazon EMR: It costs money to try! Unlike Microsoft and the other vendors, who provide you with a “Sandbox”, Amazon does not, and with my super low budget, I did not try it. I will try it soon enough though. Amazon is the pioneer of Cloud Services and a very innovative company that wants to rent out the massive data and compute capacity it has plenty of. It’s platform independent and offers everything possible. EMR and the entire Hadoop functionality are available as a set of APIs. Developers Rule!!

6. Finally, the last option is a DIY one, which I also experimented with: straight from the Apache Hadoop site, built on CentOS with the latest Hadoop distro. It was a nightmare. If anything breaks, good luck debugging it or finding the source of the problem. Remember, this is open source and no one is obligated to document anything if they don’t feel like it that day!! You are at the mercy of history (i.e. someone had the problem, solved it, and shared the fix), or of someone in the community who will jump in to help you.

For any project plan you come up with for a Big Data project, add a resource, name him Google or Bing based on the flavor of your implementation and your access to support, and assign him tasks such as Research, Books, and Professor, because he will save your life.

The Last Word:

Remember the graph I drew up with the required skills for a Data Scientist? Well, I made a few modifications and thought I’d share it with you, my friends on LinkedIn, so that if you are starting out, you know which are the important areas to start with first (usually the hard ones).

Big Data Skills

 

Cheers,

Bash

 

Bash Badawi has over 24 years of software development experience and is currently actively looking for a home to contribute his knowledge. Mr. Badawi can be reached on his LinkedIn profile or over email at techonomist@hotmail.com. He welcomes all feedback, questions, and requests for consultation, as he is an avid reader and learner.
