Big Data: Vendor Review, Prerequisites and How to Crack it quickly – Part III

Well, sadly, we have come to the end of a topic which I have tremendously enjoyed (unless, of course, I change my mind and think of something compelling enough for a Part IV). In Part I and Part II, I discussed what Big Data is and what you need to know, among other things. I will start Part III with a question: What do software engineers/communities/companies do after they ship software? (I say engineers/communities/companies since the lines are so blurred now that we don’t know who is who: you know, open source, services-only, countries backing open source initiatives, etc.) I will provide the answer at the end of the post for suspenseful effect.

The purpose of this post is to tell you what else is out there, the ugly stuff to expect, and how to get started:

  1. Expectation: Don’t expect a lot!! Meaning: DON’T EXPECT THINGS TO WORK right out of the box. You will run into every conceivable problem, whether in a built-in feature or an add-on.
  2. How to get started (it’s very hard, you’ll see what I mean). First, Cloudera was probably the least problematic, and some features actually worked (after some long hours of debugging). Hortonworks was my number 2. The cool thing about Hortonworks is that they offer a Linux and a Windows version (I think Cloudera is about to offer the same). The Hortonworks Windows version, called Azure HDInsight, uses Blob storage on Azure instead of HDFS (pretty smart of Microsoft to keep Windows relevant in the space). Here is a link to the Hortonworks architecture image below for both stacks.

HortonCross-Plat
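To make the Blob-storage-instead-of-HDFS point a bit more concrete, here is a minimal sketch of how the same data path is addressed on each stack. The `wasb`/`wasbs` URI scheme is what HDInsight uses for Azure Blob storage; the function names, the host, the storage account, the container and the default port 8020 are all placeholders I made up for illustration.

```python
def hdfs_uri(namenode: str, path: str, port: int = 8020) -> str:
    """Build an HDFS URI. The namenode host and port are assumptions;
    use whatever your own cluster reports."""
    return f"hdfs://{namenode}:{port}/{path.lstrip('/')}"


def wasb_uri(account: str, container: str, path: str, secure: bool = True) -> str:
    """Build an Azure Blob storage URI as addressed from HDInsight.
    The account and container names are hypothetical."""
    scheme = "wasbs" if secure else "wasb"
    return f"{scheme}://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


# Same logical path, two different storage back ends.
print(hdfs_uri("namenode.example.com", "/data/logs/2014"))
print(wasb_uri("mystorage", "mycontainer", "/data/logs/2014"))
```

The job code on top (Hive, Pig, MapReduce) mostly stays the same; what changes is where the path points.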

Why is the Microsoft platform so important? IMHO, because data visualization is an important part of the overall Hadoop skillset. Presenting the information might make or break the project. So why is Microsoft Office so important here? Because it has some powerful data visualization tools (Power Pivot, Power Query, Excel) which will ensure the buy-in and inclusion of the business users and, most importantly, the data analysts, who like to not just view data but also run queries over Hive ODBC.
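For the "run queries over Hive ODBC" part, here is a rough sketch of what an analyst-style query could look like from Python rather than Excel. It assumes the third-party `pyodbc` package, an installed Hive ODBC driver, and a DSN you have already configured; the DSN name `HiveSandbox`, the table name `sample_07` and both function names are my own placeholders, not something a sandbox is guaranteed to ship with.

```python
import re

# Table names cannot be passed as query parameters, so validate them instead.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def connect_hive(dsn: str):
    """Open a connection through an ODBC DSN (assumes pyodbc and a Hive
    ODBC driver are installed; the DSN name is whatever you configured)."""
    import pyodbc  # third-party, imported lazily so the helper below works without it
    # Hive has no transactions in the usual sense, so autocommit is turned on.
    return pyodbc.connect(f"DSN={dsn}", autocommit=True)


def preview_table(conn, table: str, limit: int = 10):
    """Return (column_names, rows) for the first `limit` rows of `table`.
    `conn` can be any Python DB-API connection, ODBC or otherwise."""
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = conn.cursor()
    try:
        cursor.execute(f"SELECT * FROM {table} LIMIT {int(limit)}")
        columns = [col[0] for col in cursor.description]
        rows = [tuple(row) for row in cursor.fetchall()]
    finally:
        cursor.close()
    return columns, rows
```

Usage would be along the lines of `columns, rows = preview_table(connect_hive("HiveSandbox"), "sample_07")`, assuming that DSN and table exist on your setup. Excel and Power Query go through the same kind of DSN, just with a friendlier face.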

Additionally, there is a huge plus for the Microsoft-Hortonworks partnership: the tight integration of Microsoft System Center with Hortonworks’s Hadoop, as per the image below (think of hundreds of nodes), link here. Other vendors offer a form of monitoring; however, the UI is nothing like System Center.

HadoopSysCntr

There are other players in the field I did not mention such as:

HadoopUsage

Image by way of http://thirups.blogspot.com/2013/04/beyond-buzz-big-data-and-apache-hadoop.html and http://2.bp.blogspot.com/ as well.

As you can see, the marketplace is way overpopulated, which means it is ripe for acquisitions/consolidation. I bet you that if I republish this post a year from now, the picture above will be down to just a handful of market players.

GartnerHyCy

Figure 1: Gartner Research Hype Cycle for Big Data 2014, by way of http://www.Clickinsight.ca. Link is here.

Additionally, based on the Gartner Hype Cycle, I do believe we are somewhere between the Peak of Inflated Expectations and the Slope of Enlightenment. It depends on which sector you examine (e.g. telco is ahead of transportation).

You might ask: how and where do I start? I need to acquire those skills fast to cash in on this lucrative trend! My suggestion is to start with the major Big Data vendors I mentioned in Part I of the series. They all have sandboxes or cloud images for training purposes.

Therefore, start with the virtual machines and cloud images:

  1. Cloudera and the Cloudera Sandbox
  2. Hortonworks on-premise Sandbox
  3. Hortonworks Azure HDInsight (the cloud startup is a snap to get up and running)
  4. Amazon EMR (Elastic MapReduce)
  5. MapR
  6. Do-it-yourself (DIY) for the brave-hearted (Udemy & Coursera have good video courses on how to get started from scratch. Good luck, mate!)

I have had major pains with all of them. They were so buggy that I deleted each one and started over.

I recommend that you start with a cloud-based sandbox, in the order mentioned above, then try the on-prem ones. Now, you can build your own from scratch, but unless you are a Linux-Java-I-can-solve-everything guy, DON’T! You will be wasting lots of time doing something someone else already did, i.e. built you a nice VirtualBox, VMware or Hyper-V VM and debugged it all. That doesn’t mean you won’t have to debug things yourself. Believe me, you will run into a problem somewhere and have to debug the “ready” image or cloud. If you can recreate the cluster on the cloud, just do it.

The answer to the question: What do software engineers/communities/companies do after they ship software? They make it work: they start minor releases for whatever did not make the cut, add the tools they used to debug the build itself, and add tools to make it easier for the user to manage and use the system. Tools such as: Hive, Pig, Impala, Sqoop, Cloudera Manager, Mahout, Flume, Storm, Spark, ZooKeeper, Ambari, HBase, Splunk, and some other tools I have yet to come in contact with.

Cheers,

Bash Badawi

Bash Badawi has over 24 years of technology experience, ranging from an internship at NASA to providing IT advisory services in over 30 countries across Europe, Africa, Asia and the Americas, advising governments to adopt an Open Data Policy. Mr. Badawi can be reached on his LinkedIn profile or over email at techonomist@hotmail.com. I welcome all feedback as I am an avid reader and learner.
