Category Archives: Software and Economics in Industry

How the Azure Analytics (R)evolution will Alter the Hard Truth about Data Science

If you are in technical sales or trying to break into Big Data or Data Science, this post is definitely a must-read for you. It is based on real-life examples, research and analysis from a practitioner's standpoint.

There is a sad but realistic truth about Data Science (DS henceforth) and Big Data (BD). There is an unfortunate dichotomy in this space: those who are on the inside track (the Insiders), and those who are trying to get in (the Outsiders). In the world of the Insiders there is a persistent message, one which has garnered support amongst employers, that amounts to keeping the Outsiders out by raising the entry barrier. How? By raising the number of skills required. I came up with a list of must-have skills for a BD or DS role based on my own analytics exercise on hundreds of DS/BD job descriptions I gathered from a web scraping. So guess how many skills are required? Twenty-two! Setting the bar at twenty-two skills, and counting, is ridiculously unrealistic!

 Enter Microsoft!!

The will-be DS/BD game changer happened quietly, with not much coverage in the tech media world. This was Microsoft's acquisition of Revolution Analytics. So, what does the Microsoft acquisition have to do with Outsiders wanting to get into Data Science? A lot!! Let me explain.

The one fundamental skill which is a must-have in DS is Statistics! And what does the DS community use as a statistics programming language? R! Here is where the story gets interesting.

A branch of DS is Machine Learning, or Predictive Analytics. This is the true value-add for any BD initiative: being able to make data-driven decisions about future strategy.

To do this in R, the very popular analytics language, you need to go through training and become an expert before you can even attempt to get close to solving problems. R is archaic. It is a reincarnation of the languages S and S-Plus, and its lineage is roughly 40 years old.

The great news is that Microsoft just acquired Revolution Analytics (hence the post title). Revolution R (a product of Revolution Analytics) is known for enabling parallel processing of R resulting in massive performance gains.

Today, R is supported in Azure Machine Learning in its current form. So why does this matter? The reason is very simple. There is talk that Microsoft will probably take R, modernize it, make it a first-class language, add it to Visual Studio with IntelliSense support, and allow everyone to develop Azure ML solutions in Visual Studio, ready to be published directly to Azure. The net effect:

  • It will be nothing short of a coup for the Outsiders in the DS/BD field, who will be able to develop solutions without paying the massive R learning-curve cost.
  • Azure ML already provides a lot of the R functionality out-of-the-box today, so this is a natural extension of the existing functionality.
  • Azure ML makes it very simple to share code, models, and solutions to common problems, thus:
  • I will get my solutions to market up to 10X faster.
  • I can profit from my work if I choose to publish it in the Azure Gallery.
  • It will change the market dynamics, as it will ease the short supply of BD/DS talent.
  • And it will unleash the genius of statisticians and business users with minimal programming experience to conduct their own experiments.
  • Coupled with HDInsight, which has access to the Analytics APIs, this is the most efficient BD solution on the market today.

 

What's more: keep an eye on this space. As I mentioned, if you are in Microsoft technical sales, this will be your ticket to moving large enterprises onto Azure with a strategy rather than just a plain tactical need!! If you are evaluating DS technology today, you probably know very little about Azure ML and the functionality it provides. I would highly encourage you to evaluate Azure ML before heading to Cloudera, MapR or any other Hadoop vendor. You will not be disappointed!

 

About Me:

I am a freelance consultant with over 24 years of experience in IT, strategy and economics. I specialize in Cloud, Data Architecture, DS, Machine Learning and corporate strategy, and I provide architectural consulting, training, technical research, and data-driven decision-making solutions based on economics-based statistical methods grounded in scientific frameworks.

The 22 Skills List of Data Science:

  1. R Programming
  2. Getting and Cleaning Data
  3. Exploratory Data Analysis
  4. Reproducible Research
  5. Statistical Inference
  6. Regression Models
  7. Practical Machine Learning
  8. Developing Data Products
  9. Data Visualization
  10. DBA
  11. Hadoop (including the Azure HDInsight technology stack)
  12. Orchestrate data workflows
  13. Data ingestion/curation using Pig, Hive, Sqoop or other Hadoop tools
  14. Hadoop cluster configuration using Hadoop big data architecture
  15. High-level design using Business Analysis, Microsoft Azure Platform Knowledge, Blob Storage API Knowledge
  16. Blob Storage API Knowledge
  17. Metadata management tools
  18. Model client data
  19. Mapping
  20. Data profiling – Information analyzer/Excel preferred
  21. Deciding how data is going to be used to make decisions, and
  22. Knowledge of both tools and methods from statistics, machine learning and software engineering, as well as being human and showing persistence


Big Data: Vendor Review, Prerequisites and How to Crack it Quickly – Part III

Well, sadly, we have come to the end of a topic which I have tremendously enjoyed (unless of course I change my mind and can think of something compelling to do a Part 4). In Part I and Part II, I discussed what Big Data is and what you need to know, among other things. In Part III, I will start the post with a question: What do software engineers/communities/companies do after they ship software? I say engineers/communities/companies since the lines are so blurred now that we don't know who is who (you know: open source, services-only companies, countries backing open source initiatives, etc.). I will provide the answer at the end of the post for a suspenseful effect.

The purpose of this post is to tell you what else is out there, the ugly stuff to expect, and how to get started:

  1. Expectations: Don't expect a lot!! Meaning: DON'T EXPECT THINGS TO WORK right out-of-the-box. You will run into every conceivable problem you can think of, whether in a built-in feature or an add-on.
  2. How to get started (it's very hard, you'll see what I mean). Cloudera was probably the least problematic for me, and some features actually worked (after some long hours of debugging). Hortonworks was my number 2. The cool thing about Hortonworks is that they offer a Linux and a Windows version (I think Cloudera is about to offer the same). The Hortonworks-based Windows offering, called Azure HDInsight, uses Blob storage on Azure instead of HDFS (pretty smart of Microsoft to keep Windows relevant in the space). Here is a link to the Hortonworks architecture image below for both stacks.

[Image: Hortonworks cross-platform architecture (Linux and Windows stacks)]

Why is the Microsoft platform so important? IMHO, because data visualization is an important part of the overall Hadoop skillset. Presenting the information can make or break the project. So why is Microsoft Office so important here? Because it has some powerful data visualization tools (Power Pivot, Power Query, Excel) which help ensure the buy-in and inclusion of the business users and, most importantly, of the data analysts who like not just to view data but also to run queries over Hive ODBC.

Additionally, and this is a huge plus for the Microsoft-Hortonworks partnership, there is the tight integration of Microsoft System Center with Hortonworks's Hadoop, as per the image below (think of hundreds of nodes), link here. Other vendors offer some form of monitoring; however, the UI is nothing like System Center.

[Image: Hadoop cluster monitoring in Microsoft System Center]

There are other players in the field I did not mention, such as:

[Image: Hadoop vendor landscape and usage]

Image by way of http://thirups.blogspot.com/2013/04/beyond-buzz-big-data-and-apache-hadoop.html and http://2.bp.blogspot.com/ as well.

As you can see, the marketplace is way overpopulated, which means it is ripe for acquisitions/consolidation. I bet you that if I republish this post a year from now, the picture above will show just a handful of market players.

[Image: Gartner Hype Cycle for Big Data, 2014]

Figure 1: Gartner Research Hype Cycle for Big Data, 2014, by way of http://www.Clickinsight.ca. Link is here.

Additionally, based on the Gartner Hype Cycle, I do believe we are somewhere between the Peak of Inflated Expectations and climbing the Slope of Enlightenment. It depends on which sector you examine (e.g. telco is ahead of transportation).

You might ask: how and where do I start? I need to acquire those skills fast to cash in on this lucrative trend! My suggestion is to start with the major Big Data vendors I mentioned in Part I of the series. They all have sandboxes or cloud images for training purposes.

Therefore, start with the virtual machines: Cloudera and the Cloudera Sandbox; the Hortonworks on-premise Sandbox; Hortonworks-based Azure HDInsight (the cloud startup is a snap to get up and running); Amazon EMR (Elastic MapReduce); MapR; or Do-It-Yourself (DIY) for the brave-hearted (Udemy and Coursera have good video courses on how to get started from scratch. Good luck, mate!).

I have had major pains with all of them. They were so buggy that I deleted and started over for each one of them.

I recommend that you start with a cloud-based sandbox in the order mentioned above, then try the on-prem versions. Now, you can build your own from scratch, but unless you are a Linux-Java-I-can-solve-everything guy, DON'T! You will be wasting lots of time doing something someone else has already done, i.e. built you a nice VirtualBox, VMware or Hyper-V VM and debugged it all. That doesn't mean you won't have to debug things yourself. Believe me, you will run into a problem somewhere and have to debug the "ready" image or cloud instance. If you can recreate the cluster in the cloud, just do it.

The answer to the question, what do software engineers/communities/companies do after they ship software? They make it work: by shipping minor releases with what did not make the cut, adding the tools they used to debug the build itself, and adding tools to make it easier for the user to manage and use the system. Tools such as: Hive, Pig, Impala, Sqoop, Cloudera Manager, Mahout, Flume, Storm, Spark, ZooKeeper, Ambari, HBase, Splunk, and some other tools I have yet to come into contact with.

Cheers,

Bash Badawi

Bash Badawi has over 24 years of technology experience, ranging from an internship at NASA to providing IT advisory services in over 30 countries around the globe in Europe, Africa, Asia and the Americas, advising governments to adopt an Open Data policy. Mr. Badawi can be reached on his LinkedIn profile or over email at techonomist@hotmail.com. I welcome all feedback, as I am an avid reader and learner.

Big Data: Vendor Review, Prerequisites and How to Crack it Quickly – Part II

In Part I, I gave an overview of Big Data, what it is, what the (ideal) skills are that you need to get started (a lot!), and named some of the key players. In this post, I will cover some of the most important technologies, some examples of Big Data applications, and the architecture which underlies the Hadoop/Big Data framework.

A Historical Background

This paragraph should have been in the first post. However, it is worth including for those who, like myself, want to know how "things came to be" as opposed to "they are just here so deal with it!!"

It all started at Google while searching for a solution for indexing the web. In a very simplistic view, the engineers at Google aimed to put together (Key, Value) pairs of words and the URLs of the sites containing those words. It sounds simple on paper; however, once you consider the sheer volume of sites, the complexity caused by the volume of data begins to crystallize.
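To make the (word, URL) pairing concrete, here is a tiny, purely illustrative Java sketch (the class name and sample URLs are made up for this post; a real web index is distributed across thousands of machines rather than held in one in-memory map):

import java.util.*;

// Toy inverted index: maps each word to the set of URLs containing it.
public class TinyInvertedIndex {
    public static void main(String[] args) {
        Map<String, List<String>> pages = Map.of(
            "http://example.com/a", List.of("big", "data", "hadoop"),
            "http://example.com/b", List.of("big", "apple"));

        Map<String, Set<String>> index = new HashMap<>();
        pages.forEach((url, words) ->
            words.forEach(w ->
                index.computeIfAbsent(w, k -> new TreeSet<>()).add(url)));

        // "big" maps to both URLs; "hadoop" only to the first.
        index.forEach((word, urls) -> System.out.println(word + " -> " + urls));
    }
}

Google's problem was this same pairing at the scale of the entire web, which is exactly where a single machine stops being an option.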

Soon after Google successfully designed the two main components behind such an effort, the Google distributed file system and the MapReduce framework, and had moved past them so that they were probably no longer a trade secret, they published a paper on the technology used to index the internet! The uniqueness of Google's problem lies in the fact that the solution required horizontal scalability (i.e. many nodes doing smaller tasks) as opposed to vertical scale-up (i.e. more and more beefed-up servers with more memory, etc.).

Once Google let the proverbial cat out of the bag, two guys working on a similar problem, Doug Cutting and Mike Cafarella (in 2005, per Wikipedia), had a massive a-ha moment and developed Hadoop, with Yahoo! soon backing the effort, starting with the Hadoop Distributed File System (HDFS) and the MapReduce framework and implementing what resembled the Google approach.

So, what are the Core Hadoop Components?

There are two main components at the core of Hadoop. Although there are many tools available, their purpose is to facilitate interaction with those two components.

Hadoop Distributed File System (HDFS)

HDFS is not a file system like FAT or NTFS. It is a layer which sits atop the traditional file system and handles the loading, distribution and replication of files. What you need to know about HDFS is that when you want to add a node to your Hadoop cluster, you have to load HDFS on it so that it can participate in the cluster operations. HDFS splits up big Hadoop jobs and distributes the data to the nodes in a duplicated fashion, i.e. a chunk of data is sent to 2-3 nodes at a time. Why? Because Hadoop was designed for failure! Sounds strange? I will squeeze in a paragraph or so about functional programming later, since the Map function closely resembles the functional programming paradigm.
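To make the HDFS layer a bit more tangible, here is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem API; the namenode address, path and replication factor are placeholder values, not a recommended configuration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder namenode address; normally picked up from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");

        // HDFS handles block placement and replication behind this simple call.
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello HDFS");
        }

        // Ask HDFS to keep 3 copies of the file's blocks (a common default).
        fs.setReplication(file, (short) 3);
        System.out.println("Replication: " + fs.getFileStatus(file).getReplication());
    }
}

The point is that the application simply asks for a file; where the blocks live and how many copies exist is HDFS's job.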

Primary and Secondary Name Nodes

As with every file system, you need to know where stuff is located. HDFS implements this by designating a Hadoop server as the primary Name Node, which keeps an inventory of where data is located (loosely, the role a File Allocation Table (FAT) plays on a single disk), such that if a node decides to take a vacation, other nodes can step in and replace it.

MapReduce

MapReduce is the essence of a Hadoop system. It is the business logic translating statistical models or an execution plan to achieve what we set out to do: answer questions. The logic is usually encapsulated in a Java application with, you guessed it, a Map and a Reduce function specifying what data types to expect, what to do with outliers and missing data, and so forth.

Here is a snapshot of what a Word Count Java application would look like. In a nutshell, the Mapper tokenizes each word and counts it for a small chunk of the data. The Reducer takes the output of the Map, which is many (Key, Value) pairs in the form (word, count), and aggregates the counts. Part of the app is configuring the job and submitting it to Hadoop, which is something you can find out on your own.
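Since the snapshot image is not reproduced here, below is a sketch that closely follows the standard Apache Hadoop WordCount example; the input and output paths are taken from the command line:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits (word, 1) for every token in its chunk of the input.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums all the counts emitted for the same word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Job configuration and submission, the part I mentioned above.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Running it is simply a matter of packaging the class into a jar and submitting it with the hadoop jar command against an input and an output directory.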

Business Use

Obviously, businesses do not build such massive systems without ample justification of how it will help them solve current problems (bottom and top line) and plan solutions for the mid- to long-term issues they face. They aim to use the massive parallel compute power at their disposal to answer questions about customer or system behavior using large amounts of structured, semi-structured and unstructured data. Questions like: a cable or mobile phone company asking itself why people are not renewing their contracts (the churn rate); a large agricultural co-op deciding on the amount of a certain fruit or vegetable to grow for the next season based on data other than futures contract prices from the exchange; or a local police force trying to understand why the crime rate is high at a certain location; and so forth.

So, Now What? What’s the Big Deal?

So now we have this great framework being used by Internet giants such as Facebook, and people like you and me who work for consulting companies position Big Data and its benefits. The answer comes from one of my favorite researchers/professors at MIT, whose work I studied with zeal while in economics school. Erik Brynjolfsson studied the performance of companies that excel at data-driven decision-making and compared it with the performance of other firms. And guess what the outcome was? The study found that productivity levels were as much as 6 percent higher at such firms than at companies that did not emphasize using data to make decisions, which did not surprise me at all.

So, as a business executive, entrepreneur, consulting firm, or software provider, there is a huge cost to doing nothing! Case closed!! I would really like to hear your thoughts on what you would like to see in Part III. Right now, I am planning to highlight some of the technologies which make the use of Hadoop easier, especially for data analysts; however, I am open to ideas. Hope you enjoyed the post.

Cheers,

Bash Badawi

How Applied Software Economics Can Solve Technology Providers Problems

In an earlier post, I talked about the role of Economics in the software industry, or lack thereof, and how I embarked on a journey to pursue a Master's degree in Business Economics to find out whether or not economics can be applied to software technology to solve some of the problems ailing the industry.

I pursued my economics degree in earnest to investigate whether there are potential benefits or explanations in applying an interdisciplinary approach, and whether the application of certain economic concepts has the potential to positively impact the software industry, an industry riddled with infamous stories of epic failures (healthcare.gov!!), monopolistic behavior, and collusion (see this amazing post on a secret non-poaching pact among Silicon Valley's titans of technology).

None of the companies I worked for employed anyone with an economics degree (or if they did, that was not the person's primary job), just simply Finance! Almost the entire middle-management layer consists of financial planners with a massive preoccupation with the 30/60/90-day budget/revenue planning/forecast and revenue-attainment cycle, which occurs four times a year, and they were busy making calls to folks in the field asking whether they would close the deal or not! You know the type, and you almost always feel that while you are doing actual valuable work, they are just counting the beans you are bringing in!!

Anyway, to offset this massive drain of corporate resources by the "Finance" layers, groups started sprouting up within the boundaries of the corporation, sometimes through a recommendation from an outside firm (a management consulting firm), with a primary focus on corporate strategy, mid- to long-term planning, or the highly misnamed R&D!! The cost of not doing so proved fatal to many companies who failed to "plan" to compete, or simply neglected to have a compete strategy, and were blindsided by smaller, more nimble technology startups which overtook them. I bet you can name ten of those tech companies right now that no longer exist. You know, the software darlings of their time with the meteoric rise and speed-of-light fall.

There have been many books, research papers, etc., which study the failure of businesses. I had to read through quite a few in my recent university days (in economics it's called "creative destruction", among many other terms related to Darwin's theory of evolution). As an example, Apple and Microsoft make a great case study. At some point, Apple's "closed" business model almost cost it its own existence. However, when the PC world became the wild, wild West of cheap components, buggy software, and so on, Apple's business model forged ahead with a simple advantage: more stability and security.

The economic principles behind the two camps, Apple and Microsoft, were at opposite ends of the spectrum. I am talking hardware-wise: Apple was a locked, proprietary, "closed" ecosystem, whereas MSFT was "open".

Enough theory, and back to reality. If you are a business owner, executive, or IT manager, I hate to break the bad news to you: there is a cost to doing nothing! Please allow me to explain. You built a product or a service, and now you are ready to sell it: how do you price it? Do you bundle it with another product? Do you follow the herd with a free-then-premium model? Build it, they will come, and then figure out a successful revenue model?! How about lock-in? Can you assure your customers they are not going to be locked in for the rest of the life of your software? The list is very long, and I have yet to scratch the surface. What's scarier is going at it alone without the aid of any theory which could be applied to address such issues.

An example of applied economics used to address project failure is an approach that treats the completion of software functionality as stock options, where the maturity date is the completion date for that particular functionality (I borrowed this one from Barry Boehm). Completing, say, a report accounts for X options pre-agreed upon prior to the project start date. Missed the date? Your options are underwater and it's time to focus on not missing the next cycle.
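Here is a minimal sketch, purely for illustration, of how that incentive could be modeled in code; the class, dates and option counts are hypothetical and not taken from Boehm's work:

import java.time.LocalDate;

// Hypothetical illustration: each deliverable carries a block of options
// that "vest" only if the deliverable is finished by its maturity date.
class DeliverableOption {
    final String name;
    final LocalDate maturityDate;   // the agreed completion date
    final int optionsGranted;       // options pre-agreed before project start

    DeliverableOption(String name, LocalDate maturityDate, int optionsGranted) {
        this.name = name;
        this.maturityDate = maturityDate;
        this.optionsGranted = optionsGranted;
    }

    // Options are worthless ("underwater") if completion slips past maturity.
    int vestedOptions(LocalDate actualCompletion) {
        return actualCompletion.isAfter(maturityDate) ? 0 : optionsGranted;
    }
}

public class CompletionOptionsDemo {
    public static void main(String[] args) {
        DeliverableOption report =
            new DeliverableOption("Monthly sales report", LocalDate.of(2015, 6, 30), 500);
        System.out.println("On time: " + report.vestedOptions(LocalDate.of(2015, 6, 28)));
        System.out.println("Late:    " + report.vestedOptions(LocalDate.of(2015, 7, 5)));
    }
}

A real scheme would of course be richer (partial vesting, sliding strike dates, and so on), but the core mechanic is that simple.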

Ideas such as this one, grounded in economic theory, can greatly improve performance, motivate people, and offer extremely creative solutions to chronic problems (how many times have you heard about projects missing deadlines, with cost overruns, etc.?)... Just food for thought!

IT $pending: Does it affect your company’s size?

There is an age-old question which should be on the minds of tech folks and tech sales folks alike: what happens when firms spend money on software technology? It's a deceptively simple question; however, when you dig deeper you find it is one of the toughest questions to answer.

A little bit of history is in order here. The question was originally posed when an economist (Ronald Coase) questioned the wisdom of treating the firm as a black box: input goes in one end and output comes out the other, with economists never paying attention to what happens inside the box. The answer was also part of my graduating thesis at the University of Strathclyde, Department of Economics.

It's a very interesting and very important question, especially if you are in the technology business. Let's say you want to invest in a multi-million-dollar Business Process Management system, and your CFO/CEO asks you: is this going to reduce my headcount and cut labor cost? This is very similar to asking what would happen if a firm were to outsource a specific function. Does the outsourcing result in a reduction of headcount, or an expansion in business activities? Pretty tricky question! And you need to have the answers handy before you pitch any proposal to a savvy CxO-level person. Remember, there are people's livelihoods hanging in the balance, including the IT department you deal with on a daily basis.

Let's zoom back out a bit here. It is well known that the aims of the firm, and by extension of IT spending within it, share a common goal: lowering friction, and the costs which might arise from such friction, within the firm. Therefore, if the IT project is successful (a big IF), then naturally the friction costs should go down, meaning resources are freed up, the organizational surplus increases, and the firm is free to carry out expansionary business activities, provided all the activities are carried out within the firm rather than through an assembly line of contractors, each doing the job it has been designated to do.

So what does the research tell us? According to a paper entitled "An Empirical Analysis of the Relationship Between Information Technology and Firm Size" by one of my favorite technology economists at MIT, Erik Brynjolfsson, et al. (link here), the evidence and the answer are highly dependent on the organization itself.

Let me clarify. If there are a lot of low-level jobs in the firm to be automated away by technology, then those jobs are gone forever. You might say that upskilling those folks to perform higher-level jobs is a possibility; however, it is highly unlikely. That is the short term. Long term, however, if there are plans in place to expand the business and there is a need for domain knowledge, those folks who were automated away in the short term can be repurposed, and gains can be reaped based on the strategy the company has in place.

From an employee standpoint, though, skilling up should be a never-ending endeavor to keep one's skillset up to date in an ever-shifting marketplace. Remember, outsourcing is not all it's cracked up to be, and there are many ways to circumvent the outsourcing hammer.

In the next segment, I will share how outsourcing has actually fallen flat on its face and what people did to get back in the employment game through the same companies their jobs were outsourced to….

Open source software, Microsoft Strategy and the dynamics of multi-sided platforms.

You cannot beat FREE!... The free software movement shook the foundation of the software industry and posed a serious challenge to the very essence of capitalism in this industry. It forced vendors to offer "basic" versions of their software as open-source, free-license entry-point alternatives.

Microsoft is one of those vendors. Two severely under-advertised Microsoft programs that give students and start-ups full-version software are DreamSpark and BizSpark. It works like this: if you are a student and your school has an agreement with Microsoft, you can get a lot of free Microsoft software, ranging from Visual Studio Pro to Server 2012 R2 and SQL Server 2012, operating systems and a load of other goodies, to get you on your way to developing software for the traditional desktop, the cloud, and the app stores (both Windows and Mobile)...

The process is very simple. Just go to the dreamspark.com site and check whether your institution has a participation agreement in place, and you are set to go. BizSpark works in a similar fashion; however, it is geared towards start-ups under a preset revenue limit and number of developers. The cost is a few hundred dollars. This is part of Microsoft's strategy to influence the incubation phase of a business in an attempt to steer it clear of the LAMP stack.

This phenomenon is not new. Most people are unaware that lowering the market entry cost is a very well-known strategy for any software vendor in a multi-sided platform business model. It is very well studied and researched in the gaming industry, where the entry cost for game developers is heavily subsidized by the gaming/software platform company, or sometimes the reverse is true. Of course, the subsidy is supported indirectly by the customer, who always ends up being the party paying for the cost of the subsidy given to the game/software developers.

Depending on which side of the platform you are on, having a strategy is very important, especially if you can pool your industry peers together to garner some buying/negotiating power. Sometimes, though, being the customer is the most disadvantageous position in the multi-sided platform game. Gaming companies have been lowering entry costs steadily for the past few years, and if you are a product manager, there are quite a few lessons to be applied from that industry, as well as from the open-source/free-license movement, to move your wares... TBC

Does Being an IT Specialist or a Generalist Impact Your Income?

I profess that I am not an HR expert; however, I have mentored and given career advice to many folks around the globe. The most common question I get, and contemplate myself, is: does career specialization impact your market value?

In Adam Smith's pin factory, he poetically describes how specialization and the division of labor give rise to economies of scale and made mass production a possibility, leading to the consumerism which is the ideal capitalistic manifestation of the American Dream we are currently trying to export to China and the world! The model in a nutshell goes as follows: you have got to buy stuff. Lots of it. Go into debt doing so. Why? So you can promote mass production, consumerism, division of labor and economies of scale, thus working for the rest of your life to pay down your debt, and you get the picture.

The question remains, though: how deep should our specialized skills be? And how does that affect our income? A colleague of mine once beautifully stated that our skills are shaped like the letter T. The horizontal top is your general skills, and the vertical portion is your specialization. When I went for my Master's degree in Economics, he told me that my new skills resemble the letter n, or the Greek letter Pi (Π), where I have two in-depth specializations.

So, which letter should your skills resemble, and which one is most lucrative these days? If you are the T type and specialize in a niche market, say stored procedures on mainframe DB2, then you are in a different league than, say, a Java developer. The problem: it's easy to outsource your job. Why? It's so specialized that it is easily definable. The same goes for just about any such role with a clear-cut division of labor, specialization and a uniform job description.

Now, if you are the Pi (Π) type, where you have multiple specializations, then it will be harder, though not impossible, to replace you. Remember, we operate in a labor market without borders, where a company can hire five people for the price of one or two.

So, what’s the answer? In my opinion:

  1. In the age of globalization, with
  2. Highly outsourced corporate functions,
  3. Large wage gaps in a global labor pool (for now, anyway),
  4. And outsourcing companies buying local outfits to assimilate and "look and feel" more local, thus expanding market share while maintaining a small local presence,

your best bet is to have either model: the Pi (Π), or the T with an extended horizontal top implying more generalized skills, which include great soft skills, deep local knowledge, and social capital through networking; and, most importantly, to work in hard-to-define jobs with vague requirements and responsibilities where you have to wear many hats (think startups). Finally, add a security clearance requirement and you have the proverbial cherry on top!!