Wednesday, February 13, 2013

Big Data -- The Art Of The Possible[1]


What is big data? How big is it?

BIG data, according to Wikipedia, is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. Doug Laney used a three-dimensional concept to describe the opportunities and challenges of data growth: increasing Volume (amount of data), Velocity (speed of data in and out), and Variety (range of data types and sources), known as the "3Vs" model.[2] IBM later expanded that concept with a fourth dimension, Veracity, citing the finding that 1 in 3 business leaders don't trust the information they base their decisions on. However, the definition of "big" varies from organization to organization, depending on scope and capabilities. The following examples give an idea of how "big" it actually is in the modern context.

  • Walmart handles more than 1 million customer transactions every hour, which are imported into databases estimated to contain more than 2.5 petabytes (2560 terabytes) of data, the equivalent of 167 times the information contained in all the books in the US Library of Congress.
  • Facebook, a social-networking website, holds 40 billion photos sent or shared by its users.[3]
Now let's take a step back from those large organizations and look at how this explosion of information affects average US households. Researchers at the University of California, San Diego (UCSD) examined the consumption of information by American households, and according to their results, "in 2008 such households were bombarded with 3.6 zettabytes of information (or 34 gigabytes per person per day)", the majority of which came from video games and TV programs.[4]
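
As a quick sanity check on those two figures, here is a small Python sketch that works out how many people they imply. It is only back-of-the-envelope arithmetic, and it assumes decimal units (1 zettabyte = 10^21 bytes, 1 gigabyte = 10^9 bytes) and a 365-day year, neither of which is stated in the quoted source.

```python
# Assumptions (not from the quoted source): decimal units and a 365-day year.
ZETTABYTE = 10**21  # bytes
GIGABYTE = 10**9    # bytes

total_bytes = 3.6 * ZETTABYTE          # yearly total quoted above
per_person_per_day = 34 * GIGABYTE     # daily per-person figure quoted above

# Yearly consumption for one person, then the head count the total implies.
per_person_per_year = per_person_per_day * 365
implied_people = total_bytes / per_person_per_year

print(f"Implied number of people: {implied_people / 1e6:.0f} million")
```

Under those assumptions the two numbers imply roughly 290 million people, which is in the same ballpark as the US population at the time, so the yearly total and the per-person figure appear to be consistent with each other.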

How to make sense of big data?



UNLIKE playing video games and watching TV, where the purpose is usually not to digest the information and internalize the knowledge, large organizations and scientific research institutions can actually benefit a lot from the unprecedented amount of data now made available by the development of digital devices. However, the amount itself doesn't increase the credibility of the information extracted from the data. Under some circumstances it even works the opposite way, especially as interdisciplinary research comes into the picture. Can we tell the economic trend by looking at people's browsing behavior? Is the result even reliable enough to be worth the effort? There are a lot of questions to ask before we dive into the sea of big data, and it takes professional skills to analyze the possibly ugly results we may get after making that decision. That's where the demand for a new kind of professional emerged: the data scientist. By job description, a data scientist should be able to work as a “software programmer, statistician and storyteller/artist to extract the nuggets of gold hidden under the mountains of data.”[3]

From my own experience, when doing data mining it's important to first come up with the questions each specific task tries to answer, while keeping an open mind about the possible results. Very often we simply cannot get the results we anticipated, and interpretation is never the easy part, because the implications can be counterintuitive. So I guess when working with big data, what's really needed is the full skill set of a data scientist, an open mind, and a handy toolbox.

What are the most commonly used tools?

  • Hadoop: Created by Doug Cutting and named after his kid's stuffed toy elephant, Hadoop is "a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment".[5] Hadoop is attractive for its scalability, cost efficiency, flexibility, and fault tolerance.
  • NoSQL databases: Think of Hadoop as a tool to organize the racks of servers; NoSQL databases would then be the storage solution for the data held on all those servers. A NoSQL database is a good match for Hadoop, because "it is designed to provide highly reliable, scalable and available data storage across a configurable set of systems that function as storage nodes".[6]
Besides Hadoop and NoSQL databases, there are other pre-built tools for programmers to use too. Examples are BitDeli, Google Prediction API, Flurry, etc. For programmers who are more involved with coding than with controlling data flow, these ready-to-use apps can speed up the analytics process a lot, or simply steer it toward a more efficient and profitable track.
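
To make the "processing of large data sets in a distributed computing environment" idea a bit more concrete, here is a minimal sketch of the map/reduce pattern that Hadoop's MapReduce engine is built around, written in plain Python. It runs on a single machine and does not use Hadoop's actual API; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample sentences are made up for illustration only.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would do between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all the values collected for one key into a single count."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = []
    for doc in documents:
        # On a real cluster, each document could be mapped on a different node.
        pairs.extend(map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    docs = ["big data is big", "data about data"]  # hypothetical input
    print(word_count(docs))
```

The point of splitting the work this way is that the map and reduce steps only ever look at one document or one key at a time, so in principle they can be spread across many servers, which is where the scalability mentioned above comes from.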

AFTER the discussion above, I guess it's safe to draw the conclusion that Big Data is a sea with both unknown treasure and unpredictable challenges, but one that can be tamed by a group of sailors with the right skills and instruments. Hopefully, we can all find our ultimate "one piece".

references