Saturday, January 26, 2013

What is Data Analytics for Big Data?

There are a lot of blogs, podcasts, articles and even books about data analytics for Big Data, sometimes referred to as Big Data Analytics. I wanted to write more on this subject because Big Data Analytics is not a buzzword that shines and disappears in a year or two. I believe it is something we will be hearing about for years to come, and there is no question in my mind that it will be a game changer for data and data analytics in every field and industry. Big Data Analytics is no longer a specialized solution for cutting-edge technology companies. It is evolving into a viable, cost-effective way to store and analyze large volumes of data across almost all industries.

What is Web Analytics?

The Digital Analytics Association defines web analytics as the measurement, collection, analysis and reporting of internet data for the purposes of understanding and optimizing web usage.[1] You can read more about Web Analytics in my previous post, Web Analytics and Data Warehouse.

What is Big Data?

Big Data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, storage, search, sharing, analysis and visualization.[2]
Some examples of Big Data include medical records, photography archives, video archives, large-scale e-commerce, internet search indexing, call detail records, web logs, RFID, military surveillance, and data from astronomy, atmospheric science, genomics, biogeochemistry, biology and other complex scientific research.
Big Data technologies like Apache Hadoop, an open-source software framework, provide large-scale, distributed data storage and processing across clusters of hundreds or even thousands of networked computers. The objective is to provide a scalable solution for Big Data while minimizing processing time.
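To make the processing model a little more concrete, here is a minimal sketch of the map, shuffle and reduce pattern that Hadoop's MapReduce spreads across a cluster. It runs in a single Python process and does not use Hadoop's API. The log format, the sample lines and all function names are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical web log lines in the assumed format "<ip> <path> <status>".
LOG_LINES = [
    "10.0.0.1 /index.html 200",
    "10.0.0.2 /missing 404",
    "10.0.0.1 /about.html 200",
]


def map_phase(record):
    """Emit one (key, value) pair per record: (status code, 1)."""
    status = record.split()[2]
    yield status, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key: here, a simple count."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on one machine."""
    pairs = (pair for record in records for pair in map_phase(record))
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    print(run_job(LOG_LINES))
```

In a real cluster the map and reduce steps would run in parallel on many machines, each working on its own slice of the data. The sketch only shows the shape of the computation.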
In 2010 alone, our world produced one zettabyte (1,000,000,000,000 gigabytes) of data, coming from five billion mobile phones, 30 billion posts shared on Facebook per month, and millions of networked sensors connected to mobile phones, energy meters, automobiles, shipping containers, retail packaging and more.[3][4]
Companies have faced several challenges in implementing Big Data and Big Data Analysis projects. Some of these are:
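As a quick sanity check on that unit conversion, here is a small sketch. It assumes decimal (SI) units, where one gigabyte is 10^9 bytes and one zettabyte is 10^21 bytes. The function name is only illustrative.

```python
# Assumption: decimal (SI) units, not binary (GiB/ZiB) units.
BYTES_PER_GIGABYTE = 10**9
BYTES_PER_ZETTABYTE = 10**21


def zettabytes_to_gigabytes(zettabytes):
    """Convert a whole number of zettabytes to gigabytes."""
    return zettabytes * BYTES_PER_ZETTABYTE // BYTES_PER_GIGABYTE


if __name__ == "__main__":
    # Thousands separators make the result easy to compare with the text.
    print(f"{zettabytes_to_gigabytes(1):,} gigabytes")
```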
Big Data Cloud
Photo Courtesy: [5]
· For many years, companies faced high upfront infrastructure costs for Big Data and Big Data Analysis projects. They were also unable to respond to scale-out requirements because of infrastructure limits. This problem has been solved by Big Data cloud services like Amazon’s Elastic MapReduce or Microsoft’s Hadoop distribution for Windows Azure, which enable companies to lease infrastructure for their Big Data projects.
Integrating Data warehouse with Big Data
Photo Courtesy: [6]
· For most companies, integrating Big Data with the other components of the Data Warehouse environment is critical. Big Data does not replace the Data Warehouse. Hadoop is built for fairly simple workloads, such as sorting, aggregating, converting and filtering. It is not intended to manage schema structure and database security, so database management is still important for companies. The challenge has been how to integrate the two. IBM, Informatica, Microsoft, Oracle and SAP have released tools to interface Hadoop with relational database management systems, which solved this problem.
Photo Courtesy: [7]
· When it comes to Big Data Analysis, getting user-friendly tools has been a challenge. Tools like Apache Pig (a scripting language for data flows) and Apache Hive (a SQL-like query layer) let advanced data analysts run queries directly against data stored in Hadoop, but they require technical expertise. Recently, Microsoft announced the Hive ODBC driver and the Hive add-in for Excel, which will allow end users to access data stored in Hadoop through Excel, Power Pivot and Analysis Services. Tableau has also released a tool that allows users to build Hadoop reports by drag and drop. These tools will allow end users to work on Big Data Analysis much more easily.
Since the above challenges have been resolved, we will see dramatic growth in Big Data Analysis in the coming years. Industries likely to get the most out of Big Data analytics include:[3]

Supply chain, logistics, and manufacturing
RFID sensors, handheld scanners, and on-board GPS vehicle and shipment tracking produce vast quantities of information, offering significant insight into route optimization, cost savings and operational efficiency.
Financial services
Financial markets generate immense quantities of stock market and banking transaction data that can help companies maximize trading opportunities or identify potentially fraudulent charges, among various uses.
Energy and utilities
Smart instruments and electronic sensors attached to machinery, oil pipelines and equipment generate streams of incoming data that must be stored and analyzed to uncover and fix potential problems.
Media and telecommunications
Streaming media, smartphones, tablets, browsing behavior and text messages are captured at ever-increasing rates all over the world, representing a potential treasure trove of knowledge about user behavior and tastes.
Health care and life sciences
Electronic medical records systems are some of the most data-intensive systems in the world, and making sense of all this data to provide patient treatment options and support clinical studies can have a dramatic effect.
Retail and consumer products
Retailers can analyze vast quantities of sales transaction data to uncover patterns in customer behavior and monitor brand awareness.