What is Hadoop?
In general terms, Hadoop is software that helps you derive
information from both structured and unstructured data. Consider the
following scenario: you have one gigabyte of data to process, and it is stored
on your desktop. You have no problem storing and processing this data. Then
your company starts to grow very quickly and the data increases to 100
gigabytes. You start to reach the limits of your computer, so you scale up by
investing in a larger machine. Your data grows again, this time to 100
terabytes, and once again you reach the limits of your computer. Moreover, you
are now asked to feed your application with unstructured data coming from
sources like Facebook, Twitter, etc. What will you do now?
This is where Hadoop can help.
Hadoop is an open source project of the Apache Foundation. It
is a framework written in Java, originally developed by Doug Cutting, who
named it after his son's toy elephant. It uses Google's MapReduce and file
system technologies as its foundation.
What is Hadoop for?
Hadoop is optimized to handle massive quantities of data,
which can be structured, unstructured, or semi-structured, using commodity
hardware, that is, relatively inexpensive computers. This massive processing is
done with great performance; however, it is a batch operation handling massive
quantities of data, so the response time is not immediate.
Hadoop can analyze and access the unprecedented volumes of
data churned out by the Internet in a much easier and cheaper way. It can
map information spread across thousands of cheap computers and can create an
easier means for writing analytical queries. Engineers no longer have to solve
a grand computer science challenge every time they want to dig into data.
Instead, they simply ask a question.
A few examples where Hadoop can help:
Hadoop applies to a bunch of markets. In finance, if you want
to do accurate portfolio evaluation and risk analysis, you can build
sophisticated models that are hard to jam into a database engine. But Hadoop
can handle it. In online retail, if you want to deliver better search answers
to your customers so they’re more likely to buy the thing you show them, that
sort of problem is well addressed by the platform Google built.[2]
Hadoop Architecture
MapReduce: The
MapReduce algorithm is a two-stage method for processing a large amount of
data. This process is abstract enough that it can accommodate many different
data-processing tasks, i.e., almost any operation one might want to perform on
a large set of data can be adapted to the MapReduce framework. The advantage of
the MapReduce abstraction is that it permits a user to design his or her
data-processing operations without concern for the inherent parallel or
distributed nature of the cluster itself.
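To make the two stages concrete, the following is a minimal sketch of the classic word count job written against Hadoop's MapReduce Java API. It is an illustration rather than part of the original article: the map stage emits a count of one for every word it encounters, the reduce stage sums those counts per word, and the framework takes care of splitting the input, shuffling the intermediate pairs, and running the tasks in parallel across the cluster. The input and output directories are passed on the command line.

// Minimal word count sketch using the Hadoop MapReduce Java API.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map stage: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce stage: sum the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

A job like this is typically packaged into a jar and submitted with a command such as hadoop jar wordcount.jar WordCount /input /output, where the paths are placeholders for directories in the cluster's file system.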
HDFS: The Hadoop
Distributed File System spans all the nodes in a Hadoop cluster for data
storage. HDFS aggregates the file systems on many local nodes into one
big file system. It assumes that nodes will fail, so it achieves reliability
by replicating data across multiple nodes.
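As a small illustration of how a client sees that single big file system, here is a sketch that writes and then reads a file through the HDFS Java API (org.apache.hadoop.fs.FileSystem). The NameNode address and the file path are made-up values for the example; block placement and replication happen behind this API, so the calling code never deals with individual DataNodes.

// Minimal sketch: write a file to HDFS, then read it back.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical NameNode address; replace with your cluster's fs.defaultFS.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path file = new Path("/user/demo/hello.txt");

    // Write: HDFS splits the stream into blocks and replicates each block
    // across several DataNodes.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("Hello, HDFS\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read: blocks are fetched from whichever replicas are available,
    // so the read still succeeds if a node holding one copy has failed.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }

    fs.close();
  }
}

The same operations are available from the command line via the hdfs dfs shell, which is often the easier way to get data in and out of the cluster at first.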
Architecturally, the reason one is able to deal with lots of
data is that Hadoop spreads it out. And the reason one is able to ask
complicated computational questions is that it has all of these
processors, working in parallel, harnessed together.
Four great characteristics of Hadoop:
1. Scalable: Hadoop is scalable. New nodes can be
added as needed without changing the data format, the data loading methodology,
or the applications on top.
2. Cost effective: Hadoop brings massively parallel
computing to large clusters of regular servers. The result is a sizeable
decrease in cost per terabyte of storage, which in turn makes analyzing all of
the data affordable.
3. Flexible: It can absorb any type of data, structured
or not, from any number of sources. Data from multiple sources can be joined and
aggregated in arbitrary ways, enabling deeper analysis than other systems can
provide.
4. Fault Tolerant: When a node is lost, the system
redirects work to another location and continues processing
without missing anything. All this happens without programmers writing special
code; knowledge of the parallel processing infrastructure is not even
required.
What is Hadoop not for?
Hadoop is not suitable for online transaction processing
workloads, where structured data is accessed randomly, as in a relational
database. Nor is it suitable for online analytical processing or decision
support system workloads, where structured data is accessed randomly
to generate reports that provide business
intelligence. Hadoop is used for Big Data. It complements online transaction
processing and online analytical processing, but it is not a replacement for
relational database systems.
Hadoop is not something that can solve every
application or datacentre problem. It solves some specific problems for some
companies and organisations, but only after they have understood the technology
and where it is appropriate. If you start using Hadoop in the belief that it is a
drop-in replacement for your database or SAN filesystem, you will be
disappointed.
Recommendations by Ted Dunning: How to start using Hadoop [3]
“The best way to do it is to get in touch with the communities, and the best way always is in person. If you live near a big city there is almost certainly a big data or Hadoop MeetUp that meets roughly monthly. And then there are specialized things like data mining groups that also meet roughly monthly. If you can get to one of those and talk to people there, then you start to hear what they say about how it works, and you can see how you can try these things out. Amazon makes publicly available a number of completely free resources of large data: they have large sets of genomes, they have US Census data, Apache archives of email from years and years, sample data sets that you can try all kinds of experiments on. And I hate to be recommending any commercial service, but Amazon makes available MapReduce, Hadoop in small doses, with their EMR product, so you can write and run small EMR programs against publicly available data really quite easily.
You can then also run a Hadoop cluster. You can run it in the cloud with small clusters; it just takes a few dollars to run a significant amount of hardware for a reasonable amount of time, ten hours, two hours, whatever it takes to learn that. You can also run on cast-off hardware, and that’s how I started. We had ten machines at work that were so unreliable that nobody would use them, and so they were fine for me to experiment with. These sorts of resources are often available if you don’t want to spend a little bit of money for the EC2 instances. So that’s the implementation of the very simple specification of just do it: find people, find mailing lists, and try out on real data. Those people will give you pointers and I think I would be happy to help people as well.”
References: [1] http://en.wikipedia.org/wiki/Apache_Hadoop