We live in a world driven by data: data that may be structured, semi-structured, or unstructured.
According to one estimate, by the year 2020 our accumulated digital universe of data will have grown from 4.4 zettabytes today to around 44 zettabytes (44 trillion gigabytes, or 44,000,000,000,000 GB).
In the present scenario, how an organisation shapes its data strategy will determine its future.
What is Big Data?
Big Data, in layman’s terms, is data that is too big to be processed on a single machine: data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it.
“Fun Fact: We now create as much information in just a few minutes as we did from the dawn of man through 2003.”
The world of big data is governed by its three dimensions (the three Vs):
1. Volume refers to the amount of data.
2. Variety refers to the number of types of data.
3. Velocity refers to the speed at which data is generated and processed.
The real challenge in today’s world is to extract meaningful information from both structured and unstructured data.
Analyzing Big Data: Hadoop
Hadoop is an Apache open source framework written in Java that allows distributed processing of large datasets across clusters of computers using simple programming models. It is optimized to handle massive quantities of data, whether structured, semi-structured, or unstructured, using commodity hardware, that is, relatively inexpensive computers. The Hadoop framework can be easily understood using the figure below. We will cover HDFS and MapReduce in this post; the rest of the Hadoop ecosystem tools will be discussed in upcoming posts.
- HDFS (Hadoop Distributed File System) stores data on the cluster.
- MapReduce processes data on the cluster.
HDFS: Hadoop Distributed File System
HDFS stores a file by splitting it into chunks called blocks and spreading those blocks across the cluster. Each block is large; by default its size is 64 MB. As the file is uploaded to HDFS, each block is stored on a node in the cluster, and every node runs a daemon (a piece of software running on each machine) called the DataNode. The DataNode performs read and write operations on the file system as the client requests. Hadoop replicates each block multiple times (three by default), so even if one node fails it is not a problem, because other copies remain.
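The arithmetic of block splitting and replication can be sketched in a few lines of Python. This is a minimal illustration, not Hadoop code; the function name is ours, and the block size and replication factor are the HDFS defaults mentioned above.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # default HDFS block size: 64 MB
REPLICATION = 3                # default HDFS replication factor

def blocks_needed(file_size_bytes):
    """Number of HDFS blocks a file of the given size occupies."""
    return math.ceil(file_size_bytes / BLOCK_SIZE)

# A 200 MB file splits into 4 blocks (three full 64 MB blocks plus one 8 MB
# block); with 3x replication the cluster stores 12 block copies in total.
blocks = blocks_needed(200 * 1024 * 1024)
print(blocks)                # 4
print(blocks * REPLICATION)  # 12
```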
Since we need to know which blocks combine to make up the original file, another node is needed: the NameNode. There is a single NameNode in the system, and it stores the metadata (file and directory permissions, ownerships, and the blocks assigned to each file). That is why the NameNode is a single point of failure. This risk can be mitigated with a Secondary NameNode (SNN), which does not act as a standby NameNode but periodically checkpoints the primary NameNode’s namespace image; that checkpoint can be used as a backup to restore the NameNode. There is only one NameNode in a cluster and many DataNodes.
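A toy model of the NameNode’s role might look like the following. The dictionary layout, block IDs, and node names here are purely illustrative (Hadoop’s real metadata structures are more involved), but they show the two lookups the NameNode answers: which blocks make up a file, and which DataNodes hold each block.

```python
# Illustrative NameNode metadata: files -> ordered block IDs,
# and blocks -> DataNodes holding a replica.
namenode_metadata = {
    "files": {
        "/logs/access.log": ["blk_001", "blk_002"],
    },
    "block_locations": {
        "blk_001": ["datanode1", "datanode2", "datanode3"],
        "blk_002": ["datanode2", "datanode3", "datanode4"],
    },
}

def locate_blocks(path):
    """For each block of the file, pick one DataNode holding a replica
    (here simply the first in the list) to read the block from."""
    return [(blk, namenode_metadata["block_locations"][blk][0])
            for blk in namenode_metadata["files"][path]]

print(locate_blocks("/logs/access.log"))
```

Because every replica of a block is interchangeable, a client can read from any of the listed DataNodes, which is also why losing a single node is tolerable.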
MapReduce is the heart of the Hadoop ecosystem. Processing a big file on a single machine takes a huge amount of time. MapReduce is designed to process data in parallel: the input data is split into chunks and processed in parallel across the nodes in your cluster.
The MapReduce algorithm contains two important phases:
- Mapper – The Map task takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key-value pairs).
- Reducer – This phase consists of shuffling, sorting, and reducing the data produced by the Mapper. Shuffling is the process of transferring data from the mappers to the reducers, and the data is sorted automatically before reaching the reducers. The Reduce task takes the shuffled and sorted output as its input and combines those data tuples (key-value pairs) into a smaller set of tuples.
The whole MapReduce process can be easily understood with an example. Suppose we have three sentences in unprocessed form, as displayed in the input box of the figure below, and we want to count the number of occurrences of each word across the whole file; the steps displayed in the figure are then executed sequentially.
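The word-count steps can be simulated in plain Python. This sketch runs everything in one process rather than across a cluster, and the sample sentences are our own stand-in for the figure’s input, but the map, shuffle/sort, and reduce phases mirror what Hadoop does.

```python
from collections import defaultdict

def mapper(line):
    # Map phase: emit a (word, 1) pair for every word in the line.
    for word in line.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle/sort phase: group all values by key, sorted by key,
    # so each reducer sees every value for its key together.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    # Reduce phase: combine each word's counts into a single total.
    return (key, sum(values))

lines = ["deer bear river", "car car river", "deer car bear"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(key, values) for key, values in shuffle(mapped))
print(counts)  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```

In a real cluster, each mapper would run on the node holding its chunk of the input, and the shuffle would move data over the network to the reducers.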
MapReduce logic, unlike other data frameworks, is not restricted to structured datasets; the Mapper brings structure to unstructured data. Consider a scenario where we need to analyse the number of photographs on my laptop by the location where each photo was taken. In this case, the mapper makes (key, value) pairs from the dataset: the key is the location and the value is the photograph. Once the mapper is done with its task, the entire dataset has a structure.
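The photo scenario above can be sketched the same way. The filenames and locations here are made up for illustration; the point is that the mapper imposes a (location, photo) structure on otherwise unstructured files, after which reducing is a simple count per key.

```python
from collections import Counter

# Hypothetical unstructured input: photo files tagged with where they were taken.
photos = [
    ("IMG_001.jpg", "Paris"),
    ("IMG_002.jpg", "Delhi"),
    ("IMG_003.jpg", "Paris"),
]

# Map phase: key = location, value = photograph.
pairs = [(location, filename) for filename, location in photos]

# Reduce phase: count photographs per location.
photos_per_location = Counter(location for location, _ in pairs)
print(dict(photos_per_location))  # {'Paris': 2, 'Delhi': 1}
```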
MapReduce processing in Hadoop 1.0 is handled by the JobTracker and TaskTracker daemons. When we run a MapReduce job, we submit it to the JobTracker, which splits the work into map and reduce tasks; running the actual tasks is handled by a daemon called the TaskTracker. The TaskTracker runs on the individual nodes, and since it runs on the same machine as the DataNode, the Hadoop framework can work directly on the local pieces of data, saving a lot of network traffic. If a TaskTracker fails, the JobTracker reschedules its tasks on another node that holds a replica of the data.
To wrap up for now, we have covered the basic aspects of big data, Hadoop, HDFS, and MapReduce. The next post in this series will cover other utility packages in the Hadoop framework.
Hope you had a good read on the topic. Any comments or suggestions are highly welcome. 🙂