JUMP THROUGH HOOPS – WITH HADOOP

With large amounts of data being generated every microsecond, storing and processing that data becomes a huge problem.

For years, companies struggled to store large volumes of data. They distributed it across various platforms and processed it over a network to get the results they needed, but this was nowhere near the permanent solution that was required. While the processing power of application servers has increased manifold, databases have lagged behind because of their limited capacity and speed. Hadoop emerged as a way to overcome these limitations of capacity and speed.

Today, as many applications are generating big data to be processed, Hadoop plays a significant role in providing a much-needed makeover to the database world.

Some facts about Hadoop:

  • It is open-source software.
  • It is developed and maintained by the Apache Software Foundation.
  • It is a Java-based framework.
  • It processes big data.

This technology solves the storage and processing problems of big data by handling it in a parallel and distributed fashion. The incoming data can be structured, unstructured, or semi-structured. Initially, all the data, irrespective of type, is dumped into HDFS. Later, the data stored in HDFS is processed in parallel using the MapReduce programming model.

The core components of Hadoop:

HDFS: Maintaining the Distributed File System

HDFS is the pillar of Hadoop that maintains the distributed file system. It makes it possible to store and replicate data across multiple servers.

HDFS stands for Hadoop Distributed File System. It stores data in the form of blocks whose size can be configured as required. The blocks are stored on various data nodes, and the master node keeps track of which data node holds each block.
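To make the idea of blocks concrete, here is a minimal sketch of how a file could be cut into fixed-size blocks. The function name `split_into_blocks` is made up for illustration and is not part of any Hadoop API, and the 128 MB block size is an assumed value, since the block size is configurable.

```python
BLOCK_SIZE_MB = 128  # assumed block size; in HDFS this is configurable


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the sizes (in MB) of the blocks a file of this size would occupy."""
    if file_size_mb <= 0:
        return []
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    blocks = [block_size_mb] * full_blocks
    if remainder:
        # The last block only holds whatever data is left over.
        blocks.append(remainder)
    return blocks


print(split_into_blocks(500))  # [128, 128, 128, 116]
```

Note that the last block is smaller than the others: a block only takes up as much space as the data it holds.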

HDFS AT A GLANCE

Vertical scaling means adding more resources to the existing data nodes in order to expand. But what if the volume of data grows exponentially? In that scenario we use horizontal scaling, which means increasing the number of data nodes. To prevent data loss, replicas of each block are stored on several different data nodes.
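A bit of rough arithmetic shows how the two kinds of scaling and replication interact. This is only a back-of-the-envelope sketch: the helper `usable_capacity_tb` is hypothetical, the node counts and disk sizes are made-up numbers, and the replication factor of 3 is an assumption.

```python
def usable_capacity_tb(num_nodes, disk_tb_per_node, replication=3):
    """Raw cluster storage divided by the number of copies kept of each block."""
    return num_nodes * disk_tb_per_node / replication


base = usable_capacity_tb(num_nodes=4, disk_tb_per_node=12)
vertical = usable_capacity_tb(num_nodes=4, disk_tb_per_node=24)    # bigger disks per node
horizontal = usable_capacity_tb(num_nodes=8, disk_tb_per_node=12)  # more nodes

print(base, vertical, horizontal)  # 16.0 32.0 32.0
```

Both options double the usable space here, but vertical scaling is capped by how much a single machine can hold, while horizontal scaling can keep going by adding more nodes.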

HDFS has two kinds of nodes: a NameNode and DataNodes.

Data nodes are the commodity servers where the data is actually stored. They are slave nodes running on commodity hardware, which makes them very affordable.

The NameNode, on the other hand, holds the metadata: information about which data is stored on which data node. The application asks the NameNode where the data lives and then reads or writes the blocks on the data nodes themselves.
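The NameNode's role as a metadata index can be sketched with a plain dictionary that maps each block to the data nodes holding a copy. This is a toy model, not how HDFS is implemented: `place_blocks` and `nodes_for_block` are hypothetical helpers, the round-robin placement is a simplification, and the replication factor of 3 is an assumed value.

```python
def place_blocks(block_ids, data_nodes, replication=3):
    """Toy placement: put each block on `replication` distinct data nodes, round-robin."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    metadata = {}
    n = len(data_nodes)
    for i, block_id in enumerate(block_ids):
        # Consecutive positions modulo n are always distinct nodes.
        metadata[block_id] = [data_nodes[(i + r) % n] for r in range(replication)]
    return metadata


def nodes_for_block(metadata, block_id, failed=()):
    """Return the data nodes that still hold a copy of the block."""
    return [node for node in metadata[block_id] if node not in failed]


nodes = ["dn1", "dn2", "dn3", "dn4"]
metadata = place_blocks(["blk_0", "blk_1"], nodes)

print(metadata["blk_0"])                                   # ['dn1', 'dn2', 'dn3']
print(nodes_for_block(metadata, "blk_1", failed={"dn2"}))  # ['dn3', 'dn4']
```

Even with `dn2` down, `blk_1` can still be read from two other nodes, which is the point of keeping replicas.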

YARN: Yet Another Resource Negotiator

YARN stands for Yet Another Resource Negotiator. It manages and schedules the cluster's resources and decides what should run on each node. The central master node that manages all processing requests is called the Resource Manager. The Resource Manager interacts with Node Managers; every slave node runs its own Node Manager to execute tasks.
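The relationship between the Resource Manager and the Node Managers can be pictured with a small toy scheduler. The classes below are illustrative only and do not mirror the real YARN API. The policy of picking the node with the most free memory is an assumption made for this sketch, not YARN's actual scheduling algorithm.

```python
class NodeManager:
    """Toy stand-in for the per-node agent that runs tasks."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.tasks = []

    def launch(self, task, memory_mb):
        self.free_mb -= memory_mb
        self.tasks.append(task)


class ResourceManager:
    """Toy stand-in for the central master that hands out resources."""

    def __init__(self, node_managers):
        self.node_managers = node_managers

    def schedule(self, task, memory_mb):
        """Run the task on the node with the most free memory; return its name, or None."""
        candidates = [nm for nm in self.node_managers if nm.free_mb >= memory_mb]
        if not candidates:
            return None  # no node can fit the task right now
        chosen = max(candidates, key=lambda nm: nm.free_mb)
        chosen.launch(task, memory_mb)
        return chosen.name


rm = ResourceManager([NodeManager("node1", 2048), NodeManager("node2", 4096)])
print(rm.schedule("map-0", 3000))     # node2
print(rm.schedule("map-1", 2000))     # node1
print(rm.schedule("reduce-0", 2000))  # None
```

The last request returns `None` because neither node has 2000 MB left; a real scheduler would queue it until resources free up.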

MapReduce

MapReduce is a programming model and framework for writing applications that process large data sets using parallel, distributed algorithms inside the Hadoop environment.

It works on the basis of two functions, Map() and Reduce(), that process the data quickly and efficiently. First, the Map function filters and transforms multiple data sets in parallel to produce tuples (key-value pairs). The framework then groups and sorts these tuples by key. Finally, the Reduce function aggregates the values for each key to produce the desired output.
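The classic way to see this in action is a word count. The sketch below imitates the map, group-by-key, and reduce steps in plain Python on a single machine. It does not use Hadoop itself, and the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are made up for illustration.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield (word, 1)


def shuffle(pairs):
    """Group all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: add up the counts for one word."""
    return (key, sum(values))


def word_count(lines):
    pairs = [pair for line in lines for pair in map_words(line)]
    grouped = shuffle(pairs)
    return dict(reduce_counts(k, v) for k, v in sorted(grouped.items()))


print(word_count(["big data big cluster", "data node"]))
# {'big': 2, 'cluster': 1, 'data': 2, 'node': 1}
```

In a real cluster the map calls would run on many nodes at once, each on its own chunk of the data, and the reduce calls would run in parallel too, one per group of keys.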

This programming model was first used by Google to build the index for its search engine.

Hadoop brings data diversity, resilience, and scalability to data management.

Hadoop is a ‘big data’ processing tool. In ML we deal with large amounts of data, so knowing Hadoop becomes an integral part of our journey. With that in mind, I have tried to cover the basics of Hadoop at an introductory level. I hope you like the blog!

If you are comfortable with Java, Hadoop is roughly a 10-hour learning course, plus an extra 5 hours of hands-on training. Give it a try and tick this one off your list. ALL THE BEST!!!
