Abhishek M Kori
Posted on June 16, 2019
Basic introduction to Big Data
What is Apache Hadoop?
“Apache Hadoop is an open source software framework for storage and large scale processing of data-sets on clusters of commodity hardware”
The most significant part of this is the ability to process data at large scale on commodity hardware, i.e. your regular server in a rack.
Understanding distributed computing
I like to understand this concept by drawing an analogy to how an election (or a census) is conducted in a country.
- Each district or constituency has polling booths where people come and cast their votes.
- At the end of the election, each ballot box (EVM) tallies the votes for each candidate.
- Data from the ballot boxes at different polling stations is added together to find the final tally of votes for each political party.

You can see how this concept maps to large scale computation on data. We store a single file in chunks across multiple servers. Whenever a calculation needs to be done, we run it on each server and collect the results. The partial results from each server are then combined to produce the final answer. This is the core concept of distributed computing (a small sketch follows below).
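To make this concrete, here is a minimal, self-contained Python sketch (no Hadoop involved; the data and names are invented for illustration) that mimics the election analogy: each "polling station" tallies its own votes locally, and the partial tallies are then combined into the final result.

```python
from collections import Counter
from functools import reduce

# Votes recorded at three polling stations -- think of these as
# chunks of one large dataset stored on three different servers.
polling_stations = [
    ["party_a", "party_b", "party_a"],
    ["party_b", "party_b", "party_c"],
    ["party_a", "party_c", "party_a"],
]

# Step 1: each station counts its own votes locally (the per-node computation).
local_tallies = [Counter(votes) for votes in polling_stations]

# Step 2: the partial tallies are combined into one final result.
final_tally = reduce(lambda a, b: a + b, local_tallies, Counter())

print(final_tally)  # Counter({'party_a': 4, 'party_b': 3, 'party_c': 2})
```

In a real cluster, step 1 runs in parallel on the machines that already hold the data, and only the small partial tallies travel over the network.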
Features of Hadoop
- Open source: Apache License
- Highly scalable: nodes can be added or removed depending on demand
- Runs on commodity hardware
- Reliable: data and compute are replicated across multiple nodes
Modules of the Apache Hadoop framework
- Hadoop Common
- Hadoop Distributed File System (HDFS)
- Hadoop YARN
- Hadoop MapReduce
Let’s study each of the concepts one by one
Hadoop Distributed File System (HDFS)
HDFS is a file system spread over distributed nodes. It is analogous to the file system on your personal computer, except that the underlying data is distributed across multiple server nodes.
Name node
As we know, our files are broken into chunks (blocks) and stored across multiple servers. To coordinate among the servers in the cluster, we have a component called the “name node”. It keeps track of which blocks of each file live on which servers.
Data node
Data nodes are the servers which store the actual data blocks and do the computation on them.
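To make the split between the name node and the data nodes concrete, here is a purely illustrative Python sketch of the bookkeeping a name node does: splitting a file into fixed-size blocks and recording which data nodes hold each replica. The block size and replication factor below are HDFS defaults; the class and method names are made up for this example, and real HDFS placement is rack-aware rather than round-robin.

```python
import itertools

BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size: 128 MB
REPLICATION = 3                  # HDFS default replication factor

class ToyNameNode:
    """Illustrative only: tracks which data nodes hold each block of a file."""

    def __init__(self, data_nodes):
        self.data_nodes = data_nodes
        self.block_map = {}          # (file_name, block_index) -> [data nodes]

    def add_file(self, file_name, file_size):
        num_blocks = -(-file_size // BLOCK_SIZE)     # ceiling division
        nodes = itertools.cycle(self.data_nodes)
        for i in range(num_blocks):
            # Assign REPLICATION copies of each block, round-robin across nodes.
            self.block_map[(file_name, i)] = [next(nodes) for _ in range(REPLICATION)]

    def locate(self, file_name, block_index):
        return self.block_map[(file_name, block_index)]

nn = ToyNameNode(["dn1", "dn2", "dn3", "dn4"])
nn.add_file("logs.txt", 400 * 1024 * 1024)   # a 400 MB file -> 4 blocks
print(nn.locate("logs.txt", 0))              # ['dn1', 'dn2', 'dn3']
```

When a client reads a file, it first asks the name node where each block lives and then fetches the blocks directly from those data nodes.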
MapReduce engine
MapReduce is a programming model for doing computation in a distributed, parallel computing environment.
Map: the map task processes the data on each data node.
Reduce: the reduce task collects the results from all the map tasks and reduces them to a single result.
The business logic of what the map and reduce tasks should do is supplied by the programmer (see the word count sketch below).
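As a concrete illustration, here is the classic word count job written as a pair of Python scripts in the style used with Hadoop Streaming, which pipes each chunk of input through the mapper on the data nodes and the sorted mapper output through the reducer. The file names mapper.py and reducer.py are just placeholders.

```python
#!/usr/bin/env python3
# mapper.py -- emits "word<TAB>1" for every word in the input.
# One copy of this runs per input split, close to where the data is stored.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums the counts for each word.
# Hadoop sorts the mapper output by key, so all lines for a word arrive together.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")
```

You can test the same logic locally with `cat input.txt | python3 mapper.py | sort | python3 reducer.py` before submitting it to a cluster.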
Job tracker: the job tracker is a process running on the master (name node) which receives jobs from the client.
Task tracker: task trackers are processes which reside on the data nodes. They take care of carrying out the tasks assigned to their respective nodes.
You submit your jobs to the name node (job tracker), which schedules and tracks all the jobs.
Each data node receives tasks from the job tracker, and it is the task tracker's job to carry out and monitor the assigned task. It reports the status of its task back to the job tracker.
This was a basic overview of the Hadoop stack for carrying out big data operations. As I explore this field further, I will post more content on the subject.
Let me know your thoughts on these mammoth data processing systems.