Big data means maintaining data in huge volumes, and it brings two major challenges: the data must be stored and managed carefully, and it must be analyzed carefully. To address these challenges, a messaging system is required.
A messaging system is responsible for transferring data between applications. The applications can concentrate on collecting and maintaining the data without worrying about how it is transported and organized in transit.
A messaging system transfers data in two ways: point to point and publish-subscribe.
In a point-to-point messaging system, the source and destination are fixed before the data is sent, and the data travels securely between these two endpoints. The disadvantage of this system is that all messages are placed in a queue and delivered sequentially. An important message cannot jump ahead of the messages already queued; every message has to wait for its turn. Moreover, a message cannot be sent to several destinations at the same time. The publish-subscribe method was introduced to overcome these problems.
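The queueing behaviour described above can be illustrated with a small in-memory sketch. The class name `PointToPointQueue` and the message strings are made up for illustration; this is a toy model of the idea, not how any particular messaging product is implemented.

```python
from collections import deque


class PointToPointQueue:
    """Toy in-memory queue: one sender, one receiver, strict FIFO order."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        # New messages always join the back of the queue.
        self._messages.append(message)

    def receive(self):
        # The receiver always gets the oldest message; None when the queue is empty.
        return self._messages.popleft() if self._messages else None


queue = PointToPointQueue()
for msg in ["routine-1", "routine-2", "URGENT"]:
    queue.send(msg)

# The urgent message cannot overtake the messages queued before it.
delivery_order = [queue.receive() for _ in range(3)]
print(delivery_order)
```

Because each message is removed from the queue when it is received, only one receiver ever sees it, which is the limitation the next approach addresses.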
In a publish-subscribe system, the data senders are called publishers and the data receivers are called subscribers. One publisher can have multiple subscribers. A real-world example is Dish TV: the publisher is the Dish TV operator and the subscribers are the television viewers, who subscribe only to the channels they need.
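A minimal in-memory sketch of this pattern follows. The `PubSubHub` class, the topic names, and the inbox lists are all hypothetical and exist only to show one published message reaching several subscribers; a real system such as Kafka adds persistence, networking, and much more.

```python
from collections import defaultdict


class PubSubHub:
    """Toy publish-subscribe hub: each topic fans out to all of its subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every subscriber of the topic
        # and return how many subscribers received it.
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


hub = PubSubHub()
alice_inbox, bob_inbox = [], []

# Viewers pick only the channels (topics) they want.
hub.subscribe("sports", alice_inbox.append)
hub.subscribe("sports", bob_inbox.append)
hub.subscribe("news", bob_inbox.append)

sports_deliveries = hub.publish("sports", "match highlights")
news_deliveries = hub.publish("news", "evening bulletin")
print(alice_inbox, bob_inbox)
```

Here the publisher does not need to know who the subscribers are; it only names the topic.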
Kafka is a publish-subscribe messaging system developed at LinkedIn and open-sourced in 2011 (it became a top-level Apache project in 2012). It is commonly paired with Storm and Spark for stream analysis. The system is built on top of the ZooKeeper synchronization service. Kafka can handle large volumes of data at high speed and is responsible for transferring messages between applications for both online and offline message consumption. Benchmarks have reported throughput of around 2 million writes per second. Messages in Kafka are persisted on disk and replicated within the cluster, so data is not lost when a node fails. The major advantages of Kafka are low latency and high fault tolerance.
The architecture of Kafka can be explained with the following diagram:
Before looking at how it works, let us go through the main components of the Kafka ecosystem:
A producer is responsible for sending data to the brokers. When a new broker joins the ecosystem, the producers discover it and start sending data to it as well. In the simplest configuration, the producer does not wait for acknowledgements from the broker and sends data as fast as the broker can handle it.
A broker is a Kafka server instance. Since the data handled in the ecosystem runs into terabytes, a cluster maintains multiple brokers. Each Kafka instance can handle hundreds of thousands of reads and writes per second. Among the brokers that hold a copy of the data, there is one leader and a number of followers. If the leader goes down, one of the followers automatically becomes the new leader.
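The leader and follower arrangement can be sketched as a toy model. The `BrokerGroup` class and the integer broker IDs are assumptions made for illustration, and the "first surviving follower wins" rule is a deliberate simplification: real Kafka tracks leadership per partition and elects the new leader from replicas that are in sync.

```python
class BrokerGroup:
    """Toy model of a set of brokers with one leader and several followers."""

    def __init__(self, broker_ids):
        self.alive = list(broker_ids)
        self.leader = self.alive[0]

    @property
    def followers(self):
        return [b for b in self.alive if b != self.leader]

    def fail(self, broker_id):
        # Remove the failed broker; if it was the leader, promote a follower.
        self.alive.remove(broker_id)
        if broker_id == self.leader:
            # Simplification: promote the first surviving broker.
            self.leader = self.alive[0] if self.alive else None


group = BrokerGroup([1, 2, 3])
print(group.leader, group.followers)   # leader and followers before the failure

group.fail(1)                          # the leader goes down
print(group.leader, group.followers)   # a follower has taken over
```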
The consumer is responsible for reading data from the broker. Since the broker does not acknowledge received data back to the producer, the consumer tracks what it has received through the offset value. When a consumer acknowledges an offset, it means it has received all the messages up to that index; the acknowledged offset is recorded through Apache ZooKeeper. The advantage for the consumer is that it can stop, rewind, or skip to any point in the flow of messages at any instant by choosing an offset.
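Offset tracking can be sketched with a toy append-only log. The names `ToyLog`, `ToyConsumer`, `poll`, `commit`, and `seek` are chosen for illustration and loosely mirror the vocabulary of Kafka clients, but this is an in-memory model under simplifying assumptions, not the Kafka client API.

```python
class ToyLog:
    """Append-only list of records; a record's offset is its index."""

    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1  # offset of the record just written


class ToyConsumer:
    def __init__(self, log):
        self.log = log
        self.position = 0    # offset of the next record to read
        self.committed = 0   # every record below this offset is acknowledged

    def poll(self, max_records=1):
        batch = self.log.records[self.position:self.position + max_records]
        self.position += len(batch)
        return batch

    def commit(self):
        # Acknowledge everything read so far.
        self.committed = self.position

    def seek(self, offset):
        # Jump to any offset: skip ahead or rewind.
        self.position = offset


log = ToyLog()
for i in range(5):
    log.append(f"m{i}")

consumer = ToyConsumer(log)
first_batch = consumer.poll(2)       # reads m0 and m1
consumer.commit()                    # offsets 0 and 1 are now acknowledged

consumer.seek(4)                     # skip ahead
skipped_to = consumer.poll(1)        # reads m4

consumer.seek(consumer.committed)    # rewind to the last acknowledged point
resumed = consumer.poll(1)           # reads m2
print(first_batch, skipped_to, resumed)
```

Because the log itself is never modified by reading, moving the consumer's position is cheap and does not affect other consumers.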
ZooKeeper is responsible for coordinating the brokers with the producers and consumers. Its major role is to notify them about the presence or absence of nodes (for example, a new broker joining or an existing broker failing) and about the transmission of data in the ecosystem.
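The notification role can be pictured with a small sketch. `ToyCoordinator` and its methods are hypothetical names invented for this example; it only imitates the idea of watchers being told when a node appears or disappears and does not use or reproduce the real ZooKeeper API.

```python
class ToyCoordinator:
    """Toy registry that tells watchers when a broker joins or leaves."""

    def __init__(self):
        self._brokers = set()
        self._watchers = []

    def watch(self, callback):
        # Producers and consumers register a callback to be notified.
        self._watchers.append(callback)

    def register(self, broker_id):
        self._brokers.add(broker_id)
        self._notify("joined", broker_id)

    def unregister(self, broker_id):
        self._brokers.discard(broker_id)
        self._notify("left", broker_id)

    def live_brokers(self):
        return sorted(self._brokers)

    def _notify(self, event, broker_id):
        for callback in self._watchers:
            callback(event, broker_id)


events = []
coordinator = ToyCoordinator()
coordinator.watch(lambda event, broker_id: events.append((event, broker_id)))

coordinator.register("broker-1")
coordinator.register("broker-2")
coordinator.unregister("broker-1")   # simulate a broker failure
print(events, coordinator.live_brokers())
```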