Apache Spark is an open-source cluster computing framework. Development began in 2009, and it was released as open source in 2010. It builds on the ideas of Hadoop MapReduce and extends the MapReduce model to support more types of computation efficiently, including in-memory cluster computing, which increases the processing speed of an application. Spark is a general-purpose engine for large-scale data processing. It supports rapid application development for big data and allows code reuse across batch, streaming, and interactive applications. Its most popular use cases include building data pipelines and developing machine learning models. Spark Core, the heart of the project, provides distributed task dispatching, scheduling, and I/O functionality, offering a potentially faster and more flexible alternative to MapReduce. Spark's developers say that it can run workloads up to 100 times faster than MapReduce when processing in memory, and up to 10 times faster when processing on disk.


Apache Spark requires a cluster manager and a distributed storage system. For cluster management, Spark supports its own standalone mode and Hadoop YARN. For distributed storage, it can interface with a wide variety of systems, including Cassandra, the Hadoop Distributed File System (HDFS), and the MapR File System (MapR-FS). Spark also supports a pseudo-distributed local mode for development and testing, where distributed storage is not required and the local file system can be used instead. In that case, Spark runs on a single machine with one executor per CPU core.
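As a small illustration of local mode, Spark's master URL accepts the forms `local[K]` (K worker threads) and `local[*]` (one worker thread per logical core). The helper below, `local_master_url`, is a hypothetical name introduced only for this sketch. It formats the string and does not start Spark.

```python
import os
from typing import Optional


def local_master_url(threads: Optional[int] = None) -> str:
    """Build a Spark master URL for local (single-machine) mode.

    threads=None returns "local[*]", which asks Spark for one worker
    thread per logical core. Hypothetical helper: it only formats text.
    """
    if threads is None:
        return "local[*]"
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return f"local[{threads}]"


if __name__ == "__main__":
    # Print the URL matching this machine's core count (falls back to 1).
    print(local_master_url(os.cpu_count() or 1))
```

A string like this would typically be passed as the `--master` option of `spark-submit`.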

Components:

Apache Spark has the following components. Let us discuss each in detail.

Apache Spark Core:

It is the main component of Spark: the general execution engine on which all other Spark functionality is built. It provides in-memory computing and the ability to reference data sets in external storage systems.

Spark SQL:

It is a component built on top of Spark Core that introduces a data abstraction called SchemaRDD (renamed DataFrame in later Spark releases), which provides support for structured and semi-structured data.

Spark Streaming:

It leverages Spark Core's fast scheduling capability to perform streaming analytics. It ingests data in mini-batches and performs RDD (Resilient Distributed Dataset) transformations on those mini-batches.
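The mini-batch idea can be sketched in plain Python. This is not the Spark Streaming API: the function names are hypothetical, and batches are cut by record count for simplicity, whereas Spark Streaming cuts them by time interval.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def mini_batches(stream: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Cut an incoming stream of records into consecutive fixed-size batches."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_words(batch: List[str]) -> Counter:
    """Per-batch transformation: count the words in the batch's lines."""
    return Counter(word for line in batch for word in line.split())


def process_stream(stream: Iterable[str], batch_size: int) -> List[Counter]:
    """Apply the same transformation to every mini-batch, in arrival order."""
    return [count_words(batch) for batch in mini_batches(stream, batch_size)]
```

Each batch is processed independently with the same transformation, which is what lets a batch engine be reused for streaming work.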

MLlib (Machine Learning Library):

It is a distributed machine learning framework built on top of Spark, taking advantage of Spark's distributed memory-based architecture. According to benchmarks run by the MLlib developers against Alternating Least Squares (ALS) implementations, it is as much as nine times as fast as the Hadoop disk-based version of Apache Mahout.

GraphX:

It is a distributed graph processing framework built on top of Spark. It provides an API for expressing graph computations that can model user-defined graphs by using the Pregel abstraction API.
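To give a feel for the Pregel abstraction, here is a minimal plain-Python sketch of its superstep and message-passing style, applied to single-source shortest paths. It is not the GraphX API. The function name is hypothetical, and the sketch assumes that every vertex appears as a key in the graph and that edge weights are non-negative.

```python
from typing import Dict, List, Tuple

# vertex -> list of (neighbour, edge weight); every vertex must be a key
Graph = Dict[str, List[Tuple[str, float]]]


def pregel_shortest_paths(graph: Graph, source: str) -> Dict[str, float]:
    """Single-source shortest paths, computed one superstep at a time."""
    distance = {vertex: float("inf") for vertex in graph}
    # Superstep 0: only the source receives a message (distance 0).
    inbox: Dict[str, List[float]] = {source: [0.0]}

    while inbox:  # halt when no messages are in flight
        outbox: Dict[str, List[float]] = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:
                # The vertex improved, so it notifies its neighbours.
                distance[vertex] = best
                for neighbour, weight in graph[vertex]:
                    outbox.setdefault(neighbour, []).append(best + weight)
        inbox = outbox
    return distance
```

Each vertex only reads its own messages and writes to its neighbours, which is the property that makes this style of computation easy to distribute.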


Apache Spark can process data from a variety of data repositories, including the Hadoop Distributed File System (HDFS), NoSQL databases, and relational data stores such as Apache Hive. Spark's in-memory processing can increase the performance of big data analytics applications, but it can also perform conventional disk-based processing when the data is too large to fit into the available memory.

Features:

The features of Spark are discussed below.

Speed:

Spark processes data at great speed. It can run applications in a Hadoop cluster up to 100 times faster in memory and up to 10 times faster when running on disk. Its greatest advantage is that it reduces the number of read/write operations to disk by storing intermediate processing data in memory.
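The difference between the two approaches can be sketched with a toy two-stage pipeline in plain Python. This is an illustration of the idea only, not a benchmark and not Spark code. All function names are hypothetical.

```python
import json
import os
import tempfile
from typing import List


def stage_one(numbers: List[int]) -> List[int]:
    """First stage: square every value."""
    return [n * n for n in numbers]


def stage_two(numbers: List[int]) -> int:
    """Second stage: add the values up."""
    return sum(numbers)


def run_via_disk(numbers: List[int]) -> int:
    """Write the intermediate result to a file, then read it back."""
    squared = stage_one(numbers)
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(squared, handle)      # extra disk write
        with open(path) as handle:
            reloaded = json.load(handle)    # extra disk read
    finally:
        os.remove(path)
    return stage_two(reloaded)


def run_in_memory(numbers: List[int]) -> int:
    """Hand the intermediate result straight to the next stage."""
    return stage_two(stage_one(numbers))
```

Both functions return the same answer. The in-memory version simply skips one write and one read per stage boundary, and that saving grows with the number of stages.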

Standalone:

In a standalone deployment, Spark sits on top of the Hadoop Distributed File System (HDFS), and space is allocated for HDFS explicitly. Here, Spark and MapReduce run side by side to cover all Spark jobs on the cluster.

Hadoop YARN:

A major advantage of Spark is that it runs on YARN without any pre-installation or root access. This helps integrate Spark into the Hadoop ecosystem or Hadoop stack, and it allows other components to run on top of the stack.

Advanced Analytics:

Spark supports more than map and reduce operations: it also supports SQL queries, streaming data, machine learning, and graph algorithms.
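For readers new to the map and reduce operations mentioned here, the classic word count can be sketched in plain Python. This is a single-machine illustration with hypothetical function names, not Spark code.

```python
from functools import reduce
from typing import Dict, Iterable, List, Tuple


def map_phase(lines: Iterable[str]) -> List[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in every line."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce: add up the counts emitted for each distinct word."""

    def merge(totals: Dict[str, int], pair: Tuple[str, int]) -> Dict[str, int]:
        word, count = pair
        totals[word] = totals.get(word, 0) + count
        return totals

    return reduce(merge, pairs, {})


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run the map phase, then the reduce phase."""
    return reduce_phase(map_phase(lines))
```

In a cluster, the map phase runs in parallel across machines and the pairs are grouped by word before being reduced. The sketch keeps only the two logical steps.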

Recommended Audience:

Software developers

ETL developers

Project managers

Team leads

Business analysts


There are no strict prerequisites for learning Big Data Hadoop. Some knowledge of OOP concepts is helpful but not mandatory. Our trainers will teach you those concepts if you are not familiar with them.
