Post by Admin | Last updated 2020-09-09
Apache Spark Parallel Processing Explained

Apache Spark is a fast, unified analytics engine for big data and machine learning. Its speed comes from in-memory computing and other optimizations, and it uses RDDs (Resilient Distributed Datasets) to process data in parallel across the processors of a cluster.

It provides easy-to-use APIs for operating on large datasets in multiple programming languages, including APIs for data transformation and familiar DataFrame APIs for manipulating semi-structured data.

Generally, Spark uses a cluster manager to coordinate work across a cluster of machines. A cluster is a group of connected systems that cooperate on computation and data processing.

Every Apache Spark application consists of a driver process and a set of executor processes.

Apache Hadoop still rules the Big Data world and is a primary choice for Big Data analytics, but it cannot be optimized for every type of workload. The main reasons are as follows:

  • Apache Hadoop has no built-in support for iteration, i.e. a data-flow cycle in which the result of a former stage is fed back as the input to the following stage.
  • Intermediate data is constantly written to disk, which is the main cause of high latency in the Hadoop stack. The MapReduce framework is also comparatively slow because it has to support many different structures, formats, and volumes of data.

Apache Spark, in contrast, satisfies the parallel-processing needs of data analytics professionals and has become a standout tool within the Big Data community.

How Spark processes files in parallel

We know that Apache Spark breaks an application into many smaller tasks and assigns them to executors, so the application runs in parallel. But how exactly does Spark divide our code into small tasks and run them in parallel? Let us discuss this in detail.

A Spark cluster has a driver program, which holds the logical execution plan of the application, while the data is processed in parallel by the different workers. The data itself is partitioned across the workers, and those partitions are kept together within the cluster on the same set of systems.

During execution, the driver program ships the code to the worker systems, where each worker processes its own portion of the data. To avoid moving data between systems, the data goes through multiple transformation steps while staying within the same partition. The actions then run on the worker systems, and the output is returned to the driver program.
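
As a rough illustration, here is a minimal sketch of that flow in Scala. The application name, the local master setting, and the input path input.txt are hypothetical placeholders; the driver builds the plan, the transformations run on the workers partition by partition, and the final action brings a result back to the driver.

```scala
import org.apache.spark.sql.SparkSession

object WordLengths {
  def main(args: Array[String]): Unit = {
    // The driver process: builds the logical plan and coordinates the executors.
    val spark = SparkSession.builder()
      .appName("WordLengths")
      .master("local[*]")                  // hypothetical: run locally on all cores
      .getOrCreate()
    val sc = spark.sparkContext

    // Transformations: each executor works on its own partitions of the data.
    val lines     = sc.textFile("input.txt")          // hypothetical input path
    val words     = lines.flatMap(_.split("\\s+"))
    val longWords = words.filter(_.length > 5)

    // The action triggers execution; the result comes back to the driver.
    println(s"Number of long words: ${longWords.count()}")

    spark.stop()
  }
}
```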

Get more insights on data processing from the experts' voice in the Big Data Online Course.

Spark RDD API

The Resilient Distributed Dataset, or RDD, is the trump card of Apache Spark technology and one of its most important distributed data structures. An RDD is sub-divided into partitions that are spread across the different systems of a cluster, and by controlling how it is partitioned, inter-system data shuffling within the cluster can be reduced. Spark also provides a partitionBy operator that redistributes the data of an original RDD and builds a new RDD across the systems of the cluster.
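
To make the partitionBy idea concrete, here is a small sketch, assuming a spark-shell session where sc is the provided SparkContext; the sample (user, score) pairs are invented for illustration.

```scala
import org.apache.spark.HashPartitioner

// A key-value RDD of (userId, score) pairs, spread over 2 partitions.
val scores = sc.parallelize(Seq((1, 10), (2, 20), (1, 30), (3, 40)), numSlices = 2)

// partitionBy redistributes the data and builds a NEW RDD; the original is untouched.
val byUser = scores.partitionBy(new HashPartitioner(4))

println(scores.getNumPartitions)   // 2
println(byUser.getNumPartitions)   // 4
```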

We can also describe an Apache Spark RDD in the following way; the short sketch after the list illustrates these properties.

  • Data collection – an RDD holds data and behaves like a Scala collection.
  • Resilient – RDDs can recover from a failure, so they are fault-tolerant.
  • Partitioned – the framework divides each Apache Spark RDD into smaller pieces of data, known as partitions.
  • Distributed – instead of keeping those partitions on a single system, the framework distributes them across the cluster, making the RDD a distributed data collection.
  • Immutable – an RDD cannot be altered once it is built, so an Apache Spark RDD is a read-only data structure.
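
Here is the short sketch mentioned above, again assuming a spark-shell session where sc is already defined, illustrating the partitioned, distributed, and immutable nature of an RDD.

```scala
// A distributed collection split into 4 partitions across the cluster.
val nums = sc.parallelize(1 to 100, numSlices = 4)
println(nums.getNumPartitions)   // 4

// Immutability: map does not change nums; it produces a brand-new RDD.
val doubled = nums.map(_ * 2)
println(nums.first())            // 1  (the original RDD is unchanged)
println(doubled.first())         // 2
```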

Apache Spark RDDs support two different types of operations:

  • Transformations
  • Actions

Transformations

A transformation builds a new distributed dataset from the current one; in other words, it creates a new RDD from the present RDD. The following are a few common transformations (a short sketch follows the list):

  • map(func) – returns a new distributed dataset formed by passing each element of the source through the function func.
  • filter(func) – returns a new dataset formed by selecting those elements of the source for which func returns true.
  • flatMap(func) – similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).
  • mapPartitions(func) – similar to map, but runs separately on each partition (block) of the RDD, so func must be of type Iterator<T> => Iterator<U> when running on an RDD of type T.
  • sample(withReplacement, fraction, seed) – samples a fraction of the data, with or without replacement, using the given random number generator seed.
  • union(otherDataset) – returns a new dataset containing the union of the elements in the source dataset and the argument.
  • intersection(otherDataset) – returns a new RDD containing the intersection of the elements in the source dataset and the argument.
  • distinct([numTasks]) – returns a new dataset containing the distinct elements of the source dataset.
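
The sketch below exercises most of these transformations in one place, assuming a spark-shell session where sc is the provided SparkContext; the small sample sequences are invented for illustration.

```scala
val a = sc.parallelize(Seq(1, 2, 3, 4, 5))
val b = sc.parallelize(Seq(4, 5, 6))

val doubled     = a.map(_ * 2)                             // map
val evens       = a.filter(_ % 2 == 0)                     // filter
val expanded    = a.flatMap(n => Seq(n, n * 10))           // flatMap: several outputs per input
val perPartSums = a.mapPartitions(it => Iterator(it.sum))  // mapPartitions: one iterator per block
val sampled     = a.sample(withReplacement = false, fraction = 0.5, seed = 42)
val combined    = a.union(b)                               // union
val common      = a.intersection(b)                        // intersection
val unique      = combined.distinct()                      // distinct

println(unique.collect().mkString(", "))  // collect is an action, used here only to show the result
```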


Actions

Actions are generally used to send results back to the driver, and hence they produce a non-distributed value rather than a new RDD. A few common actions are listed below (a short sketch follows the list):

  • reduce(func) – aggregates the elements of the dataset using the function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
  • collect() – returns all the elements of the dataset as an array at the driver program. This is usually useful after a filter or another operation that returns a sufficiently small subset of the data.
  • count() – returns the number of elements in the dataset.
  • first() – returns the first element of the dataset (similar to take(1)).
  • take(n) – returns an array with the first n elements of the dataset.
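
A quick sketch of these actions, assuming a spark-shell session where sc is the provided SparkContext:

```scala
val data = sc.parallelize(Seq(5, 3, 8, 1, 9, 2))

println(data.reduce(_ + _))              // reduce: 28 (the operator is commutative and associative)
println(data.collect().mkString(", "))   // collect: all elements brought back to the driver
println(data.count())                    // count: 6
println(data.first())                    // first: 5
println(data.take(3).mkString(", "))     // take(3): 5, 3, 8
```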

Users can build an RDD using the following two methods.

  • Load data from an external source or location.
  • Build an RDD by transforming another RDD.

These two methods make it easy to build a custom Spark RDD.
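
The two methods look like this in practice, assuming a spark-shell session; the HDFS path below is a hypothetical placeholder.

```scala
// 1. Load data from an external source or location.
val fromFile = sc.textFile("hdfs:///data/events.log")

// 2. Build a new RDD by transforming another RDD.
val errors = fromFile.filter(_.contains("ERROR"))
```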

Parallel Processing of Apache Spark RDD

The parallel processing of a Spark RDD proceeds through the following series of steps:

  • An RDD is built from an external data source, such as a local file or HDFS.
  • The RDD goes through a series of parallel transformations, such as filter, map, and join, where each transformation produces a different RDD that is fed to the next transformation stage.
  • The final phase is the action stage, in which the RDD is exported to an external data source as the result.

The three stages of parallel processing described above form something similar to a topological DAG (directed acyclic graph). Immutability is the key here: once an RDD has been processed this way it cannot be modified or tinkered with in any way. If an RDD is not kept as a cache, it simply feeds the upcoming transformation, which generates the next RDD, which in turn is used to produce some action result.
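
Caching fits into this pipeline as sketched below, assuming a spark-shell session; the input path is a hypothetical placeholder.

```scala
val logs   = sc.textFile("hdfs:///data/events.log")
val parsed = logs.map(_.toLowerCase).filter(_.nonEmpty)

// Cache the transformed RDD so its lineage is not recomputed for every action.
parsed.cache()

println(parsed.count())                              // first action: computes and caches the partitions
println(parsed.filter(_.contains("error")).count())  // later actions reuse the cached partitions
```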

By now you may already have an idea of how fault tolerance works in Cloud and Big Data systems. A dataset is duplicated across different data centers in the case of Cloud systems, or across nodes in the case of Big Data clusters. If a natural calamity, disaster, or other incident affects the dataset in a specific data center or node, the copy of the dataset held in another data center or node is used as a backup.

Advantages of RDDs

The following are the main advantages of RDDs (Resilient Distributed Datasets); the sketch after the list illustrates lazy evaluation and persistence:

  • Lazy evaluation – the data within an RDD is not evaluated until an action triggers the computation.
  • In-memory computation – the data within RDDs is kept in memory rather than on disk, which increases performance many times over.
  • Immutability – once an RDD is built it cannot be changed; any modification produces a new RDD.
  • Fault tolerance – if a worker node goes down, the lost partitions of an RDD can be recomputed from the original data using its lineage.
  • Partitioning – RDDs are divided into smaller blocks called partitions (logical blocks of data); when an action is performed, one task is launched per partition. The number of partitions directly determines the degree of parallelism.
  • Persistence – users can indicate which RDDs they will reuse and choose a storage strategy for them, such as memory or disk.
  • Location stickiness – RDDs can define placement preferences (information about the location of RDD partitions) used when computing partitions. The DAG scheduler places the partitions so that each task runs as close to its data as possible, which speeds up computation.
  • Typed – RDDs come in different types, for example RDD[Long], RDD[Int], and RDD[String].
  • No limitation – there is no limit on the number of RDDs that can be used; users can have as many as they need, bounded only by the size of memory and disk.
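
The sketch below illustrates lazy evaluation and persistence, assuming a spark-shell session where sc is the provided SparkContext.

```scala
import org.apache.spark.storage.StorageLevel

val nums    = sc.parallelize(1 to 1000000, numSlices = 8)
val squares = nums.map(n => n.toLong * n)      // lazy: nothing is computed yet

// Persistence: choose a storage strategy (memory, with spill-over to disk).
squares.persist(StorageLevel.MEMORY_AND_DISK)

// Only when an action is called is the work actually executed (and then cached).
println(squares.count())
```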

Fault Resilience of Apache Spark

Spark takes a different approach to fault resilience. It is essentially a highly efficient, large computation cluster, and it does not have a storage layer of its own the way Hadoop has HDFS. Spark makes two clear assumptions about the workloads that come to it for processing:

Spark expects that the processing time is limited; clearly, the cost of recovery is higher when the processing time is longer.

Apache Spark also expects that external data sources are responsible for data persistence in the parallel processing of data. Therefore, the responsibility for keeping the data stable during processing falls on them.

To compensate for lost data, Spark simply re-executes the earlier steps during processing. Not everything needs to be re-executed from the start: only those blocks of the parent RDDs that were responsible for the faulty partitions need to be recomputed. For narrow dependencies, this process resolves to the same system.

You can think of the re-execution of a lost partition as something similar to lazy DAG execution. Lazy evaluation starts from a leaf node, traces back through the parent nodes, and finally reaches the source node. Compared with plain lazy evaluation, one extra piece of information is needed here: the partition, to identify which part of the parent RDD is required.

Re-executing wide dependencies in this way can result in re-running everything, since a partition may depend on several parent RDDs spread across different machines. How Apache Spark tackles this issue is worth noting: it keeps the intermediate output of a mapper function and sends it to various systems after shuffling it. Keep in mind that Spark performs such operations on multiple mapper functions in parallel.
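
The lineage that Spark replays to rebuild lost partitions can be inspected directly, as in this sketch (spark-shell session assumed; the input path is hypothetical).

```scala
val logs   = sc.textFile("hdfs:///data/events.log")
val errors = logs.filter(_.contains("ERROR")).map(_.split("\t")(0))

// The chain of parent RDDs (the lineage) that would be re-executed after a failure.
println(errors.toDebugString)
```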

Parallel operations on various partitions

RDD operations are applied in parallel to each partition, and tasks are launched on the worker nodes where the data is stored.

Some operations, such as map, flatMap, and filter, work within existing partitions and do not require a shuffle. Others, such as reduceByKey, sortByKey, join, and distinct, repartition the data and shuffle it across the cluster.
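
The difference is easy to observe, as in this sketch (spark-shell session assumed; the sample pairs are invented for illustration).

```scala
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)), numSlices = 4)

val filtered = pairs.filter(_._2 > 1)    // narrow: no shuffle, same number of partitions
println(filtered.getNumPartitions)       // 4

val summed = pairs.reduceByKey(_ + _)    // wide: shuffles data across the cluster
println(summed.partitioner)              // now defined: a HashPartitioner
```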


Operations in different stages

Operations that can run on the same partition are grouped into stages, and the tasks within a stage are pipelined together. Developers should be aware of these operational stages to improve performance.

The list below defines some common Apache Spark terms; the sketch that follows shows how they relate:

  • Job: a set of tasks executed as a result of an action.
  • Stage: a set of tasks within a job that can be executed in parallel.
  • Task: an individual unit of work.
  • Application: an application may include any number of jobs (and their tasks) managed by a single driver.
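
Here is the sketch showing how these terms relate, assuming a spark-shell session; the input path is a hypothetical placeholder.

```scala
val words  = sc.textFile("hdfs:///data/words.txt")
val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

// One action = one job. The shuffle required by reduceByKey splits the job into
// two stages; within each stage the work is pipelined, and one task runs per partition.
counts.collect()
```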

Spark parallel processing example

Many companies use Apache Spark to enhance their business insights. These companies extract large amounts of data from users and use it to improve consumer services. A few of the use cases are as follows:

E-commerce: many e-commerce giants, including eBay and Alibaba, use Apache Spark to enhance the consumer experience. As the e-commerce business grows, many companies are trying to develop richer insights for their online presence.

Healthcare: the framework is used by numerous healthcare businesses to provide their customers with better services. For example, MyFitnessPal uses the technology to help people pursue a healthier lifestyle through diet and exercise. In this way, the healthcare sector benefits from the technology to improve its services.

Media & entertainment: the technology is also widely used in the media industry. Several video streaming websites use it, along with MongoDB, to show relevant ads to their users based on their earlier activity on the site; Netflix is one such platform. As the demand for online video grows, these platforms continue to scale.

Bottom Line

Thus, we have walked through the process of parallel execution in Apache Spark. Apache Spark has kept its word on providing low latency and highly parallel processing for Big Data analytics. A user can run transformation and action operations in widely used programming languages such as Java, Scala, and Python, and the engine offers many stages and strategies to tune. Apache Spark is well suited to parallel execution because of its fast processing ability, and it will only become more useful in the future as it stays closely tied to real-time analytics frameworks. To learn more about this process in practice, go through the Big Data Online Training.