In this article, you will get a complete introduction to Apache Spark and explore what Apache Spark is and how it works.
Apache Spark is a general-purpose cluster computing system that supports diverse kinds of large-scale data processing.
It is designed to process data up to petabyte scale, in memory, on disk, or both, with much lower latency than disk-based batch systems.
Apache Spark is a cluster computing framework with in-memory, distributed data processing, created to be easy to use and efficient for programmers.
Apache Spark can run as a standalone system, on Hadoop YARN or Mesos, or it can be embedded in applications as a library.
Spark defines its own execution model, which is different from the MapReduce paradigm used by many other frameworks.
What is Spark?
Spark is a general-purpose, distributed data processing engine used to implement various big data techniques such as batch processing (in the style of MapReduce), stream processing, and machine learning.
Spark itself is written in Scala (a general-purpose programming language) and integrates closely with the Apache Hadoop ecosystem.
It is used to perform data analysis and ML operations on Hadoop Distributed File System (HDFS).
Spark is a data processing engine that is mainly used for Big Data. It provides the ability to work on HDFS, YARN, HBase & Cassandra.
Spark applications run on a cluster, which allows large volumes of data to be processed with parallel computing power.
At the core of Spark is the Resilient Distributed Dataset (RDD), which provides parallel operations such as map and reduce.
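As a minimal sketch of what this looks like in practice (assuming a local PySpark installation; the input file "data.txt" is a hypothetical path), a word count can be expressed as a map followed by a reduce over an RDD:

```python
from pyspark.sql import SparkSession

# Build a local SparkSession; the RDD API is reached through its SparkContext.
spark = SparkSession.builder.appName("WordCountSketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

# "data.txt" is a hypothetical input file; replace it with a real path.
counts = (
    sc.textFile("data.txt")
      .flatMap(lambda line: line.split())   # split lines into words
      .map(lambda word: (word, 1))          # emit (word, 1) pairs
      .reduceByKey(lambda a, b: a + b)      # sum the counts per word in parallel
)

print(counts.take(10))  # action: bring a small sample back to the driver
spark.stop()
```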
Why do We Need Apache Spark?
Apache Spark is a data processing engine that has been created with speed and scalability in mind.
It can be used as a programming library, as an interactive shell environment for data scientists and engineers, and as a standalone cluster computing system.
The programming libraries and shell environment both allow developers to leverage the power of distributed memory across clusters to work with big datasets.
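For example, used as a library, a minimal driver program just creates a SparkSession (a sketch, assuming PySpark is installed; in the interactive `pyspark` shell the same `spark` object is provided automatically):

```python
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; this is the entry point for the DataFrame API.
spark = (
    SparkSession.builder
    .appName("SparkAsALibrary")
    .master("local[*]")   # run locally with as many worker threads as cores
    .getOrCreate()
)

# A tiny in-memory DataFrame, just to show the session is usable.
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.show()

spark.stop()
```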
Spark Runs Everywhere
Spark, managed by the Apache Software Foundation, is an open-source framework and one of the fastest data processing engines available.
It has emerged as the engine of choice for large-scale data analytics across many different industries, powering everything from machine learning to interactive SQL queries.
For this reason, it is important that the framework is able to run on a multitude of operating systems and environments, letting users take advantage of their existing infrastructure environment while still gaining access to all of Spark’s capabilities.
Apache Spark can run on the following platforms:
- Hadoop
- Mesos
- Kubernetes
- Standalone
- Cloud Platforms
What are the Data Sources?
The Data Sources in Spark provide an abstraction layer to work with different data sources.
Apache Spark supports both cloud-native and traditional data sources. Spark allows users to work with these data sources as RDDs (resilient distributed datasets), or to write code that interacts with them through DataFrames and Datasets, which organize the data into tables with schemas and partitions.
If you are thinking about using Apache Spark for big data processing, the question on your mind is probably: what data sources does Spark support?
Well, that depends on whether you’re working in RDDs or DataFrames. RDDs (Resilient Distributed Datasets) are Spark’s fundamental abstraction for working with different data source types.
These data source types include text files, various binary formats, Cassandra tables, and HBase tables.
DataFrames are a higher-level abstraction for manipulating structured data in a relational way, and they can read from and write to sources such as JSON and Parquet files, JDBC databases, NoSQL stores like HBase and Cassandra (via connectors), and object storage such as Amazon S3.
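As a brief sketch (the file paths and the S3 bucket below are hypothetical), reading a few common sources through the DataFrame API looks like this:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DataSourcesSketch").master("local[*]").getOrCreate()

# Local CSV file with a header row (hypothetical path).
csv_df = spark.read.option("header", "true").csv("people.csv")

# JSON file, one JSON object per line (hypothetical path).
json_df = spark.read.json("events.json")

# Parquet stored in S3 (hypothetical bucket; requires the Hadoop S3A connector on the classpath).
# s3_df = spark.read.parquet("s3a://my-bucket/path/to/data/")

csv_df.printSchema()
json_df.show(5)

spark.stop()
```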
What are the Spark Setup Modes?
There are four common modes in which Apache Spark can be set up – local mode, standalone mode, Mesos mode, and YARN mode (with Kubernetes available as a further option).
Local Mode: Spark runs in a single JVM on your machine; this is useful for development and testing where you have all of the necessary libraries installed.
Standalone Mode: Spark's built-in cluster manager; it lets you launch a cluster of workers and run a driver program with a predefined number of executors.
Mesos Mode: If you're running Apache Mesos as your cluster manager, this is the mode to use.
YARN Mode: This is for Hadoop YARN clusters only.
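In practice, the mode mostly comes down to the master URL the application is given. A hedged sketch (the host names and ports below are placeholders):

```python
from pyspark.sql import SparkSession

# The cluster manager is selected by the master URL (host names/ports are placeholders).
local_master = "local[*]"                       # local mode: driver and executors in one JVM
standalone_master = "spark://master-host:7077"  # standalone mode: Spark's own master process
mesos_master = "mesos://mesos-host:5050"        # Mesos mode: the Mesos master
yarn_master = "yarn"                            # YARN mode: location comes from the Hadoop config

spark = (
    SparkSession.builder
    .appName("SetupModeSketch")
    .master(local_master)   # swap in one of the other master URLs to change modes
    .getOrCreate()
)
print(spark.sparkContext.master)
spark.stop()
```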
Apache Spark Architecture
The Spark architecture is designed to take advantage of multi-core CPUs and clusters of machines.
It divides data across the nodes in different ways so that it can be processed in parallel.
Apache Spark can run on a variety of operating systems, including Linux, Windows, and macOS.
Spark is an open-source system for exploring large amounts of data using clusters of computers.
In the world of Big Data, tools and frameworks keep on evolving faster than we can handle.
Spark is one framework rising in popularity and being adopted by a large number of companies because of its proven speed (and scalability).
Spark can read the same data as Hadoop MapReduce (for example, from HDFS), but it is not built on MapReduce; it runs on cluster managers such as its own standalone manager, YARN, Mesos, or Kubernetes.
The Spark API exposes two types of operations: transformations and actions.
Spark is a data processing framework developed under the Apache Software Foundation.
This open-source architecture quickly gained popularity for its fast analytical and data processing capabilities and is used by big players such as Yahoo, LinkedIn, eBay, and others.
Related Article: Apache Spark Architecture Explained in Detail.
What is RDD?
If you’ve ever tried to learn Apache Spark, you probably first heard about RDDs before DataFrames.
What are RDDs? RDDs (Resilient Distributed Datasets) are the fundamental abstraction for working with different data source types in Apache Spark.
DataFrames are Spark’s evolution of RDDs. They tend to be easier to work with and more convenient, which is why they are used instead of RDDs in most situations.
But sometimes, people coming from an RDD background get a bit confused when someone tells them that some operation is only available on DataFrames.
RDD is short for Resilient Distributed Dataset; it sits at the core of Apache Spark and is a fault-tolerant collection of elements that can be operated on in parallel.
Spark's execution engine achieves fault tolerance at scale not by replicating RDD data across nodes, but by tracking the lineage of transformations so that lost partitions can be recomputed on other nodes.
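A minimal sketch (assuming a local PySpark session) of creating an RDD and inspecting its partitioning and lineage:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("RDDSketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

# Distribute a local collection across 4 partitions.
rdd = sc.parallelize(range(1, 1001), numSlices=4)
squares = rdd.map(lambda x: x * x)

print(squares.getNumPartitions())                 # how the data is split for parallel work
print(squares.toDebugString().decode("utf-8"))    # the lineage used to recompute lost partitions

spark.stop()
```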
How does RDD Work in Spark?
Resilient Distributed Datasets, or RDDs for short, are the core abstraction in Apache Spark.
It allows you to work with data distributed in a cluster. A cluster consists of multiple nodes and each node has multiple cores.
Data in a Spark application is processed by dividing it into multiple partitions.
More than one partition of data can be present on a single node. Unless an RDD is explicitly cached or persisted, its partitions are not kept after they are used; they are recomputed from the lineage whenever they are needed again.
This makes Spark different from MapReduce: although both are based on parallel processing, they manage intermediate data differently.
MapReduce writes intermediate results to disk between stages, whereas Spark keeps intermediate data in memory whenever possible and lets you cache it explicitly, which is a major source of its speed advantage.
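A short sketch of how caching changes this behaviour (local session assumed; the filter used here is arbitrary):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("CacheSketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

numbers = sc.parallelize(range(1_000_000))
evens = numbers.filter(lambda x: x % 2 == 0)

# Without cache(), each action below would recompute the filter from scratch.
evens.cache()            # keep the computed partitions in executor memory

print(evens.count())     # first action: computes and caches the partitions
print(evens.sum())       # second action: served from the cached partitions

spark.stop()
```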
What are the RDD Operations?
In Spark, there are two types of operations you can perform on an RDD: transformations and actions.
Transformations build a new child RDD from an existing one (lazily, much like a view over the data), while actions trigger computation and return a result to the driver.
In the following sections, we describe the crucial operations, transformations and actions, used when working with Spark.
RDD Transformations
Transformations operate on a parent RDD and return a new child RDD that records a dependency on its parent; nothing is computed until an action is called.
Transformations such as map(), filter(), groupByKey(), reduceByKey(), and aggregateByKey() can be applied to a parent RDD and return a new RDD whose dependency on the parent is either narrow or wide.
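A minimal sketch (local session assumed) showing that transformations only describe the child RDD and are not executed until an action runs:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("TransformationsSketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)])

# Each transformation returns a new RDD immediately; no job runs yet.
doubled = pairs.mapValues(lambda v: v * 2)        # narrow transformation
summed = doubled.reduceByKey(lambda a, b: a + b)  # wide transformation (shuffle)

print(summed.collect())  # the action triggers execution of the whole lineage

spark.stop()
```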
RDD Actions
Actions return records or aggregated results to the driver without creating a child RDD.
Actions such as collect(), count(), first(), and take() return plain values to the driver program rather than new RDDs.
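A short sketch of a few common actions (local session assumed):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ActionsSketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([5, 3, 8, 1, 9])

print(rdd.count())    # 5            -> number of elements
print(rdd.first())    # 5            -> first element
print(rdd.take(3))    # [5, 3, 8]    -> a small sample on the driver
print(rdd.collect())  # [5, 3, 8, 1, 9] -> the whole dataset on the driver (use with care)

spark.stop()
```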
What are the RDD Dependencies?
What are RDD dependencies for? RDDs are "resilient" because each child RDD records which partitions of its parent(s) it was derived from; this recorded relationship is called a dependency, and Spark uses it to recompute lost partitions after a failure.
Transformations (e.g., map, filter, repartition) applied to a parent RDD create a new child RDD whose dependency on the parent is either narrow or wide.
Apache Spark RDDs have two types of dependencies, narrow and wide, which determine how the execution flow is split into stages and how the lineage is built.
Understanding the different types of dependencies available in RDDs is fundamental to the concept.
The two types of dependencies in an RDD are as follows:
Narrow Dependency
A narrow dependency exists when each partition of the child RDD depends on at most one partition of the parent RDD.
Narrow dependencies arise from local transformations on the RDD that do not involve data shuffling.
With operations such as map or flatMap, each output partition of the child RDD is produced from a single partition of the parent RDD, so the data stays on the same node and no shuffle is needed.
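A small sketch of narrow transformations (local session assumed); the partitioning is preserved and no shuffle is triggered:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("NarrowDepSketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

parent = sc.parallelize(range(100), numSlices=4)

# map and filter are narrow: each child partition is built from exactly one parent partition.
child = parent.map(lambda x: x * 10).filter(lambda x: x % 3 == 0)

print(parent.getNumPartitions(), child.getNumPartitions())  # both 4, no shuffle happened

spark.stop()
```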
Wide Dependency
A wide dependency exists when a partition of the child RDD depends on multiple partitions of the parent RDD.
Typical wide transformations include groupByKey(), reduceByKey(), join(), and repartition().
These transformations redistribute records so that values belonging together (for example, sharing a key) end up in the same child partition, which requires moving data between partitions and usually between nodes; this data movement is the shuffle.
These transformations take a number of options that are documented in the official Spark programming guide if you are interested in looking at them more closely.
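A small sketch of a wide transformation (local session assumed); the shuffle groups the data by key into a new partitioning:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WideDepSketch").master("local[*]").getOrCreate()
sc = spark.sparkContext

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("c", 4)], numSlices=4)

# reduceByKey is wide: values for the same key may live in different parent partitions,
# so Spark shuffles them together before reducing.
totals = pairs.reduceByKey(lambda x, y: x + y, numPartitions=2)

print(totals.getNumPartitions())  # 2 -> the shuffle produced a new partitioning
print(sorted(totals.collect()))   # [('a', 4), ('b', 2), ('c', 4)]

spark.stop()
```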
Conclusion
Spark is one of the fastest data processing engines available and is widely used in industry for large-scale data processing.
It provides high-level APIs in Java, Scala, Python, and R, and an optimized engine that supports a variety of distributed storage systems, including HDFS (the Hadoop Distributed File System) and HBase.