How Spark Works with Big Data

Rakha Asyrofi
13 min read · Jul 5, 2020


Why Learn Spark?

Spark is currently one of the most popular tools for big data analytics. You may have heard of other tools such as Hadoop. Hadoop is a slightly older technology, although it is still used by some companies. Spark is generally faster than Hadoop, which is why Spark has become more popular over the last few years.

Figure 1. Spark illustration

There are many other big data tools and systems, each with its own use case. For example, there are database systems like Apache Cassandra and SQL query engines like Presto. But Spark is still one of the most popular tools for analyzing large data sets.

Here is an outline of the topics we are covering in this lesson:

  • What is big data?
  • Review of the hardware behind big data
  • Introduction to distributed systems
  • Brief history of Spark and big data
  • Common Spark use cases
  • Other technologies in the big data ecosystem

In the next few videos, you’ll learn about four key hardware components. Understanding these components helps determine whether you are working on a “big data” problem or if it’s easier to analyze the data locally on your own computer.

CPU (Central Processing Unit)
The CPU is the “brain” of the computer. Every process on your computer is eventually handled by your CPU. This includes calculations and also instructions for the other components of the computer.

Memory (RAM)
When your program runs, data gets temporarily stored in memory before getting sent to the CPU. Memory is ephemeral storage — when your computer shuts down, the data in the memory is lost.

Storage (SSD or Magnetic Disk)
Storage is used for keeping data over long periods of time. When a program runs, the CPU will direct the memory to temporarily load data from long-term storage.

Network (LAN or the Internet)
Network is the gateway for anything that you need that isn’t stored on your computer. The network could connect to other computers in the same room (a Local Area Network) or to a computer on the other side of the world, connected over the internet.

Other Numbers to Know?
You may have noticed a few other numbers involving the L1 and L2 Cache, mutex locking, and branch mispredicts. While these concepts are important for a detailed understanding of what’s going on inside your computer, you don’t need to worry about them for this course. If you’re curious to learn more, check out Peter Norvig’s original blog post from a few years ago, and an interactive version for today’s hardware.

The CPU is the brains of a computer. The CPU has a few different functions including directing other components of a computer as well as running mathematical calculations. The CPU can also store small amounts of data inside itself in what are called registers. These registers hold data that the CPU is working with at the moment.

For example, say you write a program that reads in a 40 MB data file and then analyzes the file. When you execute the code, the instructions are loaded into the CPU. The CPU then instructs the computer to take the 40 MB from disk and store the data in memory (RAM). If you want to sum a column of data, then the CPU will essentially take two numbers at a time and sum them together. The accumulation of the sum needs to be stored somewhere while the CPU grabs the next number.

This cumulative sum will be stored in a register. The registers make computations more efficient: the registers avoid having to send data unnecessarily back and forth between memory (RAM) and the CPU.

Beyond the fact that memory is expensive and ephemeral, we’ll learn that for most use cases in the industry, memory and CPU aren’t the bottleneck. Instead, the storage and network, which you’ll learn about in the next videos, slow down many of the tasks you’ll work on in the industry.

Medium Data Numbers

If a dataset is larger than the size of your RAM, you might still be able to analyze the data on a single computer. By default, the Python pandas library will read in an entire dataset from disk into memory. If the dataset is larger than your computer’s memory, the program won’t work.

However, the Python pandas library can read in a file in smaller chunks. Thus, if you were going to calculate summary statistics about the dataset such as a sum or count, you could read in a part of the dataset at a time and accumulate the sum or count.

Here is an example of how this works.
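
Below is a minimal sketch of that idea, assuming a hypothetical large_dataset.csv file with an amount column: pandas reads the file in chunks, and a count and sum are accumulated across chunks instead of loading everything into memory at once.

```python
import pandas as pd

# Hypothetical file and column names -- adjust to your own dataset.
total_rows = 0
running_sum = 0.0

# Read the CSV in chunks of 100,000 rows instead of all at once.
for chunk in pd.read_csv("large_dataset.csv", chunksize=100_000):
    total_rows += len(chunk)
    running_sum += chunk["amount"].sum()

print(f"rows: {total_rows}, sum of amount: {running_sum}")
```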

Hadoop Vocabulary

Here is a list of some terms associated with Hadoop. You’ll learn more about these terms and how they relate to Spark in the rest of the lesson.

  • Hadoop — an ecosystem of tools for big data storage and data analysis. Hadoop is an older system than Spark but is still used by many companies. The major difference between Spark and Hadoop is how they use memory. Hadoop writes intermediate results to disk whereas Spark tries to keep data in memory whenever possible. This makes Spark faster for many use cases.
  • Hadoop MapReduce — a system for processing and analyzing large data sets in parallel.
  • Hadoop YARN — a resource manager that schedules jobs across a cluster. The manager keeps track of what computer resources are available and then assigns those resources to specific tasks.
  • Hadoop Distributed File System (HDFS) — a big data storage system that splits data into chunks and stores the chunks across a cluster of computers.

As Hadoop matured, other tools were developed to make Hadoop easier to work with. These tools included:

  • Apache Pig — a high-level scripting language that runs on top of Hadoop MapReduce
  • Apache Hive — another SQL-like interface that runs on top of Hadoop MapReduce

Oftentimes when someone is talking about Hadoop in general terms, they are actually talking about Hadoop MapReduce. However, Hadoop is more than just MapReduce. In the next part of the lesson, you’ll learn more about how MapReduce works.

How is Spark related to Hadoop?

Spark, which is the main focus of this course, is another big data framework. Spark contains libraries for data analysis, machine learning, graph analysis, and streaming live data. Spark is generally faster than Hadoop. This is because Hadoop writes intermediate results to disk whereas Spark tries to keep intermediate results in memory whenever possible.

The Hadoop ecosystem includes a distributed file storage system called HDFS (Hadoop Distributed File System). Spark, on the other hand, does not include a file storage system. You can use Spark on top of HDFS but you do not have to. Spark can read in data from other sources as well such as Amazon S3.

Streaming Data

Data streaming is a specialized topic in big data. The use case is when you want to store and analyze data in real-time such as Facebook posts or Twitter tweets.

Spark has a streaming library called Spark Streaming, although it is not as popular or as fast as some other streaming libraries. Other popular streaming libraries include Storm and Flink. Streaming won’t be covered in this course, but you can follow these links to learn more about these technologies.

MapReduce

MapReduce is a programming technique for manipulating large data sets. “Hadoop MapReduce” is a specific implementation of this programming technique.

The technique works by first dividing up a large dataset and distributing the data across a cluster. In the map step, each piece of data is analyzed and converted into a (key, value) pair. These key-value pairs are then shuffled across the cluster so that all pairs with the same key end up on the same machine. In the reduce step, the values that share a key are combined together.

While Spark doesn’t implement Hadoop MapReduce, you can write Spark programs that behave in a similar way to the map-reduce paradigm. A minimal sketch follows, and the next section walks through a fuller code example.
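
This hedged sketch uses PySpark’s RDD API for a word count: flatMap and map play the role of the map step, the shuffle happens inside reduceByKey, and reduceByKey combines values that share a key. The input lines are made up for illustration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount_sketch").getOrCreate()
sc = spark.sparkContext

lines = sc.parallelize(["spark is fast", "hadoop writes to disk", "spark is popular"])

counts = (
    lines.flatMap(lambda line: line.split())   # map: break lines into words
         .map(lambda word: (word, 1))          # map: emit (key, value) pairs
         .reduceByKey(lambda a, b: a + b)      # reduce: combine values sharing a key
)

print(counts.collect())
spark.stop()
```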

Spark Use Cases and Resources

Here are a few resources about different Spark use cases:

You Don’t Always Need Spark

Spark is meant for big data sets that cannot fit on one computer. But you don’t need Spark if you are working on smaller data sets. For data sets that fit on your local computer, there are many other options you can use to manipulate data, such as:

  • AWK — a command line tool for manipulating text files
  • R — a programming language and software environment for statistical computing
  • Python PyData Stack, which includes pandas, Matplotlib, NumPy, and scikit-learn among other libraries

Sometimes, you can still use pandas on a single, local machine even if your data set is only a little bit larger than memory. Pandas can read data in chunks. Depending on your use case, you can filter the data and write out the relevant parts to disk.
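
For example, here is a hedged sketch of the filter-and-write pattern, with hypothetical file, column, and threshold names: each chunk is filtered and only the relevant rows are appended to a smaller file on disk.

```python
import pandas as pd

# Illustrative names: "events.csv", the "duration" column, and the output path.
first_chunk = True
for chunk in pd.read_csv("events.csv", chunksize=50_000):
    relevant = chunk[chunk["duration"] > 60]           # keep only the rows we care about
    relevant.to_csv("events_filtered.csv",
                    mode="w" if first_chunk else "a",  # overwrite once, then append
                    header=first_chunk,                # write the header only once
                    index=False)
    first_chunk = False
```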

If the data is already stored in a relational database such as MySQL or Postgres, you can leverage SQL to extract, filter and aggregate the data. If you would like to leverage pandas and SQL simultaneously, you can use libraries such as SQLAlchemy, which provides an abstraction layer to manipulate SQL tables with generative Python expressions.
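
A minimal sketch of combining the two, assuming a hypothetical Postgres connection string and a users table (the database driver, e.g. psycopg2, must be installed separately):

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string and table name.
engine = create_engine("postgresql://user:password@localhost:5432/mydb")

# Let the database do the filtering and aggregation, then pull a small result into pandas.
query = "SELECT country, COUNT(*) AS n_users FROM users GROUP BY country"
df = pd.read_sql(query, engine)
print(df.head())
```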

The most commonly used Python Machine Learning library is scikit-learn. It has a wide range of algorithms for classification, regression, and clustering, as well as utilities for preprocessing data, fine tuning model parameters and testing their results. However, if you want to use more complex algorithms — like deep learning — you’ll need to look further. TensorFlow and PyTorch are currently popular packages.
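
For instance, a quick scikit-learn sketch on its built-in iris toy dataset, just to show the fit/score workflow:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Load a small toy dataset and hold out a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))
```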

Spark’s Limitations

Spark has some limitations.

Spark Streaming’s latency is at least 500 milliseconds, since it operates on micro-batches of records instead of processing one record at a time. Native streaming tools such as Storm, Apex, or Flink can push this latency lower and might be more suitable for low-latency applications. Flink and Apex can be used for batch computation as well, so if you’re already using one of them for stream processing, there’s no need to add Spark to your stack of technologies.

Another limitation of Spark is its selection of machine learning algorithms. Currently, Spark only supports algorithms that scale linearly with the input data size. In general, deep learning is not available either, though there are many projects that integrate Spark with TensorFlow and other deep learning tools.

Hadoop versus Spark

The Hadoop ecosystem is a slightly older technology than the Spark ecosystem. In general, Hadoop MapReduce is slower than Spark because Hadoop writes data out to disk during intermediate steps. However, many big companies, such as Facebook and LinkedIn, started using Big Data early and built their infrastructure around the Hadoop ecosystem.

While Spark is great for iterative algorithms, there is not much of a performance boost over Hadoop MapReduce when doing simple counting. Migrating legacy code to Spark, especially on hundreds of nodes that are already in production, might not be worth the cost for the small performance boost.

Beyond Spark for Storing and Processing Big Data

Keep in mind that Spark is not a data storage system, and there are a number of tools besides Spark that can be used to process and analyze large datasets.

Sometimes it makes sense to use the power and simplicity of SQL on big data. For these cases, a new class of databases, known as NoSQL and NewSQL, has been developed.

For example, you might hear about newer database storage systems like HBase or Cassandra. There are also distributed SQL engines like Impala and Presto. Many of these technologies use query syntax that you are likely already familiar with based on your experiences with Python and SQL.

In the lessons ahead, you will learn about Spark specifically, but know that many of the skills you already have with SQL, Python, and soon enough, Spark, will also be useful if you end up needing to learn any of these additional Big Data tools.

In this lesson, you’ll practice wrangling data with Spark. If you are familiar with both SQL and Python’s pandas library, you’ll notice quite a few similarities with the Spark SQL module and Spark Dataset API.

Lesson Overview

  • Wrangling data with Spark
  • Functional programming
  • Read in and write out data
  • Spark environment and Spark APIs
  • RDD API

Functions

In the previous video, we used a number of functions to manipulate our dataframe. Let’s take a look at the different types of functions and their potential pitfalls.

General functions

We have used the following general functions, which are quite similar to methods of pandas dataframes (a short sketch follows the list):

  • select(): returns a new DataFrame with the selected columns
  • filter(): filters rows using the given condition
  • where(): is just an alias for filter()
  • groupBy(): groups the DataFrame using the specified columns, so we can run aggregation on them
  • sort(): returns a new DataFrame sorted by the specified column(s). By default the second parameter 'ascending' is True.
  • dropDuplicates(): returns a new DataFrame with unique rows based on all or just a subset of columns
  • withColumn(): returns a new DataFrame by adding a column or replacing the existing column that has the same name. The first parameter is the name of the new column, the second is an expression of how to compute it.
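
Here is the sketch mentioned above, chaining several of these general functions on a small DataFrame with hypothetical userId, page, and ts columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc

spark = SparkSession.builder.appName("general_functions_sketch").getOrCreate()

# Hypothetical log data with userId, page, and ts (timestamp) columns.
df = spark.createDataFrame(
    [("u1", "Home", 1), ("u1", "NextSong", 2), ("u2", "Home", 3), ("u2", "Home", 3)],
    ["userId", "page", "ts"],
)

result = (
    df.select("userId", "page", "ts")
      .where(col("page") == "Home")               # where() is an alias for filter()
      .dropDuplicates()                           # drop identical rows
      .withColumn("ts_minutes", col("ts") / 60)   # add a derived column
      .groupBy("userId").count()                  # aggregate per user
      .sort(desc("count"))                        # sort by the aggregate
)
result.show()
```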

Aggregate functions

Spark SQL provides built-in methods for the most common aggregations, such as count(), countDistinct(), avg(), max(), min(), etc., in the pyspark.sql.functions module. These are not the same as the functions built into the Python Standard Library (which also has a min(), for example), so be careful not to use them interchangeably.

In many cases, there are multiple ways to express the same aggregation. For example, if we would like to compute one type of aggregate for one or more columns of the DataFrame, we can simply chain the aggregate method after a groupBy(). If we would like to use different functions on different columns, agg() comes in handy. For example, agg({"salary": "avg", "age": "max"}) computes the average salary and the maximum age.
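
A minimal sketch of both forms, using a hypothetical employees DataFrame with dept, salary, and age columns:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, max as spark_max

spark = SparkSession.builder.appName("agg_sketch").getOrCreate()

# Hypothetical employee data.
employees = spark.createDataFrame(
    [("sales", 3000, 25), ("sales", 4600, 41), ("hr", 4100, 35)],
    ["dept", "salary", "age"],
)

# Dictionary form: a different aggregate per column.
employees.groupBy("dept").agg({"salary": "avg", "age": "max"}).show()

# Equivalent column-function form; the alias avoids shadowing Python's built-in max().
employees.groupBy("dept").agg(avg("salary"), spark_max("age")).show()
```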

User defined functions (UDF)

In Spark SQL we can define our own functions with the udf method from the pyspark.sql.functions module. The default return type for UDFs is string. If we would like to return another type, we need to say so explicitly by using one of the types from the pyspark.sql.types module.
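
A minimal sketch, assuming a hypothetical ts column holding epoch milliseconds and a UDF that returns an integer hour instead of the default string:

```python
import datetime
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("udf_sketch").getOrCreate()

# Hypothetical timestamps in epoch milliseconds.
df = spark.createDataFrame([(1593907200000,), (1593993600000,)], ["ts"])

# Without an explicit return type the UDF would return strings; here we ask for integers.
get_hour = udf(lambda ts: datetime.datetime.fromtimestamp(ts / 1000.0).hour, IntegerType())

df.withColumn("hour", get_hour("ts")).show()
```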

Window functions

Window functions are a way of combining the values of ranges of rows in a DataFrame. When defining the window, we can choose how to group the rows (with the partitionBy method), how to sort them, and how wide a window we’d like to use (described by rangeBetween or rowsBetween).
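
A hedged sketch of a running total per user, assuming hypothetical userId, ts, and songs_played columns:

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import sum as spark_sum

spark = SparkSession.builder.appName("window_sketch").getOrCreate()

# Hypothetical per-session counts.
df = spark.createDataFrame(
    [("u1", 1, 2), ("u1", 2, 3), ("u2", 1, 5)],
    ["userId", "ts", "songs_played"],
)

# Partition by user, order by time, and include every row up to the current one.
window = (
    Window.partitionBy("userId")
          .orderBy("ts")
          .rangeBetween(Window.unboundedPreceding, 0)
)

df.withColumn("cumulative_songs", spark_sum("songs_played").over(window)).show()
```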

For further information see the Spark SQL, DataFrames and Datasets Guide and the Spark Python API Docs.

Spark SQL resources

Here are a few resources that you might find helpful when working with Spark SQL.

RDDs are a low-level abstraction of the data. In the first version of Spark, you worked directly with RDDs. You can think of RDDs as long lists distributed across various machines. You can still use RDDs as part of your Spark code although data frames and SQL are easier. This course won’t go into the details of RDD syntax, but you can find some further explanation of the difference between RDDs and DataFrames in Databricks’ A Tale of Three Apache Spark APIs: RDDs, DataFrames, and Datasets blog post.
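
A tiny sketch of the RDD API, using a made-up list of numbers, just to show the "distributed list" feel of RDDs:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd_sketch").getOrCreate()
sc = spark.sparkContext

# An RDD behaves like a long list distributed across the cluster.
numbers = sc.parallelize(range(1, 1001))
squares = numbers.map(lambda x: x * x)

print(squares.take(5))   # first few elements
print(squares.sum())     # distributed aggregation
```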

Here is a link to the Spark documentation’s RDD programming guide.

Lesson Overview

By the end of the lesson, you will be able to:

  • Distinguish between setting up a Spark cluster in Local mode and in Standalone mode
  • Set up Spark Cluster in AWS
  • Use Spark UI
  • Use AWS CLI
  • Create EMR using AWS CLI
  • Create EMR Cluster
  • Test Port Forwarding
  • Use Notebooks on your Spark Cluster
  • Write Spark Scripts
  • Store and Retrieve data on the Cloud
  • Read and Write to Amazon S3
  • Understand the distinction between HDFS and S3
  • Read and write data to HDFS

Overview of the Setup of a Spark Cluster

  1. Amazon S3 will store the dataset.
  2. We rent a cluster of machines, i.e., our Spark cluster, located in AWS data centers. We rent these machines through the AWS service called Elastic Compute Cloud (EC2).
  3. We log in from our local computer to this Spark cluster.
  4. Upon running our Spark code, the cluster will load the dataset from Amazon S3 into the cluster’s memory, distributed across each machine in the cluster (a minimal read sketch follows this list).
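
Here is the minimal read sketch referenced in step 4, with a hypothetical S3 bucket and key; it assumes the code runs on a cluster where the s3:// scheme is already configured (as it is on EMR):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3_read_sketch").getOrCreate()

# Hypothetical bucket and key for illustration only.
df = spark.read.json("s3://my-example-bucket/sparkify/sparkify_log_small.json")

df.printSchema()
print(df.count())
```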

New Terms:

  • Local mode: You are running a Spark program on a single machine, such as your laptop.
  • Standalone mode: You are defining Spark primary and secondary (worker) processes to work on your (virtual) machines. You can do this on EMR or on your own machine. Standalone mode uses Spark’s own built-in resource manager rather than an external one such as YARN or Mesos. (A minimal sketch of creating a session in each mode follows this list.)
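
Here is the minimal sketch referenced above: from the program’s point of view the main difference is the master URL handed to the SparkSession builder. The standalone master URL shown is hypothetical.

```python
from pyspark.sql import SparkSession

# Local mode: everything runs in a single JVM on your own machine.
local_spark = (
    SparkSession.builder
    .master("local[*]")              # use all local cores
    .appName("local_mode_sketch")
    .getOrCreate()
)

# Standalone mode: point the session at Spark's own cluster manager.
# "spark://primary-node:7077" is a hypothetical master URL for illustration.
# standalone_spark = (
#     SparkSession.builder
#     .master("spark://primary-node:7077")
#     .appName("standalone_mode_sketch")
#     .getOrCreate()
# )

print(local_spark.sparkContext.master)
```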

EC2 vs EMR

Both AWS EMR and AWS EC2 support distributed computing. They differ in a few ways:

  • Node categorization — EMR categorizes secondary nodes into core and task nodes, which means data can be lost if a core (data) node is removed; EC2 does not use node categorization.
  • HDFS support — EMR supports HDFS out of the box; EC2 supports it only if you configure HDFS yourself through a multi-step process.
  • Storage protocol — EMR uses the S3 protocol over AWS S3, which is faster than the s3a protocol; EC2 uses s3a.
  • Cost — EMR is a bit higher; EC2 is lower.

Circling back about HDFS

Previously we looked over the Hadoop ecosystem. To refresh those concepts, we have provided reference material here. HDFS (Hadoop Distributed File System) is the storage layer of that ecosystem; MapReduce processes the data stored in HDFS, with YARN acting as the resource manager.

Spark can replace the MapReduce algorithm. Since Spark does not have its own distributed storage system, it leverages HDFS, AWS S3, or any other distributed storage. Primarily in this course we will be using AWS S3, but let’s review the advantages of using HDFS over AWS S3.

What is HDFS?

HDFS (Hadoop Distributed File System) is the file system in the Hadoop ecosystem. Hadoop and Spark are two frameworks providing tools for carrying out big-data related tasks. While Spark is faster than Hadoop, Spark has one drawback: it lacks a distributed storage system. In other words, Spark has no system of its own to organize and store data files.

MapReduce System

In the Hadoop ecosystem, HDFS distributes files across the hard drives within the cluster, and MapReduce jobs (scheduled by the YARN resource manager) process that data where it is stored. Think of the MapReduce system as writing its results back to the hard drives after completing all of its tasks.

Spark, on the other hand, runs its operations and holds the data in RAM rather than on the hard drives used by HDFS. Since Spark lacks a distributed file system of its own to organize and store data files, Spark is often installed on top of Hadoop so that it can use the Hadoop Distributed File System (HDFS).

Why do you need an EMR Cluster?

Since a Spark cluster includes multiple machines, using Spark on each machine would normally require downloading and installing Spark and its dependencies by hand. Elastic MapReduce (EMR) is a service offered by AWS that spares you, the user, from this manual process of installing Spark and its dependencies on each machine.

Setting up AWS

Please refer to the latest AWS documentation to set up an EMR Cluster.
