Spark is a cluster computing framework for large-scale data processing.

  • It does not use MapReduce as its execution engine
  • Unlike MapReduce, which reloads data from disk for every job, Spark can keep large working datasets in memory between jobs
  • This makes it well suited to iterative algorithms and interactive analysis
  • Its DAG engine can take arbitrary pipelines of operators and translate them into a single job

Example
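A minimal sketch of a Spark job in Scala, a word count over a hypothetical text file (the paths, app name, and local master setting are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical word count: paths and the local master are placeholders
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val counts = sc.textFile("input.txt")        // load an external dataset as an RDD
      .flatMap(_.split(" "))                     // split each line into words
      .map(word => (word, 1))                    // build a pair RDD of (word, 1)
      .reduceByKey(_ + _)                        // aggregate counts per word

    counts.saveAsTextFile("output")              // write the results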

Resilient Distributed Datasets (RDDs) are read-only collections of objects; they are the main abstraction you work with in Spark.

Jobs run within an application; each interactive Spark session is an application. Jobs in the same application can access data cached by previous jobs.

Resilient Distributed Datasets

You can create an RDD in three ways (see the sketch after this list):

  1. Parallelizing an in-memory collection of objects
  2. Loading an external dataset (such as a file in HDFS)
  3. Transforming an existing RDD
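A sketch of the three creation routes in Scala, assuming sc is an existing SparkContext (as in spark-shell); the HDFS path is hypothetical:

    // 1. From an in-memory collection of objects
    val fromMemory = sc.parallelize(1 to 10)

    // 2. From an external dataset (hypothetical HDFS path)
    val fromFile = sc.textFile("hdfs://namenode/data/input.txt")

    // 3. By transforming an existing RDD
    val doubled = fromMemory.map(_ * 2)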

You can aggregate pair RDDs by key with reduceByKey(), foldByKey(), and aggregateByKey().
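A small sketch of the three operations, again assuming an existing SparkContext sc; the data is made up:

    val pairs = sc.parallelize(Seq(("a", 3), ("a", 1), ("b", 7)))

    // reduceByKey: combine values per key with an associative function
    val sums = pairs.reduceByKey(_ + _)                      // ("a", 4), ("b", 7)

    // foldByKey: like reduceByKey, but with an explicit zero value
    val folded = pairs.foldByKey(0)(_ + _)

    // aggregateByKey: the accumulator can have a different type than the values
    val valueSets = pairs.aggregateByKey(Set.empty[Int])(_ + _, _ ++ _)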

You can cache RDDs by calling cache()!

  • You can also choose the storage level explicitly with persist(), as in the sketch below
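A sketch of both calls, with hypothetical input paths and an existing SparkContext sc:

    import org.apache.spark.storage.StorageLevel

    val lengths = sc.textFile("input.txt").map(_.length)
    lengths.cache()                                   // shorthand for persist(MEMORY_ONLY)

    val other = sc.textFile("other.txt").map(_.length)
    other.persist(StorageLevel.MEMORY_AND_DISK_SER)   // choose the storage level explicitly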

By default, data is serialized using Java serialization.

  • Your classes just need to implement Serializable
  • Kryo serialization is also available and is usually faster and more compact (see the sketch below)
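A sketch of switching to Kryo through SparkConf; MyRecord is a hypothetical class, and registering classes is optional but makes the serialized output more compact:

    import org.apache.spark.SparkConf

    case class MyRecord(id: Long, name: String)        // hypothetical class to register

    val conf = new SparkConf()
      .setAppName("KryoExample")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .registerKryoClasses(Array(classOf[MyRecord]))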

Shared Variables

How does a Spark program access data that lives outside of RDDs?

We can use broadcast variables: variables that are sent to each executor once and cached there, so they are available to tasks across jobs.
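A sketch with a made-up lookup table, assuming an existing SparkContext sc:

    // Broadcast a small lookup table once; executors cache it locally
    val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

    val ids = sc.parallelize(Seq("a", "b", "a", "c"))
    val resolved = ids.map(id => lookup.value.getOrElse(id, -1))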

We can also use accumulators: shared variables that tasks can only add to, much like counters. The driver can read an accumulator's final value once the job has completed.
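A sketch using the Spark 2.x accumulator API; the input path and the notion of a "bad" record are made up:

    // Count empty lines as a side effect of a transformation
    val badRecords = sc.longAccumulator("badRecords")

    val nonEmpty = sc.textFile("input.txt").filter { line =>
      if (line.isEmpty) badRecords.add(1)
      line.nonEmpty
    }

    nonEmpty.count()                  // run an action so tasks actually execute
    println(badRecords.value)         // the driver reads the final value afterwards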

Anatomy of a Spark Job Run

The picture is fairly simple: a driver hosts the application (the SparkContext) and schedules work, while executors run the application's tasks.

Tasks are either shuffle map tasks, whose partitioned output is consumed by a later stage, or result tasks, which run in the final stage and send their results back to the driver.

Executors and Cluster Managers

Cluster managers manage the lifecycle of executors; the master URL selects which one is used, as sketched after this list.

  • Local: everything runs in a single JVM alongside the driver
  • Standalone: a simple manager that runs a single Spark master and one or more workers
  • Mesos: a general-purpose cluster resource manager
  • YARN: Hadoop's resource manager
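A sketch of selecting a cluster manager via the master URL in Scala; host names and ports are illustrative:

    import org.apache.spark.SparkConf

    // The master URL selects the cluster manager (hosts and ports are illustrative)
    new SparkConf().setMaster("local[*]")                  // local: single JVM
    new SparkConf().setMaster("spark://master-host:7077")  // standalone Spark master
    new SparkConf().setMaster("mesos://mesos-host:5050")   // Mesos
    new SparkConf().setMaster("yarn")                      // YARN (uses the Hadoop configuration)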