How to master building a big data pipeline with Spark

Tags
Big Data
Data Engineering
ETL
Spark
Apache Spark
Data Pipeline
Batch Processing
Real-time Processing
Published
May 13, 2022

Introduction

  • We focus on how to build data engineering pipelines with Apache Spark

Prerequisites

  • Apache Spark
    • Basic
    • Structured streaming
    • SQL
  • Java concepts and operations
  • Data stores: Kafka, MariaDB, Redis
  • Docker

Exercises

  • Start Docker Desktop
  • Start Apache Spark, MariaDB, Kafka, and Redis as defined in spark-docker.yml
#cd "/home/chinhuy/learn/Ex_Files_Apache_Spark_EssT_Big_Data_Eng/Exercise Files" #initial start docker-compose -f spark-docker.yml up -d #start services docker-compose -f spark-docker.yml start #stop services docker-compose -f spark-docker.yml stop #cd /home/chinhuy/tools/idea-IC-221.5787.30 #./bin/idea.sh

Data Engineering Concepts

What is data engineering?

  • Data engineering designs and builds systems that collect and analyze data to deliver insights and actions
  • Essential for large-scale data processing with low latency
  • New roles: data engineer, big data architect
  • Prerequisite for machine learning

Data engineering vs Data Analytics vs Data Science

| Function | Data Engineering | Business Analytics | Data Science |
| --- | --- | --- | --- |
| Integrate Data Source | yes | | yes |
| Build Data Pipeline | yes | | yes |
| Process and Transform Data | yes | | yes |
| Store Data | yes | | yes |
| Dashboards and Reports | | yes | yes |
| Exploratory Analytics | | yes | yes |
| Statistical Modeling | | yes | yes |
| Machine Learning | | | yes |
| Business Recommendations | | yes | yes |
| Business Actions | yes | yes | yes |

Data Engineering functions

Data Acquisition
flowchart TD
    classDef blue fill:#66deff,stroke:#000,color:#000
    S1[Acquisition]:::blue --> S2(Transport)
    S2 --> S3(Storage)
    S3 --> S4(Processing)
    S4 --> S5(Serving)
  • Format of data
  • Interfaces
  • Security
  • Reliability
Data Transport
flowchart TD
    classDef blue fill:#66deff,stroke:#000,color:#000
    S1[Acquisition] --> S2(Transport):::blue
    S2 --> S3(Storage)
    S3 --> S4(Processing)
    S4 --> S5(Serving)
  • Reliability and integrity
  • Scale
  • Latency
  • Cost
Data Storage
flowchart TD
    classDef blue fill:#66deff,stroke:#000,color:#000
    S1[Acquisition] --> S2(Transport)
    S2 --> S3(Storage):::blue
    S3 --> S4(Processing)
    S4 --> S5(Serving)
  • Flexibility
  • Schema design
  • Query patterns
  • High availability
Data Processing
flowchart TD
    classDef blue fill:#66deff,stroke:#000,color:#000
    S1[Acquisition] --> S2(Transport)
    S2 --> S3(Storage)
    S3 --> S4(Processing):::blue
    S4 --> S5(Serving)
 
  • Cleansing
  • Filtering
  • Enriching
  • Aggregating
  • Machine Learning
 
 
Data Serving
flowchart TD
    classDef blue fill:#66deff,stroke:#000,color:#000
    S1[Acquisition] --> S2(Transport)
    S2 --> S3(Storage)
    S3 --> S4(Processing)
    S4 --> S5(Serving):::blue
  • Pull vs push
  • Latency and high availability
  • Security
  • Skill levels
  • Flexibility of schema
 

Batch vs Real-time processing

Batch processing
  • Processing data in batches, with defined size and windows
  • Source data does not change during processing
  • Bounded datasets
  • High latency, creating output at the end of processing
  • Easy reproduction of processing and outputs
Real-time processing
  • Processes data as it is created at the source
  • Dynamic source data may change during processing: add, modify, or delete data
  • Unbounded datasets
  • Low latency requirements
  • State management

Data Engineering with Spark

  • Spark is arguably the best technology for data engineering
  • It can do both batch and real-time
  • Native support for several popular data sources
  • Advanced parallel-processing capabilities
    • MapReduce, windowing, state management, joins
  • Graph and machine learning support

Spark Capabilities for ETL

Spark architecture review

flowchart LR
    classDef blue fill:#66deff,stroke:#000,color:#000
    db1[(Source DB)] --> data1[Data 1]:::blue
    subgraph "Driver Node"
        direction LR
        data1 -.- data2[Data 2]:::blue
    end
    data2[Data 2] --> db2[(Destination DB)]
    subgraph "Executor Node 1"
        direction TB
        data1 -- Partition --> rdd11[RDD 11]:::blue
        rdd11 -- Transform --> rdd12[RDD 12]:::blue
        rdd12 -- Action --> rdd13[RDD 13]:::blue
    end
    rdd13 -- Collect --> data2
    subgraph "Executor Node 2"
        direction TB
        data1 -- Partition --> rdd21[RDD 21]:::blue
        rdd21 -- Transform --> rdd22[RDD 22]:::blue
        rdd22 -- Action --> rdd23[RDD 23]:::blue
    end
    rdd23 -- Collect --> data2
    rdd12 -- Shuffle --> rdd23
  1. The driver node reads Data 1
  2. Data 1 is broken into partitions (RDDs) and distributed to executor nodes
  3. Transform operations happen locally on each executor
  4. An action triggers execution; wide operations cause shuffles between executors
  5. The driver node collects the results back (see the sketch below)
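A minimal Java sketch of this flow, using a hypothetical in-memory list in place of a source database: the driver partitions the data, a narrow transformation runs locally on each partition, a wide operation (reduceByKey) causes a shuffle, and the action collects the results back to the driver.

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class RddFlowSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("RddFlowSketch").master("local[2]").getOrCreate();
        JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

        // Driver side: "Data 1" is partitioned across executors (2 partitions here)
        JavaRDD<Integer> data1 = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 2);

        // Narrow transformation: runs locally on each partition, no shuffle
        JavaPairRDD<String, Integer> tagged =
                data1.mapToPair(n -> new Tuple2<>(n % 2 == 0 ? "even" : "odd", n));

        // Wide operation: reduceByKey shuffles values with the same key together
        JavaPairRDD<String, Integer> sums = tagged.reduceByKey(Integer::sum);

        // Action: triggers execution and collects "Data 2" back to the driver
        sums.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));

        spark.stop();
    }
}
```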

Parallel processing with Spark

  • Processing data includes multiple activities
    • Reading data from sources
    • Transformations, filters, and checks
    • Aggregations
    • Writing data to sinks
  • Parallelism is needed in all activities
  • Non-parallelizable steps become bottlenecks
Reading Data from Sources
  • Spark supports out of the box parallel reads for various sources
    • JDBC - partitioning reads by column values
    • Kafka - each executor reads from subset of partitions
  • Custom parallelism if needed
  • Predicate pushdown optimizes the data read into memory (see the sketch below)
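A minimal sketch of a partitioned JDBC read with predicate pushdown, assuming a hypothetical warehouse_stock table on a local MariaDB instance; the connection details, column names, and bounds are illustrative, and the MariaDB JDBC driver must be on the classpath.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParallelJdbcRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ParallelJdbcRead").master("local[*]").getOrCreate();

        // Partitioned read: Spark issues one query per partition over the id range,
        // so executors read slices of the table in parallel
        Dataset<Row> stock = spark.read()
                .format("jdbc")
                .option("url", "jdbc:mariadb://localhost:3306/warehouse_stock_db")
                .option("driver", "org.mariadb.jdbc.Driver")
                .option("dbtable", "warehouse_stock")
                .option("user", "spark")
                .option("password", "spark")
                .option("partitionColumn", "id")   // numeric column to split on
                .option("lowerBound", "1")
                .option("upperBound", "100000")
                .option("numPartitions", "4")
                .load();

        // This simple filter is pushed down into the generated SQL (predicate
        // pushdown), so only matching rows are read into memory
        Dataset<Row> today = stock.filter("stock_date = '2022-05-13'");

        today.show(10);
        spark.stop();
    }
}
```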
Transform data
  • Transformations are parallelized as much as possible by Spark
  • Design best practices can help optimize data processing
    • Perform map()-type operations early in the cycle
    • Reduce shuffling for reduce()-type operations
    • Repartition when needed (see the sketch below)
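A sketch of these practices on a hypothetical stock dataset (the file path and column names are assumed): filter and project early, optionally repartition by the grouping key, and aggregate once.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TransformBestPractices {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("TransformBestPractices").master("local[*]").getOrCreate();

        // Hypothetical stock records read from CSV files
        Dataset<Row> stock = spark.read()
                .option("header", "true").option("inferSchema", "true")
                .csv("data/warehouse_stock.csv");

        Dataset<Row> aggregated = stock
                // 1. Filter and project early: less data flows through later stages
                .filter(col("quantity").gt(0))
                .select("item", "warehouse", "quantity")
                // 2. Optionally repartition by the grouping key to control
                //    partitioning (e.g., to address skew) before aggregating
                .repartition(col("item"))
                // 3. A single aggregation pass instead of repeated reduce-style passes
                .groupBy("item")
                .agg(sum("quantity").alias("total_quantity"));

        aggregated.show();
        spark.stop();
    }
}
```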
Writing Data to Sinks
  • Spark supports out-of-the-box parallel writes for various sinks
    • JDBC - concurrent appends from executors
    • Kafka - concurrent writes to topics
  • Custom parallelism if needed
  • Batching capabilities to optimize writes (see the sketch below)
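A sketch of a parallel JDBC write, assuming a hypothetical global_stock table and connection details: each partition appends over its own connection concurrently, and batchsize controls how many rows go into each insert batch.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class ParallelJdbcWrite {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ParallelJdbcWrite").master("local[*]").getOrCreate();

        // Hypothetical pre-aggregated data to be loaded into the sink
        Dataset<Row> globalStock = spark.read().parquet("data/global_stock");

        // Each partition opens its own JDBC connection and appends concurrently;
        // batchsize sets how many rows are sent per INSERT batch
        globalStock.write()
                .format("jdbc")
                .option("url", "jdbc:mariadb://localhost:3306/global_stock_db")
                .option("dbtable", "global_stock")
                .option("user", "spark")
                .option("password", "spark")
                .option("batchsize", "1000")
                .mode(SaveMode.Append)
                .save();

        spark.stop();
    }
}
```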

Spark execution plan

  • Lazy execution - only an action triggers execution
  • Spark optimizer comes up with a physical plan
  • The physical plan optimizes for
    • Reduced I/O
    • Reduced shuffling
    • Reduced memory usage
Spark executors can read and write directly from external sources when they support parallel I/O (HDFS, Kafka, JDBC,…)
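A small example of inspecting the plan with explain(): nothing runs until the action, and the printed physical plan shows the choices the optimizer made (the file path and columns are hypothetical).

```java
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ExplainPlanSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ExplainPlanSketch").master("local[*]").getOrCreate();

        Dataset<Row> visits = spark.read()
                .option("header", "true").option("inferSchema", "true")
                .csv("data/visits.csv");

        // Nothing has executed yet: these are lazy transformations
        Dataset<Row> byCountry = visits
                .filter(col("duration").gt(0))
                .groupBy("country")
                .agg(avg("duration").alias("avg_duration"));

        // Print the logical and physical plans chosen by the optimizer
        byCountry.explain(true);

        // Only this action actually triggers execution
        byCountry.show();
        spark.stop();
    }
}
```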

Stateful stream processing

Checkpoint
  • Save job state to a persistent location such as HDFS, S3
  • Recover job state when failures happen or the job is restarted
  • Save metadata and RDDs
    • Kafka offsets
    • State tracking by keys
    • RDDs requiring transformation across multiple batches
Watermarks
  • Used for event-time window operations when late data needs to be handled
  • Spark waits until all events for a given time window are expected to have arrived
  • Spark keeps track of events and their ordering (see the sketch below)
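A minimal watermarking sketch using Spark's built-in rate source in place of a real event stream (the checkpoint path is illustrative): late events are accepted for up to 30 seconds, after which a window's state is finalized and dropped; the checkpoint location persists offsets and window state for recovery.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class WatermarkSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("WatermarkSketch").master("local[*]").getOrCreate();

        // Built-in "rate" source: generates rows with a "timestamp" column,
        // standing in here for an event stream with event times
        Dataset<Row> events = spark.readStream()
                .format("rate")
                .option("rowsPerSecond", "5")
                .load();

        // Watermark: accept events up to 30 seconds late; older windows are
        // finalized and their state is dropped
        Dataset<Row> windowedCounts = events
                .withWatermark("timestamp", "30 seconds")
                .groupBy(window(col("timestamp"), "5 seconds"))
                .count();

        // Checkpointing persists offsets and window state so the job can
        // recover after a failure or restart
        StreamingQuery query = windowedCounts.writeStream()
                .outputMode("update")
                .format("console")
                .option("checkpointLocation", "/tmp/checkpoints/watermark-sketch")
                .start();

        query.awaitTermination();
    }
}
```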
State tracking by Keys
  • Tracks current state based on specified keys
  • Change the state of the key as new significant events arrive
  • UpdateStateByKey operation
  • New states are published as a DStream for downstream consumption (see the sketch below)
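A minimal updateStateByKey sketch with the DStream API, assuming a hypothetical socket source that emits one country name per line; the running per-key totals come back as a new DStream for downstream consumption.

```java
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class CountryCounterSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setAppName("CountryCounterSketch").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // updateStateByKey needs a checkpoint directory for its state RDDs
        ssc.checkpoint("/tmp/checkpoints/country-counters");

        // Hypothetical source: one country name per line on a local socket
        JavaReceiverInputDStream<String> countries = ssc.socketTextStream("localhost", 9999);

        JavaPairDStream<String, Integer> ones =
                countries.mapToPair(country -> new Tuple2<>(country, 1));

        // Keep a running visit counter per country across batches
        JavaPairDStream<String, Integer> runningCounts = ones.updateStateByKey(
                (List<Integer> newValues, Optional<Integer> state) -> {
                    int total = state.isPresent() ? state.get() : 0;
                    for (Integer v : newValues) {
                        total += v;
                    }
                    return Optional.of(total);
                });

        // The updated state is itself a DStream that downstream stages can consume
        runningCounts.print();

        ssc.start();
        ssc.awaitTermination();
    }
}
```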

Spark analytics and ML

Spark analytics
  • Spark SQL provides a SQL interface for computations
  • Batch and streaming use cases
  • Simple to use, yet powerful
  • Process and analyze data in the same pipeline
  • Cascading analytics
  • Publish analytics results to databases or streams (see the sketch below)
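A small Spark SQL sketch over a hypothetical visits dataset (path and column names assumed): register a view, query it with SQL, and cascade a second query over the first result.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlAnalytics {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SparkSqlAnalytics").master("local[*]").getOrCreate();

        Dataset<Row> visits = spark.read()
                .option("header", "true").option("inferSchema", "true")
                .csv("data/visits.csv");

        // Register the Dataset as a view and analyze it with plain SQL
        visits.createOrReplaceTempView("visits");

        Dataset<Row> stats = spark.sql(
                "SELECT country, last_action, COUNT(*) AS visit_count, AVG(duration) AS avg_duration "
              + "FROM visits GROUP BY country, last_action ORDER BY visit_count DESC");

        // Cascading analytics: results can feed another query or be published to a sink
        stats.createOrReplaceTempView("visit_stats");
        spark.sql("SELECT country, SUM(visit_count) AS total_visits "
                + "FROM visit_stats GROUP BY country")
             .show();

        spark.stop();
    }
}
```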
Spark ML
  • Spark ML allows preprocessing of data for ML
    • Feature extraction, transformation, dimensionality reduction
  • ML algorithms - training and inference
  • Pipelines to stitch data extraction, preprocessing, and inference into a single flow
  • Integrated data engineering, analytics, and ML (see the sketch below)
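A minimal Spark ML pipeline sketch on hypothetical labeled visit data (file path and columns assumed): feature preprocessing and a classifier are stitched into one pipeline used for both training and inference.

```java
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MlPipelineSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("MlPipelineSketch").master("local[*]").getOrCreate();

        // Hypothetical training data: visit duration, page count, and whether
        // the visit ended in a purchase (the label)
        Dataset<Row> visits = spark.read()
                .option("header", "true").option("inferSchema", "true")
                .csv("data/labeled_visits.csv");

        // Preprocessing stages: index the label, assemble numeric features
        StringIndexer labelIndexer = new StringIndexer()
                .setInputCol("purchased").setOutputCol("label");
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"duration", "page_count"})
                .setOutputCol("features");
        LogisticRegression lr = new LogisticRegression();

        // A single pipeline stitches preprocessing and training together
        Pipeline pipeline = new Pipeline()
                .setStages(new PipelineStage[]{labelIndexer, assembler, lr});

        PipelineModel model = pipeline.fit(visits);

        // Inference on the same (or new) data
        model.transform(visits).select("duration", "page_count", "prediction").show();

        spark.stop();
    }
}
```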

Batch Processing Pipelines

Problem statement

Stock Aggregation: Business scenarios
  • An enterprise has warehouses across the globe
  • Each warehouse has a local data center
  • A stock management application runs in each warehouse
  • A local MariaDB database keeps track of warehouse stock
  • Stock maintained by item and day - opening stock, receipts, and issues
Stock Aggregation: Requirement
  • Create and manage a central, consolidated stock database
  • Item stock aggregated across locations on a daily basis
  • Batch processing to upload warehouse data into a central cloud and manage stock
  • Scalable to hundreds of warehouses
Stock Aggregation: Design
flowchart LR
    classDef blue fill:#66deff,stroke:#000,color:#000
    classDef brown fill:#C3C3C3,stroke:#000,color:#000
    subgraph warehouse [ ]
        direction TB
        subgraph wLD [Warehouse: London]
            direction LR
            whLD[(warehouse_stock)]:::blue --> uploaderJ1(Stock Uploader Job):::brown
        end
        subgraph wNY [Warehouse: New York]
            direction LR
            whNY[(warehouse_stock)]:::blue --> uploaderJ2(Stock Uploader Job):::brown
        end
        subgraph wLA [Warehouse: Los Angeles]
            direction LR
            whLA[(warehouse_stock)]:::blue --> uploaderJ3(Stock Uploader Job):::brown
        end
    end
    uploaderJ1 --> dfs
    uploaderJ2 --> dfs
    uploaderJ3 --> dfs
    subgraph dataCenter [Central Data Center]
        direction LR
        dfs[(DFS)]:::blue --> uploaderJ4(Stock Aggregation Job):::brown
        uploaderJ4 --> stock[(global stock)]:::blue
    end

Setting up the local DB

Upload stock to a central store

Aggregating stock across warehouses
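Not the course's exercise code, but a minimal sketch of such an aggregation job under assumed column names, paths, and connection details: read the files uploaded by the warehouses into the central DFS, consolidate stock by item and day, and write the result to the global stock database.

```java
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class StockAggregationJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("StockAggregationJob").master("local[*]").getOrCreate();

        // Files uploaded by each warehouse's Stock Uploader Job into the central DFS
        Dataset<Row> warehouseStock = spark.read()
                .option("header", "true").option("inferSchema", "true")
                .csv("dfs/warehouse_stock/*.csv");

        // Consolidate stock by item and day across all warehouses
        Dataset<Row> globalStock = warehouseStock
                .groupBy("stock_date", "item")
                .agg(sum("opening_stock").alias("opening_stock"),
                     sum("receipts").alias("receipts"),
                     sum("issues").alias("issues"));

        // Write the consolidated view to the central global stock database
        globalStock.write()
                .format("jdbc")
                .option("url", "jdbc:mariadb://localhost:3306/global_stock_db")
                .option("dbtable", "global_stock")
                .option("user", "spark")
                .option("password", "spark")
                .mode(SaveMode.Append)
                .save();

        spark.stop();
    }
}
```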

Real-time Processing Pipelines

Real-time use case: Problem

Website Analytics: Business scenario
  • An enterprise runs an e-commerce website (like Amazon) selling multiple items
  • Used across multiple countries
  • Enterprise wants to track visits by users in real-time
    • Visit date
    • Country
    • Duration of the visit
    • Last action (catalog, FAQ, shopping cart, order)
Website Analytics: Requirements
  • Receive a stream of website visit records
    • Visit date, last action, country, duration
  • Compute in real-time
    • Five-second summaries by the last action
    • Running visit counters by country
    • Abandoned shopping carts (to a separate queue)
Website Analytics Design
flowchart LR
    classDef blue fill:#66deff,stroke:#000,color:#000
    classDef brown fill:#C3C3C3,stroke:#000,color:#000
    spark.streaming.website.visits -.- kafka1
    Eapp(Ecommerce App):::blue --> kafka1((Kafka)):::brown
    kafka1 --> job1(Website Analytics Job):::blue
    spark.streaming.carts.abandoned -.- kafka2
    job1 --> kafka2((Kafka)):::brown
    country-stats -.- redis
    job1 --> redis[(Redis)]:::blue
    website_stats.visit_stats -.- db
    job1 --> db[(MariaDB)]:::blue

Generating a visits data stream

Building a website analytics job
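Again a hedged sketch rather than the course's exercise code, assuming a JSON record layout, a local Kafka broker, and the spark-sql-kafka package on the classpath: it computes five-second summaries by last action and routes shopping-cart visits to the abandoned-carts topic (the Redis and MariaDB sinks from the design are omitted here).

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;
import static org.apache.spark.sql.functions.struct;
import static org.apache.spark.sql.functions.to_json;
import static org.apache.spark.sql.functions.window;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class WebsiteAnalyticsJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("WebsiteAnalyticsJob").master("local[*]").getOrCreate();

        // Assumed JSON layout of a visit record
        StructType schema = new StructType()
                .add("visitDate", DataTypes.TimestampType)
                .add("country", DataTypes.StringType)
                .add("duration", DataTypes.IntegerType)
                .add("lastAction", DataTypes.StringType);

        // Visit records published by the e-commerce app
        Dataset<Row> visits = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "spark.streaming.website.visits")
                .option("startingOffsets", "latest")
                .load()
                .select(from_json(col("value").cast("string"), schema).alias("visit"))
                .select("visit.*");

        // Five-second summaries by last action
        Dataset<Row> summaries = visits
                .withWatermark("visitDate", "1 minute")
                .groupBy(window(col("visitDate"), "5 seconds"), col("lastAction"))
                .count();

        summaries.writeStream()
                .outputMode("update")
                .format("console")
                .option("checkpointLocation", "/tmp/checkpoints/visit-summaries")
                .start();

        // Abandoned shopping carts routed to a separate Kafka topic
        Dataset<Row> abandonedCarts = visits
                .filter(col("lastAction").equalTo("ShoppingCart"))
                .select(to_json(struct(col("*"))).alias("value"));

        abandonedCarts.writeStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("topic", "spark.streaming.carts.abandoned")
                .option("checkpointLocation", "/tmp/checkpoints/carts-abandoned")
                .start();

        spark.streams().awaitAnyTermination();
    }
}
```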

Executing the real-time pipeline

Data Engineering Spark: Best Practices

Batch vs. real-time options

Is Real-time really needed?
  • Tendency to build all pipelines in real-time
    • It will be super fast
    • It’s cool
  • Real-time comes with significant design complexities
    • Unbounded datasets
    • Windowing, watermarks, and missed events
    • State management
    • Reprocessing
When to choose real-time
  • Real-time responses or actions needed (few seconds of latency)
  • Input data is a continuous stream of events
  • Processing involves minimal aggregation and analytics
  • Compute resources are sufficiently available
  • Hybrid pipelines are possible

Scaling extraction and loading operations

Scaling Extraction
  • Spark supports parallel extraction from various data sources
  • Analyze available options for the given source
  • Choose source technology and build the source schema to suit parallel operations
  • Push down filtering operations to data sources
  • Build defenses against missed records and duplication
Scaling Loading
  • Each Spark executor can write independently to a data sink
  • Analyze how Spark works with a specific data sink technology (transactions, locking, serialization,…)
  • Limit the number of parallel connections to a sink based on its capacity (repartition if needed; see the sketch below)
  • Build defenses against missed and duplicate updates
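A small sketch of capping sink parallelism, with hypothetical paths and connection details: coalesce reduces the number of partitions, and therefore the number of concurrent JDBC connections opened against the sink.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class LimitSinkConnections {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("LimitSinkConnections").master("local[*]").getOrCreate();

        Dataset<Row> results = spark.read().parquet("data/aggregated_results");

        // A dataset with hundreds of partitions would open one JDBC connection per
        // partition; coalesce caps the number of concurrent writers to the sink
        results.coalesce(8)
                .write()
                .format("jdbc")
                .option("url", "jdbc:mariadb://localhost:3306/global_stock_db")
                .option("dbtable", "aggregated_results")
                .option("user", "spark")
                .option("password", "spark")
                .mode(SaveMode.Append)
                .save();

        spark.stop();
    }
}
```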

Scaling processing operations

  • Filter unwanted data early in the pipeline
  • Minimize reduce operations and associated shuffling
  • Repartition when needed
  • Cache results when appropriate
  • Avoid actions that send data back to the driver program until the end (see the sketch below)
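A short sketch of the last two points on a hypothetical visits dataset (paths and columns assumed): cache a filtered dataset that feeds several aggregations, and write the results to a sink instead of collecting them to the driver.

```java
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class CacheAndSinkSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CacheAndSinkSketch").master("local[*]").getOrCreate();

        // Filter unwanted data as early as possible
        Dataset<Row> visits = spark.read()
                .option("header", "true").option("inferSchema", "true")
                .csv("data/visits.csv")
                .filter(col("duration").gt(0));

        // Cache a dataset that feeds several downstream aggregations
        visits.persist(StorageLevel.MEMORY_AND_DISK());

        Dataset<Row> byCountry = visits.groupBy("country").count();
        Dataset<Row> byAction = visits.groupBy("last_action")
                .agg(avg("duration").alias("avg_duration"));

        // Write results to a sink instead of collecting them back to the driver
        byCountry.write().mode("overwrite").parquet("output/visits_by_country");
        byAction.write().mode("overwrite").parquet("output/avg_duration_by_action");

        visits.unpersist();
        spark.stop();
    }
}
```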

Building resiliency

  • Spark provides several resiliency capabilities against failures (task, stage, executor)
  • Monitor jobs and Spark clusters for failures
  • Use checkpoints for streaming jobs with persistent stores
  • Build resiliency in associated input stores and output sinks
  • Compute infrastructure resiliency is needed too
