How to master building a big data pipeline with Spark

Tags
Big Data
Data Engineering
ETL
Spark
Apache Spark
Data Pipeline
Batch Processing
Real-time Processing
Published
May 13, 2022

Introduction

  • We focus on how to build data engineering pipelines with Apache Spark

Prerequisites

  • Apache Spark
    • Basic
    • Structured streaming
    • SQL
  • Java concepts and operations
  • Data stores: Kafka, MariaDB, Redis
  • Docker

Exercises

  • Start Docker Desktop
  • Start Apache Spark, MariaDB, Kafka, and Redis as defined in spark-docker.yml
#cd "/home/chinhuy/learn/Ex_Files_Apache_Spark_EssT_Big_Data_Eng/Exercise Files" #initial start docker-compose -f spark-docker.yml up -d #start services docker-compose -f spark-docker.yml start #stop services docker-compose -f spark-docker.yml stop #cd /home/chinhuy/tools/idea-IC-221.5787.30 #./bin/idea.sh

Data Engineering Concepts

What is data engineering?

  • Data engineering designs and builds systems that collect and analyze data to deliver insights and actions
  • Essential for large-scale data processing with low latency
  • New roles: data engineer, big data architect
  • Prerequisite for machine learning

Data engineering vs Data Analytics vs Data Science

| Function | Data Engineering | Business Analytics | Data Science |
| --- | --- | --- | --- |
| Integrate Data Source | yes | | yes |
| Build Data Pipeline | yes | | yes |
| Process and Transform Data | yes | | yes |
| Store Data | yes | | yes |
| Dashboards and Reports | | yes | yes |
| Exploratory Analytics | | yes | yes |
| Statistical Modeling | | yes | yes |
| Machine Learning | | | yes |
| Business Recommendations | | yes | yes |
| Business Actions | yes | yes | yes |

Data Engineering functions

Data Acquisition
flowchart TD
    classDef blue fill:#66deff,stroke:#000,color:#000
    S1[Acquisition]:::blue --> S2(Transport)
    S2 --> S3(Storage)
    S3 --> S4(Processing)
    S4 --> S5(Serving)
  • Format of data
  • Interfaces
  • Security
  • Reliability
Data Transport
flowchart TD
    classDef blue fill:#66deff,stroke:#000,color:#000
    S1[Acquisition] --> S2(Transport):::blue
    S2 --> S3(Storage)
    S3 --> S4(Processing)
    S4 --> S5(Serving)
  • Reliability and integrity
  • Scale
  • Latency
  • Cost
Data Storage
flowchart TD
    classDef blue fill:#66deff,stroke:#000,color:#000
    S1[Acquisition] --> S2(Transport)
    S2 --> S3(Storage):::blue
    S3 --> S4(Processing)
    S4 --> S5(Serving)
  • Flexibility
  • Schema design
  • Query patterns
  • High availability
Data Processing
flowchart TD
    classDef blue fill:#66deff,stroke:#000,color:#000
    S1[Acquisition] --> S2(Transport)
    S2 --> S3(Storage)
    S3 --> S4(Processing):::blue
    S4 --> S5(Serving)
 
  • Cleansing
  • Filtering
  • Enriching
  • Aggregating
  • Machine Learning
 
 
Data Serving
flowchart TD
    classDef blue fill:#66deff,stroke:#000,color:#000
    S1[Acquisition] --> S2(Transport)
    S2 --> S3(Storage)
    S3 --> S4(Processing)
    S4 --> S5(Serving):::blue
  • Pull vs push
  • Latency and high availability
  • Security
  • Skill levels
  • Flexibility of schema
 

Batch vs Real-time processing

Batch processing
  • Processing data in batches, with defined size and windows
  • Source data does not change during processing
  • Bounded datasets
  • High latency, creating output at the end of processing
  • Easy reproduction of processing and outputs
Real-time processing
  • Processes data as it is created at the source
  • Dynamic source data may change during processing: add, modify, or delete data
  • Unbounded datasets
  • Low latency requirements
  • State management

Data Engineering with Spark

  • Spark is arguably the best technology for data engineering
  • It can do both batch and real-time
  • Native support for several popular data sources
  • Advanced parallel-processing capabilities
    • MapReduce, windowing, state management, joins
  • Graph and machine learning support

Spark Capabilities for ETL

Spark architecture review

flowchart LR
    classDef blue fill:#66deff,stroke:#000,color:#000
    db1[(Source DB)] --> data1[Data 1]:::blue
    subgraph "Driver Node"
        direction LR
        data1 -.- data2[Data 2]:::blue
    end
    data2[Data 2] --> db2[(Destination DB)]
    subgraph "Executor Node 1"
        direction TB
        data1 -- Partition --> rdd11[RDD 11]:::blue
        rdd11 -- Transform --> rdd12[RDD 12]:::blue
        rdd12 -- Action --> rdd13[RDD 13]:::blue
    end
    rdd13 -- Collect --> data2
    subgraph "Executor Node 2"
        direction TB
        data1 -- Partition --> rdd21[RDD 21]:::blue
        rdd21 -- Transform --> rdd22[RDD 22]:::blue
        rdd22 -- Action --> rdd23[RDD 23]:::blue
    end
    rdd23 -- Collect --> data2
    rdd12 -- Shuffle --> rdd23
  1. The driver node reads Data 1
  2. Data 1 is broken into partitions (RDDs) and distributed to executor nodes
  3. Transform operations happen locally on each executor
  4. An action triggers execution; wide operations cause shuffles between executors
  5. The driver node collects the results back (see the sketch below)
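A minimal Java sketch of this flow, using a hypothetical in-memory list in place of a source database: the driver partitions the data, a narrow transformation runs locally on each partition, a wide operation (reduceByKey) causes a shuffle, and the action collects the results back to the driver.

```java
import java.util.Arrays;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;
import scala.Tuple2;

public class RddFlowSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("RddFlowSketch").master("local[2]").getOrCreate();
        JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

        // Driver side: "Data 1" is partitioned across executors (2 partitions here)
        JavaRDD<Integer> data1 = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 2);

        // Narrow transformation: runs locally on each partition, no shuffle
        JavaPairRDD<String, Integer> tagged =
                data1.mapToPair(n -> new Tuple2<>(n % 2 == 0 ? "even" : "odd", n));

        // Wide operation: reduceByKey shuffles values with the same key together
        JavaPairRDD<String, Integer> sums = tagged.reduceByKey(Integer::sum);

        // Action: triggers execution and collects "Data 2" back to the driver
        sums.collect().forEach(t -> System.out.println(t._1() + " -> " + t._2()));

        spark.stop();
    }
}
```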

Parallel processing with Spark

  • Processing data includes multiple activities
    • Reading data from sources
    • Transformations, filters, and checks
    • Aggregations
    • Writing data to sinks
  • Parallelism is needed in all activities
  • Non-parallelizable steps become bottlenecks
Reading Data from Sources
  • Spark supports out of the box parallel reads for various sources
    • JDBC - partitioning reads by column values
    • Kafka - each executor reads from subset of partitions
  • Custom parallelism if needed
  • Predicate pushdown optimizes the data read into memory (see the sketch below)
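A minimal sketch of a partitioned JDBC read with predicate pushdown, assuming a hypothetical warehouse_stock table on a local MariaDB instance; the connection details, column names, and bounds are illustrative, and the MariaDB JDBC driver must be on the classpath.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParallelJdbcRead {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ParallelJdbcRead").master("local[*]").getOrCreate();

        // Partitioned read: Spark issues one query per partition over the id range,
        // so executors read slices of the table in parallel
        Dataset<Row> stock = spark.read()
                .format("jdbc")
                .option("url", "jdbc:mariadb://localhost:3306/warehouse_stock_db")
                .option("driver", "org.mariadb.jdbc.Driver")
                .option("dbtable", "warehouse_stock")
                .option("user", "spark")
                .option("password", "spark")
                .option("partitionColumn", "id")   // numeric column to split on
                .option("lowerBound", "1")
                .option("upperBound", "100000")
                .option("numPartitions", "4")
                .load();

        // This simple filter is pushed down into the generated SQL (predicate
        // pushdown), so only matching rows are read into memory
        Dataset<Row> today = stock.filter("stock_date = '2022-05-13'");

        today.show(10);
        spark.stop();
    }
}
```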
Transform data
  • Transformations are parallelized as much as possible by Spark
  • Design best practices can help optimize data processing
    • Perform map()-type operations early in the cycle
    • Reduce shuffling for reduce()-type operations
    • Repartition when needed (see the sketch below)
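A sketch of these practices on a hypothetical stock dataset (the file path and column names are assumed): filter and project early, optionally repartition by the grouping key, and aggregate once.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TransformBestPractices {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("TransformBestPractices").master("local[*]").getOrCreate();

        // Hypothetical stock records read from CSV files
        Dataset<Row> stock = spark.read()
                .option("header", "true").option("inferSchema", "true")
                .csv("data/warehouse_stock.csv");

        Dataset<Row> aggregated = stock
                // 1. Filter and project early: less data flows through later stages
                .filter(col("quantity").gt(0))
                .select("item", "warehouse", "quantity")
                // 2. Optionally repartition by the grouping key to control
                //    partitioning (e.g., to address skew) before aggregating
                .repartition(col("item"))
                // 3. A single aggregation pass instead of repeated reduce-style passes
                .groupBy("item")
                .agg(sum("quantity").alias("total_quantity"));

        aggregated.show();
        spark.stop();
    }
}
```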
Writing Data to Sinks
  • Spark supports out-of-the-box parallel writes for various sinks
    • JDBC - concurrent appends from executors
    • Kafka - concurrent writes to topics
  • Custom parallelism if needed
  • Batching capabilities to optimize writes (see the sketch below)
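A sketch of a parallel JDBC write, assuming a hypothetical global_stock table and connection details: each partition appends over its own connection concurrently, and batchsize controls how many rows go into each insert batch.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class ParallelJdbcWrite {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ParallelJdbcWrite").master("local[*]").getOrCreate();

        // Hypothetical pre-aggregated data to be loaded into the sink
        Dataset<Row> globalStock = spark.read().parquet("data/global_stock");

        // Each partition opens its own JDBC connection and appends concurrently;
        // batchsize sets how many rows are sent per INSERT batch
        globalStock.write()
                .format("jdbc")
                .option("url", "jdbc:mariadb://localhost:3306/global_stock_db")
                .option("dbtable", "global_stock")
                .option("user", "spark")
                .option("password", "spark")
                .option("batchsize", "1000")
                .mode(SaveMode.Append)
                .save();

        spark.stop();
    }
}
```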

Spark execution plan

  • Lazy execution - only an action triggers execution
  • Spark optimizer comes up with a physical plan
  • The physical plan optimizes for
    • Reduced I/O
    • Reduced shuffling
    • Reduced memory usage
Spark executors can read and write directly from external sources when they support parallel I/O (HDFS, Kafka, JDBC,…)
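A small example of inspecting the plan with explain(): nothing runs until the action, and the printed physical plan shows the choices the optimizer made (the file path and columns are hypothetical).

```java
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ExplainPlanSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ExplainPlanSketch").master("local[*]").getOrCreate();

        Dataset<Row> visits = spark.read()
                .option("header", "true").option("inferSchema", "true")
                .csv("data/visits.csv");

        // Nothing has executed yet: these are lazy transformations
        Dataset<Row> byCountry = visits
                .filter(col("duration").gt(0))
                .groupBy("country")
                .agg(avg("duration").alias("avg_duration"));

        // Print the logical and physical plans chosen by the optimizer
        byCountry.explain(true);

        // Only this action actually triggers execution
        byCountry.show();
        spark.stop();
    }
}
```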

Stateful stream processing

Checkpoint
  • Save job state to a persistent location such as HDFS, S3
  • Recover job state when failures happen or the job is restarted
  • Save metadata and RDDs
    • Kafka offsets
    • State tracking by keys
    • RDDs requiring transformation across multiple batches
Watermarks
  • Used for event-time window operations when late data needs to be handled
  • Spark waits until all events for a given time window are expected to have arrived
  • Spark keeps track of events and their ordering (see the sketch below)
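A minimal watermarking sketch using Spark's built-in rate source in place of a real event stream (the checkpoint path is illustrative): late events are accepted for up to 30 seconds, after which a window's state is finalized and dropped; the checkpoint location persists offsets and window state for recovery.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class WatermarkSketch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("WatermarkSketch").master("local[*]").getOrCreate();

        // Built-in "rate" source: generates rows with a "timestamp" column,
        // standing in here for an event stream with event times
        Dataset<Row> events = spark.readStream()
                .format("rate")
                .option("rowsPerSecond", "5")
                .load();

        // Watermark: accept events up to 30 seconds late; older windows are
        // finalized and their state is dropped
        Dataset<Row> windowedCounts = events
                .withWatermark("timestamp", "30 seconds")
                .groupBy(window(col("timestamp"), "5 seconds"))
                .count();

        // Checkpointing persists offsets and window state so the job can
        // recover after a failure or restart
        StreamingQuery query = windowedCounts.writeStream()
                .outputMode("update")
                .format("console")
                .option("checkpointLocation", "/tmp/checkpoints/watermark-sketch")
                .start();

        query.awaitTermination();
    }
}
```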
State tracking by Keys
  • Tracks current state based on specified keys
  • Change the state of the key as new significant events arrive
  • UpdateStateByKey operation
  • New states are published as a DStream for downstream consumption (see the sketch below)
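A minimal updateStateByKey sketch with the DStream API, assuming a hypothetical socket source that emits one country name per line; the running per-key totals come back as a new DStream for downstream consumption.

```java
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.Optional;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import scala.Tuple2;

public class CountryCounterSketch {
    public static void main(String[] args) throws InterruptedException {
        SparkConf conf = new SparkConf()
                .setAppName("CountryCounterSketch").setMaster("local[2]");
        JavaStreamingContext ssc = new JavaStreamingContext(conf, Durations.seconds(5));

        // updateStateByKey needs a checkpoint directory for its state RDDs
        ssc.checkpoint("/tmp/checkpoints/country-counters");

        // Hypothetical source: one country name per line on a local socket
        JavaReceiverInputDStream<String> countries = ssc.socketTextStream("localhost", 9999);

        JavaPairDStream<String, Integer> ones =
                countries.mapToPair(country -> new Tuple2<>(country, 1));

        // Keep a running visit counter per country across batches
        JavaPairDStream<String, Integer> runningCounts = ones.updateStateByKey(
                (List<Integer> newValues, Optional<Integer> state) -> {
                    int total = state.isPresent() ? state.get() : 0;
                    for (Integer v : newValues) {
                        total += v;
                    }
                    return Optional.of(total);
                });

        // The updated state is itself a DStream that downstream stages can consume
        runningCounts.print();

        ssc.start();
        ssc.awaitTermination();
    }
}
```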

Spark analytics and ML

Spark analytics
  • Spark SQL provides a SQL interface for computations
  • Batch and streaming use cases
  • Simple to use, yet powerful
  • Process and analyze data in the same pipeline
  • Cascading analytics
  • Publish analytics results to databases or streams (see the sketch below)
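A small Spark SQL sketch over a hypothetical visits dataset (path and column names assumed): register a view, query it with SQL, and cascade a second query over the first result.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlAnalytics {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SparkSqlAnalytics").master("local[*]").getOrCreate();

        Dataset<Row> visits = spark.read()
                .option("header", "true").option("inferSchema", "true")
                .csv("data/visits.csv");

        // Register the Dataset as a view and analyze it with plain SQL
        visits.createOrReplaceTempView("visits");

        Dataset<Row> stats = spark.sql(
                "SELECT country, last_action, COUNT(*) AS visit_count, AVG(duration) AS avg_duration "
              + "FROM visits GROUP BY country, last_action ORDER BY visit_count DESC");

        // Cascading analytics: results can feed another query or be published to a sink
        stats.createOrReplaceTempView("visit_stats");
        spark.sql("SELECT country, SUM(visit_count) AS total_visits "
                + "FROM visit_stats GROUP BY country")
             .show();

        spark.stop();
    }
}
```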
Spark ML
  • Spark ML allows preprocessing of data for ML
    • Feature extraction, transformation, dimensionality reduction
  • ML algorithms - training and inference
  • Pipelines to stitch data extraction, preprocessing, and inference into a single flow
  • Integrated data engineering, analytics, and ML (see the sketch below)
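A minimal Spark ML pipeline sketch on hypothetical labeled visit data (file path and columns assumed): feature preprocessing and a classifier are stitched into one pipeline used for both training and inference.

```java
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MlPipelineSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("MlPipelineSketch").master("local[*]").getOrCreate();

        // Hypothetical training data: visit duration, page count, and whether
        // the visit ended in a purchase (the label)
        Dataset<Row> visits = spark.read()
                .option("header", "true").option("inferSchema", "true")
                .csv("data/labeled_visits.csv");

        // Preprocessing stages: index the label, assemble numeric features
        StringIndexer labelIndexer = new StringIndexer()
                .setInputCol("purchased").setOutputCol("label");
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"duration", "page_count"})
                .setOutputCol("features");
        LogisticRegression lr = new LogisticRegression();

        // A single pipeline stitches preprocessing and training together
        Pipeline pipeline = new Pipeline()
                .setStages(new PipelineStage[]{labelIndexer, assembler, lr});

        PipelineModel model = pipeline.fit(visits);

        // Inference on the same (or new) data
        model.transform(visits).select("duration", "page_count", "prediction").show();

        spark.stop();
    }
}
```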

Batch Processing Pipelines

Problem statement

Stock Aggregation: Business scenarios
  • An enterprise has warehouses across the globe
  • Each warehouse has a local data center
  • A stock management application runs in each warehouse
  • A local MariaDB database keeps track of warehouse stock
  • Stock maintained by item and day - opening stock, receipts, and issues
Stock Aggregation: Requirement
  • Create and manage a central, consolidated stock database
  • Item stock aggregated across locations on a daily basis
  • Batch processing to upload warehouse data into a central cloud and manage stock
  • Scalable to hundreds of warehouses
Stock Aggregation: Design
flowchart LR
    classDef blue fill:#66deff,stroke:#000,color:#000
    classDef brown fill:#C3C3C3,stroke:#000,color:#000
    subgraph warehouse [ ]
        direction TB
        subgraph wLD [Warehouse: London]
            direction LR
            whLD[(warehouse_stock)]:::blue --> uploaderJ1(Stock Uploader Job):::brown
        end
        subgraph wNY [Warehouse: New York]
            direction LR
            whNY[(warehouse_stock)]:::blue --> uploaderJ2(Stock Uploader Job):::brown
        end
        subgraph wLA [Warehouse: Los Angeles]
            direction LR
            whLA[(warehouse_stock)]:::blue --> uploaderJ3(Stock Uploader Job):::brown
        end
    end
    uploaderJ1 --> dfs
    uploaderJ2 --> dfs
    uploaderJ3 --> dfs
    subgraph dataCenter [Central Data Center]
        direction LR
        dfs[(DFS)]:::blue --> uploaderJ4(Stock Aggregation Job):::brown
        uploaderJ4 --> stock[(global stock)]:::blue
    end

Setting up the local DB

Upload stock to a central store

Aggregating stock across warehouses
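Not the course's exercise code, but a minimal sketch of such an aggregation job under assumed column names, paths, and connection details: read the files uploaded by the warehouses into the central DFS, consolidate stock by item and day, and write the result to the global stock database.

```java
import static org.apache.spark.sql.functions.sum;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class StockAggregationJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("StockAggregationJob").master("local[*]").getOrCreate();

        // Files uploaded by each warehouse's Stock Uploader Job into the central DFS
        Dataset<Row> warehouseStock = spark.read()
                .option("header", "true").option("inferSchema", "true")
                .csv("dfs/warehouse_stock/*.csv");

        // Consolidate stock by item and day across all warehouses
        Dataset<Row> globalStock = warehouseStock
                .groupBy("stock_date", "item")
                .agg(sum("opening_stock").alias("opening_stock"),
                     sum("receipts").alias("receipts"),
                     sum("issues").alias("issues"));

        // Write the consolidated view to the central global stock database
        globalStock.write()
                .format("jdbc")
                .option("url", "jdbc:mariadb://localhost:3306/global_stock_db")
                .option("dbtable", "global_stock")
                .option("user", "spark")
                .option("password", "spark")
                .mode(SaveMode.Append)
                .save();

        spark.stop();
    }
}
```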

Real-time Processing Pipelines

Real-time use case: Problem

Website Analytics: Business scenario
  • An enterprise runs an e-commerce website (like Amazon) selling multiple items
  • Used across multiple countries
  • Enterprise wants to track visits by users in real-time
    • Visit date
    • Country
    • Duration of the visit
    • Last action (catalog, FAQ, shopping cart, order)
Website Analytics: Requirements
  • Receive a stream of website visit records
    • Visit date, last action, country, duration
  • Compute in real-time
    • Five-second summaries by the last action
    • Running visit counters by country
    • Abandoned shopping carts (to a separate queue)
Website Analytics Design
flowchart LR
    classDef blue fill:#66deff,stroke:#000,color:#000
    classDef brown fill:#C3C3C3,stroke:#000,color:#000
    spark.streaming.website.visits -.- kafka1
    Eapp(Ecommerce App):::blue --> kafka1((Kafka)):::brown
    kafka1 --> job1(Website Analytics Job):::blue
    spark.streaming.carts.abandoned -.- kafka2
    job1 --> kafka2((Kafka)):::brown
    country-stats -.- redis
    job1 --> redis[(Redis)]:::blue
    website_stats.visit_stats -.- db
    job1 --> db[(MariaDB)]:::blue

Generating a visits data stream

Building a website analytics job
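Again a hedged sketch rather than the course's exercise code, assuming a JSON record layout, a local Kafka broker, and the spark-sql-kafka package on the classpath: it computes five-second summaries by last action and routes shopping-cart visits to the abandoned-carts topic (the Redis and MariaDB sinks from the design are omitted here).

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.from_json;
import static org.apache.spark.sql.functions.struct;
import static org.apache.spark.sql.functions.to_json;
import static org.apache.spark.sql.functions.window;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class WebsiteAnalyticsJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("WebsiteAnalyticsJob").master("local[*]").getOrCreate();

        // Assumed JSON layout of a visit record
        StructType schema = new StructType()
                .add("visitDate", DataTypes.TimestampType)
                .add("country", DataTypes.StringType)
                .add("duration", DataTypes.IntegerType)
                .add("lastAction", DataTypes.StringType);

        // Visit records published by the e-commerce app
        Dataset<Row> visits = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "spark.streaming.website.visits")
                .option("startingOffsets", "latest")
                .load()
                .select(from_json(col("value").cast("string"), schema).alias("visit"))
                .select("visit.*");

        // Five-second summaries by last action
        Dataset<Row> summaries = visits
                .withWatermark("visitDate", "1 minute")
                .groupBy(window(col("visitDate"), "5 seconds"), col("lastAction"))
                .count();

        summaries.writeStream()
                .outputMode("update")
                .format("console")
                .option("checkpointLocation", "/tmp/checkpoints/visit-summaries")
                .start();

        // Abandoned shopping carts routed to a separate Kafka topic
        Dataset<Row> abandonedCarts = visits
                .filter(col("lastAction").equalTo("ShoppingCart"))
                .select(to_json(struct(col("*"))).alias("value"));

        abandonedCarts.writeStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("topic", "spark.streaming.carts.abandoned")
                .option("checkpointLocation", "/tmp/checkpoints/carts-abandoned")
                .start();

        spark.streams().awaitAnyTermination();
    }
}
```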

Executing the real-time pipeline

Data Engineering Spark: Best Practices

Batch vs. real-time options

Is Real-time really needed?
  • Tendency to build all pipelines in real-time
    • It will be super fast
    • It’s cool
  • Real-time comes with significant design complexities
    • Unbounded datasets
    • Windowing, watermarks, and missed events
    • State management
    • Reprocessing
When to choose real-time
  • Real-time responses or actions needed (few seconds of latency)
  • Input data is a continuous stream of events
  • Processing involves minimal aggregation and analytics
  • Compute resources are sufficiently available
  • Hybrid pipelines are possible

Scaling extraction and loading operations

Scaling Extraction
  • Spark supports parallel extraction from various data sources
  • Analyze available options for the given source
  • Choose source technology and build the source schema to suit parallel operations
  • Push down filtering operations to data sources
  • Build defenses against missed records and duplication
Scaling Loading
  • Each Spark executor can write independently to a data sink
  • Analyze how Spark works with a specific data sink technology (transactions, locking, serialization,…)
  • Limit the number of parallel connections to a sink based on its capacity (repartition if needed; see the sketch below)
  • Build defenses against missed and duplicate updates
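A small sketch of capping sink parallelism, with hypothetical paths and connection details: coalesce reduces the number of partitions, and therefore the number of concurrent JDBC connections opened against the sink.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class LimitSinkConnections {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("LimitSinkConnections").master("local[*]").getOrCreate();

        Dataset<Row> results = spark.read().parquet("data/aggregated_results");

        // A dataset with hundreds of partitions would open one JDBC connection per
        // partition; coalesce caps the number of concurrent writers to the sink
        results.coalesce(8)
                .write()
                .format("jdbc")
                .option("url", "jdbc:mariadb://localhost:3306/global_stock_db")
                .option("dbtable", "aggregated_results")
                .option("user", "spark")
                .option("password", "spark")
                .mode(SaveMode.Append)
                .save();

        spark.stop();
    }
}
```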

Scaling processing operations

  • Filter unwanted data early in the pipeline
  • Minimize reduce operations and associated shuffling
  • Repartition when needed
  • Cache results when appropriate
  • Avoid actions that send data back to the driver program until the end (see the sketch below)
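A short sketch of the last two points on a hypothetical visits dataset (paths and columns assumed): cache a filtered dataset that feeds several aggregations, and write the results to a sink instead of collecting them to the driver.

```java
import static org.apache.spark.sql.functions.avg;
import static org.apache.spark.sql.functions.col;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.storage.StorageLevel;

public class CacheAndSinkSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("CacheAndSinkSketch").master("local[*]").getOrCreate();

        // Filter unwanted data as early as possible
        Dataset<Row> visits = spark.read()
                .option("header", "true").option("inferSchema", "true")
                .csv("data/visits.csv")
                .filter(col("duration").gt(0));

        // Cache a dataset that feeds several downstream aggregations
        visits.persist(StorageLevel.MEMORY_AND_DISK());

        Dataset<Row> byCountry = visits.groupBy("country").count();
        Dataset<Row> byAction = visits.groupBy("last_action")
                .agg(avg("duration").alias("avg_duration"));

        // Write results to a sink instead of collecting them back to the driver
        byCountry.write().mode("overwrite").parquet("output/visits_by_country");
        byAction.write().mode("overwrite").parquet("output/avg_duration_by_action");

        visits.unpersist();
        spark.stop();
    }
}
```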

Building resiliency

  • Spark provides several resiliency capabilities against failures (task, stage, executor)
  • Monitor jobs and Spark clusters for failures
  • Use checkpoints for streaming jobs with persistent stores
  • Build resiliency in associated input stores and output sinks
  • Compute infrastructure resiliency is needed too
