Introduction
- We focus on how to build data engineering pipelines with Apache Spark
Prerequisites
- Apache Spark
- Basics
- Structured streaming
- SQL
- Java concepts and operations
- Data stores: Kafka, MariaDB, Redis
- Docker
Exercises
- Start Docker Desktop
- Start Apache Spark, MariaDB, Kafka, and Redis defined in spark-docker.yml
#cd "/home/chinhuy/learn/Ex_Files_Apache_Spark_EssT_Big_Data_Eng/Exercise Files" #initial start docker-compose -f spark-docker.yml up -d #start services docker-compose -f spark-docker.yml start #stop services docker-compose -f spark-docker.yml stop #cd /home/chinhuy/tools/idea-IC-221.5787.30 #./bin/idea.sh
Data Engineering Concepts
What is data engineering?
- Data engineering designs and builds systems that collect and analyze data to deliver insights and actions
- Essential for large-scale data processing with low latency
- New roles: data engineer, big data architect
- Prerequisite for machine learning
Data engineering vs Data Analytics vs Data Science
| Function | Data Engineering | Business Analytics | Data Science |
| --- | --- | --- | --- |
| Integrate Data Sources | yes | | yes |
| Build Data Pipelines | yes | | yes |
| Process and Transform Data | yes | | yes |
| Store Data | yes | | yes |
| Dashboards and Reports | | yes | yes |
| Exploratory Analytics | | yes | yes |
| Statistical Modeling | | yes | yes |
| Machine Learning | | | yes |
| Business Recommendations | | yes | yes |
| Business Actions | yes | yes | yes |
Data Engineering functions
Data Acquisition
```mermaid
flowchart TD
    classDef blue fill:#66deff,stroke:#000,color:#000
    S1[Acquisition]:::blue --> S2(Transport)
    S2 --> S3(Storage)
    S3 --> S4(Processing)
    S4 --> S5(Serving)
```
- Format of data
- Interfaces
- Security
- Reliability
Data Transport
```mermaid
flowchart TD
    classDef blue fill:#66deff,stroke:#000,color:#000
    S1[Acquisition] --> S2(Transport):::blue
    S2 --> S3(Storage)
    S3 --> S4(Processing)
    S4 --> S5(Serving)
```
- Reliability and integrity
- Scale
- Latency
- Cost
Data Storage
```mermaid
flowchart TD
    classDef blue fill:#66deff,stroke:#000,color:#000
    S1[Acquisition] --> S2(Transport)
    S2 --> S3(Storage):::blue
    S3 --> S4(Processing)
    S4 --> S5(Serving)
```
- Flexibility
- Schema design
- Query patterns
- High availability
Data Processing
```mermaid
flowchart TD
    classDef blue fill:#66deff,stroke:#000,color:#000
    S1[Acquisition] --> S2(Transport)
    S2 --> S3(Storage)
    S3 --> S4(Processing):::blue
    S4 --> S5(Serving)
```
- Cleansing
- Filtering
- Enriching
- Aggregating
- Machine Learning
Data Serving
```mermaid
flowchart TD
    classDef blue fill:#66deff,stroke:#000,color:#000
    S1[Acquisition] --> S2(Transport)
    S2 --> S3(Storage)
    S3 --> S4(Processing)
    S4 --> S5(Serving):::blue
```
- Pull vs push
- Latency and high availability
- Security
- Skill levels
- Flexibility of schema
Batch vs Real-time processing
Batch processing
- Processing data in batches, with defined size and windows
- Source data does not change during processing
- Bounded datasets
- High latency, creating output at the end of processing
- Easy reproduction of processing and outputs
Real-time processing
- Process data as it is created at the source
- Dynamic source data may change during processing: add, modify, or delete data
- Unbounded datasets
- Low latency requirements
- State management
Data Engineering with Spark
- Spark is arguably the best technology for data engineering
- It supports both batch and real-time processing
- Native support for several popular data sources
- Advanced parallel-processing capabilities
- MapReduce, windowing, state management, joins
- Graph and machine learning support
Spark Capabilities for ETL
Spark architecture review
```mermaid
flowchart LR
    classDef blue fill:#66deff,stroke:#000,color:#000
    db1[(Source DB)] --> data1[Data 1]:::blue
    subgraph "Driver Node"
        direction LR
        data1 -.- data2[Data 2]:::blue
    end
    data2[Data 2] --> db2[(Destination DB)]
    subgraph "Executor Node 1"
        direction TB
        data1 -- Partition --> rdd11[RDD 11]:::blue
        rdd11 -- Transform --> rdd12[RDD 12]:::blue
        rdd12 -- Action --> rdd13[RDD 13]:::blue
    end
    rdd13 -- Collect --> data2
    subgraph "Executor Node 2"
        direction TB
        data1 -- Partition --> rdd21[RDD 21]:::blue
        rdd21 -- Transform --> rdd22[RDD 22]:::blue
        rdd22 -- Action --> rdd23[RDD 23]:::blue
    end
    rdd23 -- Collect --> data2
    rdd12 -- Shuffle --> rdd23
```
- The driver node reads Data 1 from the source
- Data 1 is broken into partitions (RDDs) and distributed to executor nodes
- Transform operations happen locally on each executor
- Action operations trigger execution and create shuffles
- The driver node collects the results back
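A minimal sketch of this flow using the RDD API; the values, app name, and local master are illustrative, not taken from the exercise files:

```java
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SparkSession;

import java.util.Arrays;

public class PartitionDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("PartitionDemo")
                .master("local[2]")            // two local executor threads
                .getOrCreate();
        JavaSparkContext sc = new JavaSparkContext(spark.sparkContext());

        // The driver creates the data; Spark splits it into 2 partitions across executors
        JavaRDD<Integer> data1 = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6), 2);

        // Transformation: runs locally on each partition, no shuffle
        JavaRDD<Integer> doubled = data1.map(x -> x * 2);

        // Action: triggers execution; collect() brings results back to the driver
        System.out.println(doubled.collect());

        spark.stop();
    }
}
```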
Parallel processing with Spark
- Processing data includes multiple activities
- Reading data from sources
- Transformations, filters, and checks
- Aggregations
- Writing data to sinks
- Parallelism is needed in all activities
- Non-parallelizable steps become bottlenecks
Reading Data from Sources
- Spark supports out-of-the-box parallel reads for various sources
- JDBC - partitioning reads by column values
- Kafka - each executor reads from a subset of partitions
- Custom parallelism if needed
- Predicate pushdown optimizes the data read into memory
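A hedged sketch of a partitioned JDBC read with a pushed-down filter; the table name, partition column, bounds, and credentials are assumptions, not the course's actual schema:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParallelReadDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("ParallelReadDemo")
                .master("local[*]")
                .getOrCreate();

        // Partitioned JDBC read: Spark issues one query per partition,
        // splitting the id range across executors.
        Dataset<Row> stock = spark.read()
                .format("jdbc")
                .option("url", "jdbc:mariadb://localhost:3306/warehouse_stock")
                .option("dbtable", "item_stock")
                .option("user", "spark")
                .option("password", "spark")
                .option("partitionColumn", "id")
                .option("lowerBound", "1")
                .option("upperBound", "100000")
                .option("numPartitions", "4")
                .load();

        // Filters on a JDBC source are pushed down to the database where possible,
        // so only matching rows are read into memory.
        stock.filter("stock_date = current_date()").show();
    }
}
```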
Transform data
- Transformations are parallelized as much as possible by Spark
- Design best practices can help optimize data processing
- Perform map()-type operations early in the cycle
- Reduce shuffling for reduce()-type operations
- Repartitioning
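As a rough illustration of these practices, the sketch below filters early, repartitions by the grouping key, and leaves the single shuffle-heavy aggregation for last; the column names and sample rows are made up:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.*;
import static org.apache.spark.sql.functions.*;

import java.util.Arrays;

public class TransformDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("TransformDemo").master("local[*]").getOrCreate();

        StructType schema = new StructType()
                .add("country", DataTypes.StringType)
                .add("duration", DataTypes.IntegerType);
        Dataset<Row> visits = spark.createDataFrame(Arrays.asList(
                RowFactory.create("UK", 120),
                RowFactory.create("UK", 0),
                RowFactory.create("USA", 45)), schema);

        Dataset<Row> summary = visits
                .filter(col("duration").gt(0))       // narrow map/filter work first, no shuffle
                .repartition(col("country"))         // co-locate rows by the grouping key once
                .groupBy("country")                  // the single wide (shuffle) operation, done late
                .agg(sum("duration").alias("total_duration"));

        summary.show();
        spark.stop();
    }
}
```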
Writing Data to Sinks
- Spark supports out-of-the-box parallel writes for various sinks
- JDBC - concurrent appends from executors
- Kafka - concurrent writes to topics
- Custom parallelism if needed
- Batching capabilities to optimize writes
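A hedged sketch of a parallel JDBC write; the database, table, credentials, and batch size are assumptions:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;

public class SinkDemo {
    // Writes each partition of the Dataset concurrently through JDBC.
    static void writeToMariaDb(Dataset<Row> summary) {
        summary.write()
                .format("jdbc")
                .option("url", "jdbc:mariadb://localhost:3306/website_stats")
                .option("dbtable", "visit_stats")
                .option("user", "spark")
                .option("password", "spark")
                .option("batchsize", "1000")   // batch inserts to cut round trips
                .mode(SaveMode.Append)         // executors append their partitions in parallel
                .save();
    }
}
```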
Spark execution plan
- Lazy execution - only an action triggers execution
- Spark optimizer comes up with a physical plan
- The physical plan optimizes for
- Reduced I/O
- Reduced shuffling
- Reduced memory usage
Spark executors can read from and write to external data stores directly when those stores support parallel I/O (HDFS, Kafka, JDBC, …)
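A short sketch showing lazy execution and plan inspection; it assumes a hypothetical visits Dataset with country and duration columns:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.col;

public class PlanDemo {
    // Builds a transformation chain lazily and prints the plans without running anything.
    static void showPlan(Dataset<Row> visits) {
        Dataset<Row> result = visits
                .filter(col("duration").gt(0))
                .groupBy("country")
                .count();

        result.explain(true);   // parsed, analyzed, optimized, and physical plans
        // No job has run yet; an action such as result.show() would trigger execution.
    }
}
```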
Stateful stream processing
Checkpoint
- Save job state to a persistent location such as HDFS, S3
- Recover job state when failures happen or the job is restarted
- Save metadata and RDDs
- Kafka offsets
- State tracking by keys
- RDDs requiring transformation across multiple batches
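A minimal sketch of checkpointing a Structured Streaming query; the output mode, console sink, and checkpoint path are placeholders (a production job would point at HDFS or S3):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.streaming.StreamingQuery;

public class CheckpointDemo {
    // Starts a streaming query whose offsets and state are checkpointed to a persistent path,
    // so the job can recover from failures or restarts.
    static StreamingQuery startWithCheckpoint(Dataset<Row> streamingCounts) throws Exception {
        return streamingCounts.writeStream()
                .outputMode("update")
                .format("console")
                .option("checkpointLocation", "/tmp/visit-checkpoints") // HDFS/S3 path in production
                .start();
    }
}
```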
Watermarks
- Used for event-time window operations when late data needs to be handled
- Spark waits for a configured delay so that late events for a given time window can still arrive before the window is finalized
- Spark keeps track of events and their ordering
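A hedged sketch of an event-time window with a watermark; the visit_time and last_action columns and the 30-second lateness threshold are assumptions:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import static org.apache.spark.sql.functions.*;

public class WatermarkDemo {
    // Five-second event-time windows that tolerate data arriving up to 30 seconds late.
    // Assumes visit_time is a timestamp column on the streaming Dataset.
    static Dataset<Row> windowedCounts(Dataset<Row> visits) {
        return visits
                .withWatermark("visit_time", "30 seconds")             // how long to wait for late events
                .groupBy(window(col("visit_time"), "5 seconds"),       // event-time window
                         col("last_action"))
                .count();
    }
}
```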
State tracking by Keys
- Tracks current state based on specified keys
- Change the state of the key as new significant events arrive
- UpdateStateByKey operation
- New states are published as a DStream for downstream consumption
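A minimal updateStateByKey sketch that keeps a running count per key; the key and value types are illustrative, and the job would also need a checkpoint directory configured on its StreamingContext:

```java
import org.apache.spark.api.java.Optional;
import org.apache.spark.streaming.api.java.JavaPairDStream;

import java.util.List;

public class StateDemo {
    // Keeps a running visit count per country across micro-batches.
    static JavaPairDStream<String, Integer> runningCounts(JavaPairDStream<String, Integer> countsPerBatch) {
        return countsPerBatch.updateStateByKey(
                (List<Integer> newValues, Optional<Integer> currentState) -> {
                    int updated = currentState.or(0);   // previous state, or 0 if the key is new
                    for (Integer v : newValues) {
                        updated += v;
                    }
                    return Optional.of(updated);        // new state, published downstream as a DStream
                });
    }
}
```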
Spark analytics and ML
Spark analytics
- Spark SQL provides a SQL interface for computations
- Batch and streaming use cases
- Simple to use, yet powerful
- Process and analyze data in the same pipeline
- Cascading analytics
- Publish analytics results to databases or streams
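A small sketch of running SQL over a registered view; the view name, columns, and query are assumptions:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SqlAnalyticsDemo {
    // Registers a view and runs SQL over it in the same pipeline; the same pattern
    // applies to batch and streaming Datasets.
    static Dataset<Row> countryTotals(SparkSession spark, Dataset<Row> visits) {
        visits.createOrReplaceTempView("visits");
        return spark.sql(
                "SELECT country, COUNT(*) AS visit_count, SUM(duration) AS total_duration "
                + "FROM visits GROUP BY country");
    }
}
```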
Spark ML
- Spark ML allows preprocessing of data for ML
- Feature extraction, transformation, dimensionality reduction
- ML algorithms - training and inference
- Pipelines to stitch data extraction, preprocessing, and inference into a single flow
- Integrated data engineering, analytics, and ML
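A hedged sketch of a Spark ML pipeline that chains preprocessing and training; the feature and label columns (and the choice of logistic regression) are illustrative, not the course's model:

```java
import org.apache.spark.ml.Pipeline;
import org.apache.spark.ml.PipelineModel;
import org.apache.spark.ml.PipelineStage;
import org.apache.spark.ml.classification.LogisticRegression;
import org.apache.spark.ml.feature.StringIndexer;
import org.apache.spark.ml.feature.VectorAssembler;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

public class MlPipelineDemo {
    // Chains feature preprocessing and a classifier into one pipeline.
    static PipelineModel train(Dataset<Row> visits) {
        StringIndexer indexer = new StringIndexer()
                .setInputCol("country").setOutputCol("country_idx");
        VectorAssembler assembler = new VectorAssembler()
                .setInputCols(new String[]{"country_idx", "duration"})
                .setOutputCol("features");
        LogisticRegression lr = new LogisticRegression()
                .setLabelCol("purchased").setFeaturesCol("features");

        Pipeline pipeline = new Pipeline()
                .setStages(new PipelineStage[]{indexer, assembler, lr});
        return pipeline.fit(visits);   // training; model.transform(...) runs inference
    }
}
```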
Batch Processing Pipelines
Problem statement
Stock Aggregation: Business scenario
- An enterprise has warehouses across the globe
- Each warehouse has a local data center
- A stock management application runs in each warehouse
- A local MariaDB database keeps track of warehouse stock
- Stock maintained by item and day - opening stock, receipts, and issues
Stock Aggregation: Requirement
- Create and manage a central, consolidated stock database
- Item stock aggregated across locations on a daily basis
- Batch processing to upload warehouse data into a central cloud and manage stock
- Scalable to hundreds of warehouses
Stock Aggregation: Design
```mermaid
flowchart LR
    classDef blue fill:#66deff,stroke:#000,color:#000
    classDef brown fill:#C3C3C3,stroke:#000,color:#000
    subgraph warehouse [ ]
        direction TB
        subgraph wLD [Warehouse: London]
            direction LR
            whLD[(warehouse_stock)]:::blue --> uploaderJ1(Stock Uploader Job):::brown
        end
        subgraph wNY [Warehouse: New York]
            direction LR
            whNY[(warehouse_stock)]:::blue --> uploaderJ2(Stock Uploader Job):::brown
        end
        subgraph wLA [Warehouse: Los Angeles]
            direction LR
            whLA[(warehouse_stock)]:::blue --> uploaderJ3(Stock Uploader Job):::brown
        end
    end
    uploaderJ1 --> dfs
    uploaderJ2 --> dfs
    uploaderJ3 --> dfs
    subgraph dataCenter [Central Data Center]
        direction LR
        dfs[( DFS )]:::blue --> uploaderJ4(Stock Aggregation Job):::brown
        uploaderJ4 --> stock[(global stock)]:::blue
    end
```
Setting up the local DB
Upload stock to a central store
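The exercise files walk through this job; as a rough, hedged sketch of what such an uploader could look like, assuming an item_stock table in the warehouse-local MariaDB and a Parquet directory on the central DFS (all table names, columns, credentials, and paths are assumptions):

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class StockUploaderJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("StockUploaderJob").master("local[*]").getOrCreate();

        // Read today's stock records from the warehouse-local MariaDB.
        Dataset<Row> stock = spark.read()
                .format("jdbc")
                .option("url", "jdbc:mariadb://localhost:3306/warehouse_stock")
                .option("dbtable", "item_stock")
                .option("user", "spark")
                .option("password", "spark")
                .load()
                .filter("stock_date = current_date()");

        // Append to the central store as Parquet, partitioned by warehouse and date.
        stock.write()
                .mode(SaveMode.Append)
                .partitionBy("warehouse_id", "stock_date")
                .parquet("/data/central/raw_stock");

        spark.stop();
    }
}
```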
Aggregating stock across warehouses
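A hedged sketch of the aggregation side, consuming the uploaded files and consolidating stock by item and day across warehouses; the column names, JDBC target, and credentials are assumptions:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

public class StockAggregatorJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("StockAggregatorJob").master("local[*]").getOrCreate();

        // Read the files uploaded by all warehouses.
        Dataset<Row> rawStock = spark.read().parquet("/data/central/raw_stock");

        // Aggregate opening stock, receipts, and issues by item and day.
        Dataset<Row> globalStock = rawStock
                .groupBy("item_id", "stock_date")
                .agg(sum("opening_stock").alias("opening_stock"),
                     sum("receipts").alias("receipts"),
                     sum("issues").alias("issues"));

        // Write the consolidated view to the global stock database.
        globalStock.write()
                .format("jdbc")
                .option("url", "jdbc:mariadb://localhost:3306/global_stock")
                .option("dbtable", "item_stock")
                .option("user", "spark")
                .option("password", "spark")
                .mode(SaveMode.Append)
                .save();

        spark.stop();
    }
}
```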
Real-time Processing Pipelines
Real-time use case: Problem
Website Analytics: Business scenario
- An enterprise runs an e-commerce website, similar to Amazon, selling multiple items
- Used across multiple countries
- Enterprise wants to track visits by users in real-time
- Visit date
- Country
- Duration of the visit
- Last action (catalog, FAQ, shopping cart, order)
Website Analytics: Requirements
- Receive a stream of website visit records
- Visit date, last action, country, duration
- Compute in real-time
- Five-second summaries by the last action
- Running visit counters by country
- Abandoned shopping carts (to a separate queue)
Website Analytics Design
```mermaid
flowchart LR
    classDef blue fill:#66deff,stroke:#000,color:#000
    classDef brown fill:#C3C3C3,stroke:#000,color:#000
    spark.streaming.website.visits -.- kafka1
    Eapp(Ecommerce App):::blue --> kafka1((Kafka)):::brown
    kafka1 --> job1(Website Analytics Job):::blue
    spark.streaming.carts.abandoned -.- kafka2
    job1 --> kafka2((Kafka)):::brown
    country-stats -.- redis
    job1 --> redis[(Redis)]:::blue
    website_stats.visit_stats -.- db
    job1 --> db[(MariaDB)]:::blue
```
Generating a visits data stream
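A hedged sketch of a generator that publishes JSON visit records to the spark.streaming.website.visits topic from the design above; the field names, sample values, and broker address are assumptions:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

import java.util.Properties;
import java.util.Random;

public class VisitsDataGenerator {
    public static void main(String[] args) throws InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        String[] countries = {"USA", "UK", "India"};
        String[] actions = {"Catalog", "FAQ", "ShoppingCart", "Order"};
        Random random = new Random();

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            while (true) {
                // Build one visit record: time, country, last action, duration.
                String visit = String.format(
                        "{\"visit_time\":\"%d\",\"country\":\"%s\",\"last_action\":\"%s\",\"duration\":%d}",
                        System.currentTimeMillis(),
                        countries[random.nextInt(countries.length)],
                        actions[random.nextInt(actions.length)],
                        random.nextInt(600));
                producer.send(new ProducerRecord<>("spark.streaming.website.visits", visit));
                Thread.sleep(1000);   // one visit record per second
            }
        }
    }
}
```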
Building a website analytics job
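A hedged sketch covering just the five-second-summaries requirement: read the visits topic, parse the JSON, and aggregate over event-time windows. The schema, broker address, and checkpoint path are assumptions, and the console sink stands in for the MariaDB, Redis, and Kafka sinks in the design:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;
import static org.apache.spark.sql.functions.*;

public class WebsiteAnalyticsJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("WebsiteAnalyticsJob").master("local[*]").getOrCreate();

        StructType visitSchema = new StructType()
                .add("visit_time", DataTypes.StringType)
                .add("country", DataTypes.StringType)
                .add("last_action", DataTypes.StringType)
                .add("duration", DataTypes.IntegerType);

        // Subscribe to the visits topic; each executor reads a subset of Kafka partitions.
        // Requires the spark-sql-kafka connector on the classpath.
        Dataset<Row> visits = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "spark.streaming.website.visits")
                .load()
                .select(from_json(col("value").cast("string"), visitSchema).alias("visit"))
                .select("visit.*")
                .withColumn("event_time",
                        col("visit_time").cast("long").divide(1000).cast("timestamp"));

        // Five-second summaries by last action.
        Dataset<Row> actionSummary = visits
                .withWatermark("event_time", "10 seconds")
                .groupBy(window(col("event_time"), "5 seconds"), col("last_action"))
                .count();

        // Console sink stands in here for the MariaDB/Redis/Kafka sinks in the design.
        StreamingQuery query = actionSummary.writeStream()
                .outputMode("update")
                .format("console")
                .option("checkpointLocation", "/tmp/website-analytics-checkpoints")
                .start();

        query.awaitTermination();
    }
}
```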
Executing the real-time pipeline
Data Engineering Spark: Best Practices
Batch vs. real-time options
Is Real-time really needed?
- Tendency to build all pipelines in real-time
- It will be super fast
- It’s cool
- Real-time comes with significant design complexities
- Unbounded datasets
- Windowing, watermarks, and missed events
- State management
- Reprocessing
When to choose real-time
- Real-time responses or actions needed (few seconds of latency)
- Input data is a continuous stream of events
- Processing involves minimal aggregation and analytics
- Compute resources are sufficiently available
- Hybrid pipelines are possible
Scaling extraction and loading operations
Scaling Extraction
- Spark supports parallel extraction from various data sources
- Analyze available options for the given source
- Choose source technology and build the source schema to suit parallel operations
- Push down filtering operations to data sources
- Build defenses against missed records and duplication
Scaling Loading
- Each Spark executor can write independently to a data sink
- Analyze how Spark works with a specific data sink technology (transactions, locking, serialization,…)
- Limit the number of parallel connections to a sink based on its capacity (repartition if needed)
- Build defenses against missed and duplicate updates
Scaling processing operations
- Filter unwanted data early in the pipeline
- Minimize reduce operations and associated shuffling
- Repartition when needed
- Cache results when appropriate
- Avoid actions that send data back to the driver program until the end
Building resiliency
- Spark provides several resiliency capabilities against failures (task, stage, executor)
- Monitor jobs and Spark clusters for failures
- Use checkpoints for streaming jobs with persistent stores
- Build resiliency in associated input stores and output sinks
- Compute infrastructure resiliency is needed too