Apache Beam#

Introduction#

  • https://beam.apache.org/

What#

  • https://beam.apache.org/get-started/beam-overview/
  • an advanced unified programming model
    • to define a pipeline
  • can define both batch and streaming data-parallel processing pipelines
    • the pipeline can then be run on any supported execution engine
    • engine: one of Beam's supported distributed processing backends (runners)
  • useful for embarrassingly parallel data processing tasks
    • in which the problem can be decomposed into many smaller bundles of data that can be processed independently and in parallel
    • can also use Beam for Extract, Transform, and Load (ETL) tasks and pure data integration
      • such tasks are useful for moving data between different storage media and data sources, transforming data into a more desirable format, or loading data onto a new system

Why#

  • Unified
    • Use a single programming model for both batch and streaming use cases
  • Portable
    • Execute pipelines on multiple execution environments
    • Language & Engine Agnostic
      • Java, Python, Go, Scala (via Scio), etc.
      • Direct/Local, Apache Flink, Spark, Samza, Nemo, GCP Dataflow, Twister2, Hazelcast Jet, etc.
  • Extensible
    • Write and share new SDKs, IO connectors, and transformation libraries
  • Open Source
    • Community-based development

How#

  • Sources
    • Beam reads your data from a diverse set of supported sources, whether on-premises or in the cloud
  • Processing
    • Beam executes your business logic for Batch and Streaming use cases
  • Sinks
    • Beam writes the results of your data processing logic to the most popular data destinations in the industry

Trivia#

  • originated as an internal Google project
  • the model and SDKs were previously known as Dataflow (donated from Google Cloud Dataflow)

Read More#

  • Quickstart: https://beam.apache.org/get-started/quickstart-py/
  • Programming Guide: https://beam.apache.org/documentation/programming-guide/#transforms

Concept#

Rough mapping of Beam concepts to their equivalents in other frameworks (e.g. Apache Storm's spout/bolt):

Pipeline --> DAG, Topology

Input --> Spout, PCollection[PValue], Source

Processor --> Bolt, PTransform[DoFn], Map, Lambda

Output --> Sink, PCollection[PValue], Target

%% HLD flow-diagram of a typical data pipeline framework
flowchart TB

    %% nodes
    input1("Init")
    input2("Data Extractor/Source/REST API Client")
    processor1["Data Preprocessor"]
    processor2["Data Transformer"]
    processor3["Data Loader"]
    output("Database")

    %% relationship
    input1 --> input2 -- "Data" --> processor1
    processor1 -- "Preprocessed Data" --> processor2
    processor2 -- "Transformed Data" --> processor3
    processor3 --> output

Figure 1. Flow diagram of a typical data processing pipeline/topology
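
A minimal sketch of that mapping in the Beam Python SDK (file paths are placeholders):

import apache_beam as beam

with beam.Pipeline() as pipeline:                      # Pipeline: the DAG/topology
    (
        pipeline
        | 'Read' >> beam.io.ReadFromText('input.txt')  # Source -> PCollection
        | 'ToUpper' >> beam.Map(str.upper)              # PTransform: the processor
        | 'Write' >> beam.io.WriteToText('output')      # Sink/Target
    )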

Pipeline#

PTransform#
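
A PTransform takes one or more PCollections and produces one or more PCollections; custom composite transforms subclass beam.PTransform and override expand(). A minimal sketch (the transform name is illustrative):

import apache_beam as beam

class CountWords(beam.PTransform):
    """Hypothetical composite transform bundling two steps into one reusable unit."""
    def expand(self, pcoll):
        return (
            pcoll
            | 'Split' >> beam.FlatMap(lambda line: line.split())
            | 'Count' >> beam.combiners.Count.PerElement()
        )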

WebSocket based PTransform#

  • https://rmannibucau.metawerx.net/apache-beam-websocket-output.html
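
The linked post builds a WebSocket output transform. A rough Python analogue, not the post's code, assuming the websocket-client package:

import apache_beam as beam
from websocket import create_connection  # pip install websocket-client

class WriteToWebSocket(beam.DoFn):
    """Hypothetical DoFn that pushes each element to a WebSocket endpoint."""
    def __init__(self, url):
        self._url = url
        self._ws = None

    def setup(self):
        # One connection per DoFn instance, reused across bundles.
        self._ws = create_connection(self._url)

    def process(self, element):
        self._ws.send(element)

    def teardown(self):
        if self._ws is not None:
            self._ws.close()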

ParDo#

DoFn#
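
ParDo applies a DoFn to every element of a PCollection; process() is invoked once per element and may yield zero or more outputs. A minimal sketch:

import apache_beam as beam

class SplitWords(beam.DoFn):
    # Called once per input element; a generator lets one input emit many outputs.
    def process(self, element):
        for word in element.split():
            yield word

with beam.Pipeline() as p:
    words = (
        p
        | beam.Create(['to be or', 'not to be'])
        | beam.ParDo(SplitWords())
    )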

Setup#

Prerequisites#

Create#

Run#

Run Locally#
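
No special flags are needed for a local run; the DirectRunner is the default (paths below are illustrative):

$ python -m exploratory.wordcount \
    --input kinglear.txt \
    --output outputs/ \
    --runner DirectRunner # optional; DirectRunner is the default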

Run on Dataflow Runner#

Using a service account:

$ python -m exploratory.wordcount \
    --region us-central1 \
    --input gs://dataflow-samples/shakespeare/kinglear.txt \
    --output gs://apache-beam-eg/results/outputs \
    --runner DataflowRunner \
    --project apache-beam-eg \
    --temp_location gs://apache-beam-eg/tmp/ \
    --service_account_email [email protected] \
    --subnetwork regions/us-central1/subnetworks/apache-beam-eg-usc1-temp # specify a subnetwork other than `default`

Python SDK#

References:

  • https://beam.apache.org/get-started/quickstart-py/

Syntax#

The pipeline syntax is inspired by Python's binary operators: their magic methods are overloaded to denote pipeline operations, e.g.

  • |

    • is a synonym for apply, which applies a PTransform to a PCollection to produce a new PCollection
    • obj.__or__(self, other)
    • obj.__ror__(self, other)
  • >>

    • allows you to name a step for easier display in various UIs
    • obj.__rrshift__(self, other)
  • <SomePCollection> | 'SomeLabel' >> <SomePTransform>
    • SomeLabel, the string between the | and the >>, is used only for display purposes and to uniquely identify that particular application of the transform (see the sketch after this list)
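
A minimal sketch of what the operators desugar to:

import apache_beam as beam

with beam.Pipeline() as p:
    nums = p | 'MakeNums' >> beam.Create([1, 2, 3])

    # 'Square' >> beam.Map(...) calls PTransform.__rrshift__('Square'),
    # which returns the transform with its label attached; the leading
    # `nums | ...` then calls __or__/__ror__, i.e. apply().
    squares = nums | 'Square' >> beam.Map(lambda x: x * x)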

References:

  • https://docs.python.org/3/reference/datamodel.html?highlight=ror#object.ror
  • https://stackoverflow.com/questions/43796046/explain-apache-beam-python-syntax

  • visualization

    • https://stackoverflow.com/questions/54094081/how-to-render-a-pipeline-graph-in-beam

Observability#

FAQ#

Why can't a custom Python object be used with a ParDo Fn?#

  • https://stackoverflow.com/questions/55822881/why-does-custom-python-object-cannot-be-used-with-pardo-fn
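
The usual culprit is serialization: Beam pickles DoFns and the elements flowing between workers, so custom types must be picklable or have a registered coder. A sketch of registering a custom coder (type and coder names are illustrative):

from apache_beam import coders

class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y

class PointCoder(coders.Coder):
    # Encode/decode to bytes so Beam can ship Points between workers.
    def encode(self, value):
        return ('%d,%d' % (value.x, value.y)).encode('utf-8')

    def decode(self, encoded):
        x, y = encoded.decode('utf-8').split(',')
        return Point(int(x), int(y))

    def is_deterministic(self):
        return True

coders.registry.register_coder(Point, PointCoder)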

Is there any monitoring dashboard for DirectRunner?#

So far, no. You can print the pipeline's node structure yourself, and you can profile a pipeline via ProfilingOptions (see the sketch below).
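
ProfilingOptions is exposed through pipeline options, e.g. (the output location is a placeholder):

$ python -m exploratory.wordcount \
    --runner DirectRunner \
    --profile_cpu \
    --profile_location /tmp/beam-profiles/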

TODO#

Explore#

  • mode
    • stream/batch
  • policy
    • fault tolerance, at-least-once/exactly-once?
  • strategy
    • processing
    • ack
  • partitioning/sharding based on input for parallelism
  • state management
  • windowing
  • visualization
  • debug on cloud
  • tickers - to auto schedule
  • Visually creating pipeline using Apache Hop https://beam.apache.org/case-studies/hop/