Who needs stream processing?
If you need to react to events as quickly as possible, stream processing fits perfectly in your world.
To demonstrate how you can use stream processing to consume data as it arrives, I will build a concrete example using the Kafka Streams client library and some of the best open-source frameworks on the market: Apache Storm and Apache Flink. Fortunately, I have had the luck of working with Kafka Streams, Apache Storm, and Apache Flink in production. The last one, Apache Flink, is the one I have most recently run in production, with exactly-once delivery semantics.
Let’s start our example …
John started working as a broker at a well-established investment bank, and to simplify his job he had the idea of hiring a Software Engineer to build an application capable of consuming stock price updates and generating alerts whenever a stock's price rises or falls within a short period of time.
The newly hired Software Engineer knows he needs to build a system capable of ingesting a tremendous volume of price updates, so whatever technology he adopts has to be fast and reliable enough to process all of the data.
After some analysis, our developer decides to use stream processing.
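To make the goal concrete before we dig into Flink's internals, here is a rough sketch of the kind of job the engineer might end up writing, using Flink's DataStream API. Everything in it (the `PriceUpdate` type, the 10-second window, the 10% threshold) is an illustrative assumption of mine, not part of John's actual requirements:

```java
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.windowing.ProcessWindowFunction;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.util.Collector;

public class PriceAlertJob {

    /** Hypothetical event type: one price update for a stock symbol. */
    public static class PriceUpdate {
        public String symbol;
        public double price;
        public long timestamp; // event time in milliseconds

        public PriceUpdate() {}

        public PriceUpdate(String symbol, double price, long timestamp) {
            this.symbol = symbol;
            this.price = price;
            this.timestamp = timestamp;
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // A real job would consume from Kafka; a tiny in-memory source keeps the sketch self-contained.
        DataStream<PriceUpdate> updates = env
                .fromElements(
                        new PriceUpdate("ACME", 100.0, 1_000L),
                        new PriceUpdate("ACME", 88.0, 5_000L))
                .assignTimestampsAndWatermarks(
                        WatermarkStrategy.<PriceUpdate>forMonotonousTimestamps()
                                .withTimestampAssigner((update, ts) -> update.timestamp));

        updates
                .keyBy(update -> update.symbol)
                // "A short period of time" modeled as a 10-second tumbling window per symbol.
                .window(TumblingEventTimeWindows.of(Time.seconds(10)))
                .process(new ProcessWindowFunction<PriceUpdate, String, String, TimeWindow>() {
                    @Override
                    public void process(String symbol, Context ctx,
                                        Iterable<PriceUpdate> prices, Collector<String> out) {
                        double min = Double.POSITIVE_INFINITY;
                        double max = Double.NEGATIVE_INFINITY;
                        for (PriceUpdate p : prices) {
                            min = Math.min(min, p.price);
                            max = Math.max(max, p.price);
                        }
                        // Alert when the price swings more than 10% within the window.
                        if (min > 0 && (max - min) / min > 0.10) {
                            out.collect("ALERT: " + symbol + " moved more than 10% within the window");
                        }
                    }
                })
                .print();

        env.execute("price-alert-sketch");
    }
}
```

In production, the `fromElements` source would be replaced by a Kafka consumer and the alerts would go to a proper sink instead of `print()`, but the shape of the pipeline would stay the same.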
Apache Flink
Before we start coding, it is important to understand what Apache Flink is and how its components are linked together to run an application.
Our agenda for today will be:
1. Components of Apache Flink
2. Deployment Modes
The next post will explain how to configure and start Flink on Kubernetes, making it easy to run the application we will build.
1. Components of Apache Flink
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
Description of Apache Flink, from the official documentation.
Flink, as you can imagine, consists of multiple components running across multiple machines in a distributed environment, with all of the challenges inherited by any distributed system. Flink does not reinvent the wheel: instead of implementing every piece of functionality itself, it relies on established industry standards. For instance, for high availability Apache Flink depends on Apache ZooKeeper, and for reliable distributed storage it can use HDFS, S3, and so on. This was a deliberate choice by the community, allowing Flink to focus on its core: a distributed processing engine for stateful computations over unbounded and bounded data streams.
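To give a taste of how this wiring looks in practice, the sketch below shows the kind of entries you might put in flink-conf.yaml to enable ZooKeeper-based high availability with S3 as the storage backend. The hostnames and bucket name are placeholders I made up for illustration:

```yaml
# Illustrative flink-conf.yaml excerpt (hostnames and bucket are placeholders).
high-availability: zookeeper
high-availability.zookeeper.quorum: zk-1:2181,zk-2:2181,zk-3:2181
# JobManager metadata lives in reliable storage; ZooKeeper only stores a pointer to it.
high-availability.storageDir: s3://my-flink-bucket/ha/
```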
Flink is composed of four components: Dispatcher, JobManager, ResourceManager, and TaskManager.
- Dispatcher
It exposes a REST API for job submission and lives across job lifetimes.
Besides the REST API, the Dispatcher also serves a web dashboard with information about submitted and running jobs.
- JobManager
It is the master process responsible for a single application. It is the JobManager's responsibility to request resources from the ResourceManager so that it has enough TaskManager slots to run the application.
- ResourceManager
It manages the available TaskManagers, keeping a pool of TaskManager slots that can be assigned to a JobManager on request. If a request cannot be satisfied from the pool, the ResourceManager can, depending on the deployment mode, talk to the resource provider (YARN, Kubernetes, …) to start more TaskManagers, whose slots are then assigned to the JobManager.
- TaskManager
TaskManagers are the workers that actually run the application handed out by the JobManager. Each TaskManager offers a finite number of slots, and it is in these slots that the job's tasks are executed. When the JobManager requests resources for a job, the ResourceManager assigns TaskManager slots to it, as the sketch below illustrates.
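To connect these components back to code: the main knob an application turns is its parallelism, which determines how many slots the JobManager will request. A minimal illustration (the parallelism value of 4 is arbitrary):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class ParallelismSketch {
    public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // With parallelism 4 and the default of one slot per TaskManager
        // (taskmanager.numberOfTaskSlots: 1), the ResourceManager must provide
        // four slots, i.e. four TaskManagers, before the job can start.
        env.setParallelism(4);
        // ... define the pipeline here, then call env.execute("job-name"); ...
    }
}
```

Note the division of responsibilities: the slot count per TaskManager is a cluster-side setting, while the job only declares how much parallelism it wants.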
2. Deployment Modes
Flink can be deployed in different ways: as a standalone cluster, on YARN, on Mesos, or on Kubernetes. We will focus on Kubernetes, but if you are interested in the other modes you can read more in the documentation.
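As a point of reference before we move to Kubernetes, the standalone mode is the quickest way to try things locally. Assuming you have unpacked a Flink distribution, something like the following starts a local cluster and submits a job through the Dispatcher (the class and jar names below are placeholders):

```sh
./bin/start-cluster.sh                                  # start a local standalone cluster
./bin/flink run -c com.example.PriceAlertJob app.jar    # submit a job through the Dispatcher
./bin/stop-cluster.sh                                   # tear the cluster down again
```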
By now you should have an idea of the components of Flink and how they interact with each other; these concepts will be important for getting your application running on a Flink cluster. In the next post, we will set up a Kubernetes cluster to run Flink jobs.