Don’t leave Apache Flink and Schema Registry alone

Thinking early about how your architecture will evolve can be tough, but it will be harder and more complex if you decide to tackle this challenge at a later stage.

At first sight, redesigning the architecture to include a Schema Registry may take time that could be invested in developing another feature, but believe me: as the system grows, it becomes necessary to have a catalog of the events, as well as a well-established definition of each one.

Schema Registry?

A Schema Registry solves the challenge of sharing schemas between applications, allowing:

  • Evolution of data in a simple way: consumers and producers can use different schema versions and still be compatible, depending on the compatibility strategy used (backward or forward compatibility).
  • Centralized repository/catalog: having a centralized piece that manages the data models gives your architecture a data catalog that knows all the data types in the system.
  • Schema enforcement: ensuring the data published matches the defined schema and that its constraints are respected.

This facilitates the implementation of other features like auditing and GDPR compliance, and it is a natural step toward having quality data in our system.

Apache Flink and Schema Registry

This integration uses Apache Kafka, Confluent Schema Registry, and Apache Flink.

Using Avro to define our model

Record definition in Avro
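A minimal sketch of such a schema, with assumed namespace and field types, could look like this:

```json
{
  "namespace": "com.example.events",
  "type": "record",
  "name": "Record",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "age",  "type": "int" },
    { "name": "date", "type": "string" },
    { "name": "city", "type": "string" }
  ]
}
```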

The above schema defines the event Record, composed of the fields name, age, date, and city. With the schema defined, it is time to start the Schema Registry; to do that, the images defined in the Confluent repo will be used.

docker-compose.yml
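A trimmed-down sketch of the services involved might look like the following (image versions and ports are assumptions):

```yaml
version: '3'
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:5.5.1
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:5.5.1
    depends_on:
      - zookeeper
    ports:
      - "9092:9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_LISTENER_SECURITY_PROTOCOL_MAP: PLAINTEXT:PLAINTEXT,PLAINTEXT_HOST:PLAINTEXT
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:29092,PLAINTEXT_HOST://localhost:9092
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1

  schema-registry:
    image: confluentinc/cp-schema-registry:5.5.1
    depends_on:
      - kafka
    ports:
      - "8081:8081"
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: PLAINTEXT://kafka:29092
```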

If you want to better understand what is going on with your schema-registry, you can use the visual tool schema-registry-ui to see/edit/evolve the schemas.

Other tools:

  • Kafka Tool: GUI application for managing and using Apache Kafka clusters
  • Kafkacat: Generic command line non-JVM Apache Kafka producer and consumer

I advise you to have at least Kafka Tool or kafkacat to fully understand the example and to test your wonderful Schema Registry.

Register the new schema

When using the Avro format, it is first necessary to register the schema before starting to use it. To do that, you can register it manually using the REST API or use the Maven plugin; the Maven plugin simplifies the process and reduces the complexity.

Schema registry plugin
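A sketch of the kafka-schema-registry-maven-plugin configuration (the subject name, schema path, and plugin version are assumptions) could be:

```xml
<plugin>
  <groupId>io.confluent</groupId>
  <artifactId>kafka-schema-registry-maven-plugin</artifactId>
  <version>5.5.1</version>
  <configuration>
    <schemaRegistryUrls>
      <param>http://localhost:8081</param>
    </schemaRegistryUrls>
    <subjects>
      <!-- subject name follows the default TopicNameStrategy: topic name + "-value" -->
      <records-value>src/main/resources/avro/record.avsc</records-value>
    </subjects>
  </configuration>
</plugin>
```

Running mvn schema-registry:register then registers the schema under the configured subject.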

To get a quick understanding of the basic concepts of the Schema Registry, you can read the Confluent documentation. Assuming you already have basic knowledge of what Maven is and how plugins work, maybe what you don't know yet is the concept of subjects, which is defined above.

After registering the schema, you can use your favorite tool to see what was created in the registry. By default, the schemas are stored in the topic _schemas.

Topic _schemas used to store the schemas created

Inserting messages

To send events, the utility kafka-avro-console-producer will be used to insert valid and invalid events. The schema of each event sent is validated against the Schema Registry that is up and running at the endpoint specified in the command arguments.
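The command looks roughly like this (the topic name records is an assumption):

```sh
kafka-avro-console-producer \
  --broker-list localhost:9092 \
  --topic records \
  --property schema.registry.url=http://localhost:8081 \
  --property value.schema.id=1
```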

Above, you can see the argument value.schema.id=1, which specifies the schema and must match the id of the schema you created.

Inserting a valid message:
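For example, assuming the fields sketched earlier, a valid record is a single JSON line such as:

```json
{"name": "John", "age": 30, "date": "2021-01-01", "city": "Lisbon"}
```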

Inserting an invalid message:
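For instance, the same kind of record without the date field:

```json
{"name": "Jane", "age": 25, "city": "Porto"}
```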

As the field date is required, the insertion will fail.

Build Flink Job

Apache Flink provides a built-in connector to read and write messages to Apache Kafka; in the documentation, you can easily find how to integrate the connector into the project.
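For reference, the dependencies involved in this integration are roughly the following (artifact IDs and versions vary by Flink release; the ones below assume the 1.11 line):

```xml
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-connector-kafka_2.12</artifactId>
  <version>1.11.2</version>
</dependency>
<dependency>
  <groupId>org.apache.flink</groupId>
  <artifactId>flink-avro-confluent-registry</artifactId>
  <version>1.11.2</version>
</dependency>
```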

Besides the Kafka connector, which covers the requirement to consume/produce messages from/to the message log, it is necessary to use a deserializer that reads data serialized in the Avro format. At least two exist, AvroDeserializationSchema and ConfluentRegistryAvroDeserializationSchema; in our example the second one will be used, because it reads the schema registered in the Schema Registry and deserializes the message using the retrieved schema.

Note: deserializing records that don't match the retrieved schema will throw an exception.

The consumer is simple to create if only one event type is inserted per topic:

Kafka consumer definition
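A minimal sketch of such a consumer, assuming a SpecificRecord class named Record generated from the schema above and a topic called records:

```java
import java.util.Properties;

import com.example.events.Record; // class generated from the Avro schema (assumed namespace)
import org.apache.flink.formats.avro.registry.confluent.ConfluentRegistryAvroDeserializationSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

public class RecordConsumerJob {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092");
        properties.setProperty("group.id", "flink-consumer");

        // The deserializer fetches the writer schema from the Schema Registry
        FlinkKafkaConsumer<Record> consumer = new FlinkKafkaConsumer<>(
                "records",
                ConfluentRegistryAvroDeserializationSchema.forSpecific(Record.class, "http://localhost:8081"),
                properties);

        DataStream<Record> records = env.addSource(consumer);
        records.print();

        env.execute("record-consumer");
    }
}
```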

Similar to the consumer, the producer is as simple as:

Kafka producer definition
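A sketch of the sink side; ConfluentRegistryAvroSerializationSchema is only available in more recent Flink releases, and the subject and topic names are assumptions:

```java
import java.util.Properties;

import com.example.events.Record; // class generated from the Avro schema (assumed namespace)
import org.apache.flink.formats.avro.registry.confluent.ConfluentRegistryAvroSerializationSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

public class RecordProducerSink {

    // Attaches a Kafka sink that serializes Record events using the Schema Registry
    public static void addRecordSink(DataStream<Record> records) {
        Properties producerProps = new Properties();
        producerProps.setProperty("bootstrap.servers", "localhost:9092");

        FlinkKafkaProducer<Record> producer = new FlinkKafkaProducer<>(
                "records",
                ConfluentRegistryAvroSerializationSchema.forSpecific(
                        Record.class, "records-value", "http://localhost:8081"),
                producerProps);

        records.addSink(producer);
    }
}
```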

Is it possible to define multiple event types in the same topic?

There are two very interesting blog posts, 1) and 2), that explain the purpose of having a topic containing multiple event types instead of a topic with only one event type.

If you have a very rich system, and perhaps you are using Kafka for event sourcing, the need to have more than one type in a topic will arise.

Bonus: Flink Job reading multiple event types in a single topic

Slightly changing the previous approach to cope with this requirement, the interface KafkaDeserializationSchema, used to deserialize the messages from Kafka, will be implemented so that it becomes possible to deserialize multiple event types from the same topic.

Multiple event types deserializer
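A sketch of that deserialization schema, using Confluent's KafkaAvroDeserializer under the hood and producing GenericRecord values (class name and configuration are assumptions):

```java
import java.util.Collections;

import io.confluent.kafka.serializers.KafkaAvroDeserializer;
import org.apache.avro.generic.GenericRecord;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema;
import org.apache.kafka.clients.consumer.ConsumerRecord;

public class MultipleEventTypesDeserializationSchema implements KafkaDeserializationSchema<GenericRecord> {

    private final String schemaRegistryUrl;
    // KafkaAvroDeserializer is not serializable, so it is created lazily on the task managers
    private transient KafkaAvroDeserializer deserializer;

    public MultipleEventTypesDeserializationSchema(String schemaRegistryUrl) {
        this.schemaRegistryUrl = schemaRegistryUrl;
    }

    @Override
    public boolean isEndOfStream(GenericRecord nextElement) {
        return false;
    }

    @Override
    public GenericRecord deserialize(ConsumerRecord<byte[], byte[]> record) {
        if (deserializer == null) {
            deserializer = new KafkaAvroDeserializer();
            // false -> configure it as a value (not key) deserializer
            deserializer.configure(
                    Collections.singletonMap("schema.registry.url", schemaRegistryUrl), false);
        }
        // The writer schema is resolved through the schema id embedded in each message
        return (GenericRecord) deserializer.deserialize(record.topic(), record.value());
    }

    @Override
    public TypeInformation<GenericRecord> getProducedType() {
        return TypeInformation.of(GenericRecord.class);
    }
}
```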

Allowing multiple event types requires some changes in the consumer: instead of using the deserializer ConfluentRegistryAvroDeserializationSchema, KafkaAvroDeserializer will be used.

Flink consumer for multiple event types
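And a sketch of the consumer wired with it, reusing the env and properties from the single-type job above:

```java
FlinkKafkaConsumer<GenericRecord> consumer = new FlinkKafkaConsumer<>(
        "records",
        new MultipleEventTypesDeserializationSchema("http://localhost:8081"),
        properties);

DataStream<GenericRecord> events = env.addSource(consumer);
```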

After consuming and properly deserializing a valid event, the event types typeA and typeB will be transformed into our internal domain:

Model record
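A rough sketch of that mapping, keying off the Avro schema name to tell which event type arrived; the Model class and the name field are assumptions, the real domain types live in the repository:

```java
// Hypothetical internal domain type
public class Model {
    public String type;
    public String name;

    public Model() { }

    public Model(String type, String name) {
        this.type = type;
        this.name = name;
    }
}
```

and the mapping itself:

```java
DataStream<Model> models = events
        .map(record -> new Model(
                record.getSchema().getName(),   // "typeA" or "typeB"
                record.get("name").toString())) // assumed common field
        .returns(Model.class);
```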

After the conversion to our internal domain, and depending on the requirements of the job, transformations like map, filter, reduce, and aggregations can be applied in order to satisfy our needs.

This is an example to show you how to take advantage of a well-defined system, preventing producers/consumers from generating bad events and breaking your architecture. Having a catalog of events isn't easy, but you need to take small steps to avoid future headaches.

Kafka producer

To test the Flink job, you can produce messages using the same tooling as previously, kafka-avro-console-producer, but now you need to identify the type of event you are inserting, otherwise the message will fail to be produced.
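One way to do that is to point value.schema.id at the id under which the schema of that event type was registered (the id and topic name are assumptions):

```sh
kafka-avro-console-producer \
  --broker-list localhost:9092 \
  --topic records \
  --property schema.registry.url=http://localhost:8081 \
  --property value.schema.id=2
```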

Summary

The benefits of having a Schema Registry in our architecture are huge: it allows event types to evolve and lets us know exactly what's going on in our system.

If you are interested in knowing more about the JSON schema definition and the options available to combine multiple event types, I advise you to take a look at the documentation to clear up any doubts you may have; that is normal if it is the first time you are using this kind of definition.

All the source code can be found here
