- Falk Borgmann

Event-based data processing and integration – Part 1: The messaging concept of Apache Kafka

In this Kafka series, we look at the concepts and best practices around Apache Kafka. The first part covers the basic concept behind the technology: how it differs from other approaches and where it fits into the world of data streams and messaging solutions. We also describe the gaps that remain once the technology has been implemented and that should be taken into account in projects.
The concepts for moving data and information between computers, or for providing it to a person or service, have changed constantly. Even so, there are to date only two conceptual methods for transferring messages from A to B: point-to-point queuing and publish/subscribe. In their technical realization, the vendors' solutions differ significantly. Every integration system or software solution has both strengths and weaknesses, and it is important to be aware of them. The gaps a product leaves behind must be known and, depending on the use case, closed where necessary, for example through additional software or organizational procedures.

© Deepshore GmbH/sense:ability communications GmbH – Andreas Otto

Point-to-Point Queuing
The first method is point-to-point communication. A message can be pictured as a sheet of paper or a folder on a stack of applications at a German public authority, which a clerk works through one at a time. This stack of paper resembles a technical message queue, in which data or information can only be processed sequentially. The same message can never be processed by two people or data recipients at the same time, because it physically exists only once in the queue and disappears from it as soon as it has been consumed.
Technically, we encounter this procedure in the classic JMS (Java Message Service) queues of various products and vendors. The advantage is that with one-to-one transmission it is easier to ensure complete delivery (transactionality), because defined quality and security mechanisms can be implemented specifically for that connection. In addition, data is processed asynchronously: the consumer does not have to process the data at the moment it is made available. One disadvantage is that the result is always a proprietary interface between one sender and one recipient. If a message is intended for several recipients at once, it has to be copied and delivered to each of them. In practice, EAI or ESB systems (Enterprise Application Integration; Enterprise Service Bus) often take over this integrative task in order to close technical gaps in the process.
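The queue behavior described above can be illustrated with a minimal in-memory sketch. It is not JMS and not any product's API; the class name PointToPointQueue and the clerk variables are hypothetical and exist only for this illustration. The point it shows is that a message leaves the queue the moment one consumer receives it.

```python
from collections import deque


class PointToPointQueue:
    """Toy in-memory queue (hypothetical, not a JMS API)."""

    def __init__(self):
        self._messages = deque()

    def send(self, message):
        # New messages go to the end of the stack of paper.
        self._messages.append(message)

    def receive(self):
        # The oldest message is removed and handed to exactly one caller.
        # Returns None when the queue is empty.
        return self._messages.popleft() if self._messages else None


queue = PointToPointQueue()
for application in ["application-1", "application-2", "application-3"]:
    queue.send(application)

# Two clerks take turns; no application can end up on both desks.
clerk_a = [queue.receive()]
clerk_b = [queue.receive()]
clerk_a.append(queue.receive())
leftover = queue.receive()  # nothing is left once everything was consumed
```

If a second recipient also needed "application-1", the sender would have to put a copy into a second queue, which is the duplication effort mentioned above.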

Publish/Subscribe
The counterpart to point-to-point queuing is called publish/subscribe (pub/sub for short). Imagine a group of people sitting in a circle while one person announces a message loudly enough for everyone to hear. A message can thus be received and processed by several recipients at the same time. We all know this pattern from a radio program or a notice on a bulletin board. The difference from JMS-style queuing is obvious: a message can reach many recipients at once, but the sender does not know whether anyone received it, or who, because there is no channel for feedback from the recipients. This lack of feedback makes simple transaction monitoring difficult, since it is never clear who received which message. Asynchronous, i.e. delayed, processing of data is also not part of pure pub/sub, since all consumers are "informed" immediately and simultaneously.
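A minimal sketch of pure pub/sub, again as an in-memory toy and not as any real broker's API, makes the two limitations visible. The class name Broadcaster and the inbox lists are assumptions made for this example. Nothing is stored and nothing is acknowledged, so a subscriber that joins late simply misses earlier messages.

```python
class Broadcaster:
    """Toy publish/subscribe hub (hypothetical, no real messaging API)."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, message):
        # Every subscriber registered right now gets the same message.
        # The message is not kept, and the sender gets no feedback.
        for callback in self._subscribers:
            callback(message)


early_inbox, late_inbox = [], []
radio = Broadcaster()

radio.subscribe(early_inbox.append)
radio.publish("news at 8")

radio.subscribe(late_inbox.append)  # tunes in after the first broadcast
feedback = radio.publish("news at 9")
```

Here early_inbox ends up with both broadcasts, late_inbox only with the second, and feedback is None: the sender learns nothing about who listened.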

Messaging with Kafka
Kafka is essentially a technical blend of the two methods outlined above. Data is written in serialized form to a so-called topic. A whole group of data consumers can in turn use a topic without competing with one another. The raw data is simply read from a log, so the read can be repeated as often as desired. Kafka appends new messages to the existing log as soon as they arrive in the system. Because the data can be divided into different topics, differentiated processing is also possible: closed consumer groups can be set up for different types of data, each with access only to certain messages. It is precisely this concept of topics and their consumer groups that makes Kafka a hybrid of dedicated queuing and publish/subscribe.
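The hybrid can be sketched as an append-only log in which every consumer group keeps its own read position. This is a deliberately simplified in-memory model, not the Kafka client API: the class TopicLog and its methods append, poll and rewind are hypothetical names, and details of real Kafka such as partitions or how members share work inside one group are left out.

```python
from collections import defaultdict


class TopicLog:
    """Toy model of one topic: an append-only log plus a read position
    per consumer group (hypothetical, not the Kafka API)."""

    def __init__(self):
        self._records = []
        self._offsets = defaultdict(int)  # group name -> next index to read

    def append(self, record):
        # New records are only ever added at the end of the log.
        self._records.append(record)

    def poll(self, group):
        # Return what this group has not read yet and advance its position.
        # Reading does not remove anything from the log.
        start = self._offsets[group]
        batch = self._records[start:]
        self._offsets[group] = len(self._records)
        return batch

    def rewind(self, group):
        # Reset the group's position so the log can be read again.
        self._offsets[group] = 0


orders = TopicLog()
orders.append("order-1")
orders.append("order-2")

billing_first = orders.poll("billing")
shipping_first = orders.poll("shipping")  # sees the same records as billing

orders.append("order-3")
billing_second = orders.poll("billing")   # only the new record

orders.rewind("shipping")
shipping_replay = orders.poll("shipping")  # the whole log again
```

Both groups receive every record, as in pub/sub, yet each group works through the log at its own pace and can repeat the read, which pure pub/sub cannot offer.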

Whether Kafka is suitable for a specific use case depends on what exactly one wants to achieve and on the requirements in detail. Kafka offers the full potential of a distributed system, such as horizontal scaling. Its performance has been demonstrated in practice at companies such as LinkedIn, Twitter and PayPal. However, it is important to be aware of its conceptual weaknesses as well, or rather of the gaps it leaves. It is therefore advisable to have at least understood the CAP theorem and how it relates to ACID transactions, so as not to build gap-ridden, deficient IT solutions on a promising technology.

Preview of Part 2
In the next part of the series, we will look at typical application scenarios and work out Kafka's strengths. Where there is light, there is also shadow, so we will also look at what the system is not suited for out of the box and which gaps it leaves that have to be filled elsewhere. All too often, implementations and projects fail because the limits and weaknesses of a technology are not understood. It is therefore worthwhile to broaden the perspective to include not only the technical but also the business and regulatory point of view.