Kafka as Data Integration solution — Part 1 — Transaction Order

Werner Daehn
Mar 25, 2021

When I was designing Data Integration products, the options were always clear: either you go for high performance, meaning multiple parallel streams of disconnected sessions (ETL), or you honor the transaction order, which makes sequential processing a must (Realtime data integration). It's logical.


Kafka impact

Then came Kafka and said: "You are wrong." In the Kafka world, these two cases are only the edge cases, with a whole lot of options in between.

Example 1: In your database, two independent applications are running, a finance and an HR application. There is not a single transaction spanning both applications. In that case, the data can be read in parallel, one stream per application.

In other words, the transactional guarantees you need depend on the actual transactions being executed and on the queries run against the created data. If the shipping backlog is an important KPI, the order entries and the shipments had better be in the correct order. The billing documents, however, can be looked at separately.

Example 2: What happens when two customers create orders but the data is loaded in the wrong order? Does that really matter? As long as each individual customer's orders are processed in the correct order (think of an order and its reversal), the answer is that likely nobody will ever notice in their queries.

Transactional order in Kafka

The way Kafka is built, an ordering guarantee is provided within a single topic partition. The logical consequence is that all data for which a transactional order must be maintained must also go through a single topic partition.

If we use SAP ERP as an example, there are the tables KNA1, VBPA, VBAK and VBAP for Customer, Customer Partner Role, Order Header and Order Line Item. Requirement #1 is that all data of any of these tables goes into a single topic, let's call it SalesTopic, and that this topic has a partitioning function based on the customer number, so that all changes of a specific customer end up in the same partition.
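One way to realize such a partitioning function is simply to use the customer number as the record key, since Kafka's default partitioner hashes the key to pick the partition. A minimal sketch of that idea follows; the local broker address, the use of plain strings instead of Avro, and the customer number are assumptions for illustration only.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SalesTopicProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumption: local broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String customerNumber = "0000100001";            // hypothetical customer key

            // All change records of this customer, regardless of source table
            // (VBAK, VBAP, ...), are keyed by the customer number. The default
            // partitioner hashes the key, so they all land in the same partition
            // of SalesTopic and keep their relative order.
            producer.send(new ProducerRecord<>("SalesTopic", customerNumber,
                    "VBAK: new order header"));
            producer.send(new ProducerRecord<>("SalesTopic", customerNumber,
                    "VBAP: new order line item"));
            producer.flush();
        }
    }
}
```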

Is this a problem for Kafka? No. Partition functions are a common setting, and one topic can handle different kinds of data easily. The schema information is not tied to the topic but is part of the payload. In the case of Avro messages, the payload consists of a magic byte, the schema id and the Avro-encoded message using that schema. The mapping between schema id and schema definition is stored in the schema registry service.
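As a sketch of how this can be configured with Confluent's Avro serializer, which produces exactly the magic byte / schema id / Avro payload layout described above: switching the subject naming strategy away from the default (one schema subject per topic) to TopicRecordNameStrategy is one way to let several record types share the same topic. The broker and schema registry addresses below are assumptions.

```java
import java.util.Properties;

import io.confluent.kafka.serializers.KafkaAvroSerializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class MultiSchemaProducerConfig {
    // Builds producer properties for a topic carrying several Avro record types.
    public static Properties build() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");            // assumption: local broker
        props.put("schema.registry.url", "http://localhost:8081");   // assumption: local registry
        props.put("key.serializer", StringSerializer.class.getName());
        // The Avro serializer writes the payload as described above:
        // magic byte, 4-byte schema id, then the Avro-encoded record.
        props.put("value.serializer", KafkaAvroSerializer.class.getName());

        // The default strategy registers one subject per topic ("<topic>-value"),
        // which effectively enforces one schema per topic. TopicRecordNameStrategy
        // registers one subject per record type per topic, so order headers,
        // line items, customers etc. can all share SalesTopic.
        props.put("value.subject.name.strategy",
                "io.confluent.kafka.serializers.subject.TopicRecordNameStrategy");
        return props;
    }
}
```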

Nothing must be changed in Kafka to support that; it has worked since day one. Confluent only started to talk about this option recently. The reason we are so used to the rule of one topic = one schema is the limitations Kafka Connect imposes. But even here, the producers support it, which does not help much as long as the Kafka Connect sinks cannot. :sigh:

So in that sense, the ETL approach is one topic for each source table with a round-robin partition function, while the Realtime replication scenario would be one topic with a single partition. Both are valid cases, but they are only the two extremes; there are a whole lot of options in between.
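Expressed as configuration, the two extremes could look roughly like the sketch below: a single-partition topic for the replication case, and Kafka's round-robin partitioner for the ETL case where order does not matter. The topic name "ReplicationTopic" and the broker address are assumptions.

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.RoundRobinPartitioner;

public class TwoExtremes {
    public static void main(String[] args) throws Exception {
        Properties adminProps = new Properties();
        adminProps.put("bootstrap.servers", "localhost:9092");   // assumption: local broker

        // Realtime replication extreme: one topic with a single partition keeps
        // the global transaction order, at the price of no parallelism.
        try (AdminClient admin = AdminClient.create(adminProps)) {
            admin.createTopics(Collections.singletonList(
                    new NewTopic("ReplicationTopic", 1, (short) 1))).all().get();
        }

        // ETL extreme: order does not matter, so spread the load across all
        // partitions with a round-robin partitioner (serializers and
        // bootstrap.servers would be added as in the earlier sketch).
        Properties producerProps = new Properties();
        producerProps.put("partitioner.class", RoundRobinPartitioner.class.getName());
    }
}
```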

Hence one word of advice: whatever solution you pick in the context of Kafka to produce or consume data, make sure it either supports multiple event types in a single topic or that there are plans to do so, as in the case of Kafka Connect. It is easy for the vendor to implement, and without it the data will never be consistent for the consumers and in the target systems.
