Kafka as Data Integration solution — Part 3 — Gift horse
When reading data from a source system, would you throw away easily available information? Of course not. So why does it happen so often?
Again, compared with the supposed gold standard, Kafka Connect, a lot of information goes unused, mostly around metadata.
First of all, a larger Data Integration scenario will have lots of sources, targets and services, and it is easy to get lost. It makes sense for every component to publish where it gets its data from, which targets it writes to and which schemas it uses.
In this example, RuleService1, a Kafka Streams (KStream) service, announces that it connects the SALES topic to the SALES_CLEANSED topic. This information is stored in an additional Kafka metadata topic and can be read to visualize how everything is connected.
The same applies to the landscape: where is each service hosted? My Kafka cluster is hosted in Confluent Cloud; the ERP Producer reads from a source database called Connection1. This is all trivial information, but in a real-life project it is extremely valuable.
None of this information is provided by Kafka Connect, but it should be.
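To make the idea concrete, here is a minimal sketch of how a service could announce its own lineage and hosting information to a dedicated metadata topic. The topic name _pipeline_metadata and the JSON layout are my own assumptions, not an existing Kafka or Kafka Connect convention.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

// Sketch: a service publishes its own lineage record on startup.
public class LineagePublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Illustrative payload describing what the service reads and writes.
        String lineage = "{"
            + "\"service\":\"RuleService1\","
            + "\"type\":\"KStream\","
            + "\"input_topics\":[\"SALES\"],"
            + "\"output_topics\":[\"SALES_CLEANSED\"],"
            + "\"host\":\"confluent-cloud\""
            + "}";

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // One record per service, keyed by service name, so the latest
            // self-description wins and the topic can be log-compacted.
            producer.send(new ProducerRecord<>("_pipeline_metadata", "RuleService1", lineage));
        }
    }
}
```

Keying by service name and log-compacting the topic means it always holds the latest self-description of every component, and a small UI can render the complete data flow from it.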
Database → Kafka → Database
I have the same problem with the schema information. To illustrate it, let's use Kafka Connect to read one database table into Kafka and a second connector to write the data 1:1 into another database. The source connector reads the table structure and, if configured correctly, will use a schema with INT, STRING, DECIMAL, etc. data types. And here is the problem: what was a VARCHAR(1) in the database is a STRING in the schema, so which data type will the writer use? All it sees is a STRING of arbitrary length, and the only safe target data type is an NCLOB, just to store a single character!
What I did was create Avro logical data types for all the common database data types. Consuming applications still see an Avro string, but the schema definition carries the extra information that this string is actually a VARCHAR(1), so the loader can use the correct data type in the target. If the data type changes, e.g. to an NVARCHAR(1), that is a normal schema evolution process, and the loader can alter the target table automatically.
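A minimal sketch of what such a field definition could look like, assuming a logical type called varchar with a maxLength property; both names are illustrative, not part of the Avro specification. Avro's parser keeps unknown logical types as plain schema properties, so consumers that do not care still see an ordinary string.

```java
import org.apache.avro.Schema;

public class VarcharSchemaExample {
    public static void main(String[] args) {
        // Hypothetical field definition: an Avro string that remembers it was a VARCHAR(1).
        String schemaJson = "{"
            + "\"type\":\"record\",\"name\":\"SalesRow\",\"fields\":["
            + "{\"name\":\"REGION_CODE\",\"type\":"
            + "{\"type\":\"string\",\"logicalType\":\"varchar\",\"maxLength\":1}}"
            + "]}";
        Schema schema = new Schema.Parser().parse(schemaJson);
        Schema field = schema.getField("REGION_CODE").schema();

        // A target loader can read the extra properties back and create a
        // VARCHAR(1) column instead of an unbounded text/NCLOB column.
        System.out.println(field.getProp("logicalType"));      // varchar
        System.out.println(field.getObjectProp("maxLength"));  // 1
    }
}
```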
I have proposed these logical data types to the Avro project as well, and while they sparked some interest, I am not sure my pull request will be accepted anytime soon.
But that is no problem: logical types can be implemented easily. It would just make sense to standardize their naming and semantics.
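Implementing such a logical type in Java takes only a few lines. A sketch, reusing the hypothetical varchar and maxLength names from the example above:

```java
import org.apache.avro.LogicalType;
import org.apache.avro.LogicalTypes;
import org.apache.avro.Schema;

// Sketch of a custom "varchar" logical type; the names are illustrative.
public class VarcharLogicalType extends LogicalType {

    private final int maxLength;

    public VarcharLogicalType(Schema schema) {
        super("varchar");
        this.maxLength = ((Number) schema.getObjectProp("maxLength")).intValue();
    }

    public int getMaxLength() {
        return maxLength;
    }

    @Override
    public void validate(Schema schema) {
        super.validate(schema);
        if (schema.getType() != Schema.Type.STRING) {
            throw new IllegalArgumentException("varchar can only annotate string schemas");
        }
        if (maxLength <= 0) {
            throw new IllegalArgumentException("maxLength must be positive");
        }
    }

    public static void register() {
        // Make Avro return this logical type whenever it sees "logicalType": "varchar".
        LogicalTypes.register("varchar", VarcharLogicalType::new);
    }
}
```

Once registered, LogicalTypes.fromSchema() returns a VarcharLogicalType for annotated fields, and a loader can call getMaxLength() to create VARCHAR(1) instead of an unbounded text column.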
Record routing and transformations
The last bit of information is metadata at the record level. When a record is read, it comes with meta information as well:
- What was the source system?
- What was the source transaction that triggered that change?
- When was the record read by the Connector?
- What is the rowid of that record in the source?
- …
All of this is information that helps with auditing, increases trust in the processes and serves documentation purposes. In the past, storing it would have been too expensive for a database. But Kafka? A data lake storage? No problem.
Hence all my schemas contain these extra fields to store that information. Might be a good idea for your projects as well!
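A sketch of what such a record schema could look like, built with Avro's SchemaBuilder; the audit field names are illustrative, so pick whatever naming convention fits your project.

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class AuditFieldsExample {
    public static void main(String[] args) {
        Schema schema = SchemaBuilder.record("SalesRow").fields()
            .requiredLong("ORDER_ID")                 // the actual business payload
            // audit metadata carried with every record, all optional
            .optionalString("__source_system")        // e.g. the ERP system behind Connection1
            .optionalString("__source_transaction")   // transaction that triggered the change
            .optionalLong("__extract_timestamp")      // when the connector read the record
            .optionalString("__source_rowid")         // rowid of the record in the source
            .endRecord();

        System.out.println(schema.toString(true));
    }
}
```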