My issues with Kafka Connect
The problems I found fall into two areas: one is the Kafka Connect framework itself, the other the implementations of most Connectors.
Kafka Connect framework
- The source record is converted into a Kafka Connect record, which is then converted into an Avro record. That is unnecessary overhead and does not utilize the Avro functionality to its full extent (see the conversion sketch after this list).
- Kafka Connect should not be a cluster of its own; it should be multiple instances of the same container. That way, dynamic load balancing and automatic adjustment of the number of nodes would be possible. For consumers this is no problem: when Kubernetes decides to add or remove an instance due to the current load, a rebalance is triggered in Kafka, automatically distributing the work evenly (see the consumer sketch after this list). For producers there is no similar feature available; it was requested in 2019, though.
- Kafka Connect cannot consume from topics with multiple schemas, yet in Kafka a guaranteed order can only be achieved within a single topic partition, so records of different types must share a topic whenever their relative order matters.
- The producer partitioning logic is based on Kafka partitions, but it must be based on source partitions. Imagine you read from a source table that consists of 100 database partitions, one per sales region. How can producer instance #7 know which database partitions it should read (see the assignment sketch after this list)?
- Kafka Connect is configured via properties files, curl calls and the like (see the REST sketch after this list). That is not acceptable for a competitive product; it requires a WebUI where not only individual settings but whole configurations are maintained. Take for example a CSV producer, where you preview a file, define the code page, column and row separator, escape char, quote char, headings, which column is a datetime, … and where the impact of these settings can be previewed online before starting to produce the data. Something like this.
- Producers do not provide metadata about the source, e.g. a more precise data type like NVARCHAR(10). The only information in the schema is that this is a String data type, and a consumer has no choice other than storing it in an NCLOB, a database performance nightmare (see the metadata sketch after this list). I provide helper methods for Avro via github here.
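
To make the double conversion concrete, here is a minimal sketch of the path a single database row takes in a source pipeline, assuming Confluent's AvroConverter on the classpath and a Schema Registry running at localhost:8081; the topic and field names are invented for illustration:

```java
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;
import io.confluent.connect.avro.AvroConverter;

import java.util.Map;

public class DoubleConversionSketch {
    public static void main(String[] args) {
        // Step 1: the source task wraps the database row in a Kafka Connect record
        Schema schema = SchemaBuilder.struct().name("customer")
                .field("id", Schema.INT64_SCHEMA)
                .field("name", Schema.STRING_SCHEMA)
                .build();
        Struct row = new Struct(schema)
                .put("id", 42L)
                .put("name", "Alice");

        // Step 2: the worker's converter turns the Connect record into Avro bytes,
        // a second conversion of the very same data
        AvroConverter converter = new AvroConverter();
        converter.configure(Map.of("schema.registry.url", "http://localhost:8081"), false);
        byte[] avroBytes = converter.fromConnectData("customers", schema, row);
        System.out.println("Serialized " + avroBytes.length + " bytes");
    }
}
```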
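For the consumer side of the scaling argument, a plain consumer that joins a group via subscribe() is all it takes; Kafka redistributes the partitions whenever Kubernetes adds or removes an identical instance. A minimal sketch, with broker address, group id and topic name made up:

```java
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ScalableConsumerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");
        props.put("group.id", "sales-loader"); // all instances share one group id
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // subscribe() joins a consumer group: adding or removing an instance
            // triggers a rebalance and the partitions are distributed automatically
            consumer.subscribe(List.of("sales"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.println(r.value()));
            }
        }
    }
}
```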
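For the source-partition point, here is a minimal sketch of the kind of deterministic assignment a producer instance would need, assuming it somehow knows its own index and the total number of instances; exactly that knowledge is what Kafka does not provide on the producer side today:

```java
import java.util.ArrayList;
import java.util.List;

public class SourcePartitionAssignment {
    /**
     * Assign database partitions (e.g. 100 sales regions) to producer instances
     * round-robin, so instance #7 of 10 knows it owns partitions 7, 17, 27, ...
     */
    static List<Integer> assignedPartitions(int instanceIndex, int instanceCount, int sourcePartitions) {
        List<Integer> owned = new ArrayList<>();
        for (int p = 0; p < sourcePartitions; p++) {
            if (p % instanceCount == instanceIndex) {
                owned.add(p);
            }
        }
        return owned;
    }

    public static void main(String[] args) {
        System.out.println(assignedPartitions(7, 10, 100)); // [7, 17, 27, ...]
    }
}
```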
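And this is what configuration looks like today: posting raw JSON to the Connect REST API, with no way to preview the effect of any setting. The sketch below uses the JDK HTTP client; the connector name, host names and column name are made up, while the config keys are those of the Confluent JDBC source connector:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class CreateConnectorSketch {
    public static void main(String[] args) throws Exception {
        // Creating a connector today means POSTing a JSON document to the REST API
        String body = """
                {
                  "name": "jdbc-orders-source",
                  "config": {
                    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
                    "connection.url": "jdbc:postgresql://dbhost/sales",
                    "mode": "timestamp",
                    "timestamp.column.name": "last_changed",
                    "topic.prefix": "db-"
                  }
                }""";
        HttpRequest request = HttpRequest.newBuilder(URI.create("http://localhost:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```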
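One possible way to carry source metadata, not necessarily what my helper methods do, is to attach the precise database type as a custom property on the Avro field; the property name source.datatype is an arbitrary choice for this metadata sketch:

```java
import org.apache.avro.Schema;
import org.apache.avro.SchemaBuilder;

public class SourceTypeMetadataSketch {
    public static void main(String[] args) {
        // Attach the precise source data type as a custom property on the Avro field,
        // so a database consumer can create an NVARCHAR(10) column instead of an NCLOB
        Schema schema = SchemaBuilder.record("customer").fields()
                .name("firstname").prop("source.datatype", "NVARCHAR(10)")
                    .type().stringType().noDefault()
                .endRecord();

        System.out.println(schema.getField("firstname").getProp("source.datatype")); // NVARCHAR(10)
    }
}
```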
Shortcomings in certified Kafka Connectors
- Many database producers do not support efficient CDC mechanisms to find out which database records have been changed. The JDBC producer, for example, requires a timestamp column that is guaranteed to be set on every DML operation and that no deletes ever happen. That applies to probably zero tables in a normal source database.
- Most database Consumers cannot deal with nested data in Kafka; they expect a flat relational record. But a nested data model can be converted into a relational model by using multiple tables and foreign key relationships (see the flattening sketch after this list).
- Most database Consumers do not support schema evolution. If a column gets added to the Kafka schema, they should add a column to the target table (see the evolution sketch after this list).
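
As a sketch of the nested-to-relational conversion: an order with an array of line items can be split into a parent row and child rows that carry the parent key as a foreign key. The table and field names are invented for illustration:

```java
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;
import org.apache.kafka.connect.data.Struct;

import java.util.List;

public class FlattenNestedRecordSketch {
    public static void main(String[] args) {
        Schema lineSchema = SchemaBuilder.struct().name("order_line")
                .field("pos", Schema.INT32_SCHEMA)
                .field("material", Schema.STRING_SCHEMA)
                .build();
        Schema orderSchema = SchemaBuilder.struct().name("order")
                .field("order_id", Schema.INT64_SCHEMA)
                .field("lines", SchemaBuilder.array(lineSchema).build())
                .build();

        Struct order = new Struct(orderSchema)
                .put("order_id", 1001L)
                .put("lines", List.of(
                        new Struct(lineSchema).put("pos", 1).put("material", "M-100"),
                        new Struct(lineSchema).put("pos", 2).put("material", "M-200")));

        // Parent row goes into an ORDERS table ...
        System.out.printf("ORDERS: order_id=%d%n", order.getInt64("order_id"));
        // ... child rows go into an ORDER_LINES table carrying the parent key as foreign key
        for (Object o : order.getArray("lines")) {
            Struct line = (Struct) o;
            System.out.printf("ORDER_LINES: order_id=%d, pos=%d, material=%s%n",
                    order.getInt64("order_id"), line.getInt32("pos"), line.getString("material"));
        }
    }
}
```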
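And a sketch of what schema evolution in a sink could look like: compare the fields of the incoming Kafka schema with the columns the target table already has and derive the missing ALTER TABLE statements. In a real sink the existing columns would come from the database catalog; here they are hard-coded for illustration:

```java
import org.apache.kafka.connect.data.Field;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaBuilder;

import java.util.Set;

public class SchemaEvolutionSketch {
    // Columns the target table currently has (normally read from the database catalog)
    static final Set<String> EXISTING_COLUMNS = Set.of("id", "name");

    public static void main(String[] args) {
        // A new field "birthday" appeared in the Kafka schema
        Schema kafkaSchema = SchemaBuilder.struct().name("customer")
                .field("id", Schema.INT64_SCHEMA)
                .field("name", Schema.STRING_SCHEMA)
                .field("birthday", Schema.OPTIONAL_STRING_SCHEMA)
                .build();

        // A sink that supports schema evolution would react with an ALTER TABLE
        // instead of failing the connector
        for (Field field : kafkaSchema.fields()) {
            if (!EXISTING_COLUMNS.contains(field.name())) {
                System.out.println("ALTER TABLE customer ADD COLUMN " + field.name() + " ...");
            }
        }
    }
}
```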
In short, Kafka Connect works well in stable, technology-driven teams and for scenarios where, e.g., data from MQTT is written into a Data Lake.
But it is not well suited for Data Integration scenarios like database to database.
And now?
Assuming the above is true, what are the consequences and what is the solution? For me personally it was to build another framework, more targeted towards the Data Integration use case. I would have loved to use Kafka Connect but couldn’t. This framework is available as open source here.
Having said that, I really hope Kafka Connect will incorporate some of my ideas.