Streaming CSV files to Kafka — The easy way

Werner Daehn
3 min read · Nov 18, 2020

For loading CSV files into Kafka, a myriad of options exist. Most are very technical or make assumptions about the file. Here I would like to show another option, one that provides a graphical user interface and is built by a long-time data integration expert.

File configuration, first tab

When loading CSV files into Kafka, a couple of decisions have to be made:

  • Who creates the Avro schema? Does it exist already or should the schema match the CSV file definition?
  • Loading a file once, or loading files the instant they appear? How is a file marked as already loaded so it is not loaded again?
  • Which settings apply at the file level, e.g. the character set encoding: ASCII, UTF-8, UTF-16, an 8-bit codepage of country xyz?
  • What formatting is used: column separator, escape characters, is text quoted, …?
  • And of course the per-column settings: column names, data types, the conversion from a string into a timestamp-millis logical Avro type, and the like (a minimal sketch follows this list).
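
To make the last point concrete, here is a minimal, made-up Avro schema (record and field names are invented, not produced by the tool) in which a date column read as text from the CSV file ends up as a timestamp-millis logical type:

{
  "type": "record",
  "name": "PlanRecord",
  "fields": [
    {"name": "ID", "type": "int"},
    {"name": "NAME", "type": "string"},
    {"name": "PLAN_DATE", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}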

The assumption of this method is that files of the same format follow a file name pattern, e.g. load all plan_*.txt files, and the instant a new file appears it is loaded and then renamed by appending “.processed” to its name. If tomorrow a new “plan2020.txt” is uploaded, it ends up in Kafka seconds later. To be more precise, each schema supports a filename regexp pattern. In the screenshot above the pattern is “address.*”, hence that schema parses all files with address data in them.

Another set of files, containing the actual data and matching the “actual_.*\.txt” pattern, has a different structure and hence its own schema definition.
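
Purely as an illustration of the pattern matching (directory and file names are invented), the same landing directory can feed both schemas:

ls /data/landing | grep -E '^plan_.*'           # picked up by the plan schema
ls /data/landing | grep -E '^actual_.*\.txt$'   # picked up by the actual-data schema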

Due to the number of settings for CSV files and their dependencies, coming up with the correct flags can be cumbersome. For simple demo cases that is no problem. But when the column separator is specified as “,” and some data comes out wrong because the NAME field contains a comma as part of the payload, more feedback is needed. That is why the UI exists: it provides immediate visual information about the parsing result. If the file was written on an old system with an uncommon codepage, the user can see the result right away. Either the text makes sense or it does not.
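
A tiny made-up sample illustrates the point: with “,” as the separator, the first record only parses correctly because the NAME field is quoted.

ID,NAME,CITY
1,"Miller, John",Berlin
2,Smith,Hamburg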

As said, this is no problem for modern systems; they all use UTF-8 or encode the Unicode variant in the first few bytes of the file. Modern systems rarely resort to CSV files though, it is the legacy systems which do!

Similar UI feedback for all other config steps.

Note: In the end, all UI decisions are stored in JSON config files and Avro avsc schema files. This simplifies handling for the IT department.

Once the file schema is defined, it is either created 1:1 in the Kafka Schema Registry, or a mapping from the file schema into a target schema is defined in addition.
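
To double-check what was registered, the Schema Registry’s standard REST API can be queried; hostname, port and the subject name below are assumptions that depend on the setup:

curl http://schema-registry:8081/subjects                               # list all registered subjects
curl http://schema-registry:8081/subjects/plan-value/versions/latest    # latest version of an assumed subject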

Installation

As Kafka Connect has multiple intrinsic limitations, one of which is the lack of connector-specific UIs, this solution is provided as a docker image.

For a first quick demo, the commands

docker pull rtdi/fileconnector
docker run -d -p 80:8080 --rm --name fileconnector \
  rtdi/fileconnector

are sufficient. The docker pull makes sure the latest version is used, and the docker run line starts the container; the --rm flag removes it again after a docker stop command.
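
To watch the container output or shut it down again, the usual docker commands apply (the container name is the one set via --name above):

docker logs -f fileconnector   # follow the application log
docker stop fileconnector      # stops the container; --rm removes it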

As the docker run line maps the application to port 80, opening this page in the browser connects to the UI:

http://hostname:80/

The next step is to log in (default: rtdi / rtdi!io), configure the connectivity to Kafka and the schema registry, configure the connectivity to the file system, and then define the schemas and producer instances.

The details of these steps can be found in the GitHub repository: https://github.com/rtdi/FileConnector

As with any open-source project, issues, suggestions, and code changes are welcome in the GitHub repo.
