Werner Daehn
Jan 2, 2021

May I ask you a few questions, just to validate my thoughts?

1. Parquet partitioning, where multiple Parquet files are written into a directory structure: is that a Parquet feature or a Spark feature? If it belongs to Parquet, which writer class implements it?

2. Parquet is tightly coupled to the Hadoop client, which I don't like. I would like a Parquet library that relies only on the Java file APIs. The current project to remove the dependencies does not go far enough.

3. What are your thoughts on indexes, mainly B*Tree indexes for unique fields? I would add them to the Parquet page header.

Thanks!
