Jan 2, 2021
May I ask you a few questions, just to validate my thoughts?
1. Parquet partitioning, where multiple Parquet files are written into a directory structure: is that a Parquet feature or a Spark feature? If it is a Parquet feature, what is the writer class?
2. Parquet is tightly coupled to the Hadoop client, which I don't like. I would like a Parquet library that uses only the Java file APIs. The current project to remove the dependencies does not go far enough.
3. What are your thoughts on indexes, mainly B*Tree indexes for unique fields? I would add them to the Parquet page header.
Thanks!