Parquet Logical Type Support

@balaji.ramaswamy I think I’ve found how to recreate the issue.

I built commit d4c98e76e29d84e13508f6802df4acc5e38b1008 locally, enabled DEBUG output, and found that the file's partitioning columns were being reordered when the data was cleansed. Also, since store.parquet.partition_column_limit defaults to 25, the file's many partition columns get trimmed, and unluckily, in the data I most recently sent you, the trimming happened to remove the columns that were causing the issue.

So, the first thing to do is set store.parquet.partition_column_limit = 300. This should cover the file I’m attaching here that reproduces the error.
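If it helps, this is roughly how I set the key (assuming Dremio's support-key syntax; it can also be changed in the Admin UI under Support Keys):

```sql
-- Raise the partition-column limit so none of the repro file's
-- partition columns are trimmed away.
ALTER SYSTEM SET "store.parquet.partition_column_limit" = 300;
```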

Again, the issue comes from the com.dremio.exec.store.dfs.MetadataUtils.toPartitionValue method being unable to process a partition column that is a SMALLINT.

I think this is a bug, but I'll leave that up to the experts. If it is, I can make the necessary change by adding SMALLINT handling to the com.dremio.exec.store.dfs.MetadataUtils.toPartitionValue method, but I'm unsure what that might break. So, some guidance would be appreciated.

problem.snappy.parquet.zip (250.2 KB)

@balaji.ramaswamy I know you’re busy, but any thoughts on this?

Hi @mitchell.davis

It seems we do not support SMALLINT types. I tested via Hive, but Hive converts the type internally: parquet-tools shows the column's schema as INT. In your Parquet file the column shows as INT16, and that reproduces the error. I assume these files are generated by Spark. I will open an internal request on this.

Thanks
@balaji.ramaswamy

Thank you @balaji.ramaswamy.

Any update on a timeline for this fix?

Hi @mitchell.davis

We’re assessing the different options on how to handle this particular type. I’ll update you ASAP when we have a firmer timeline for a possible code fix.

Thanks,
@ben


Any update on this @ben?