dir0, dir1, etc. and partitioned datasets

Would it be possible to swap dir0, dir1, etc. for the partition directory names, and have that work with the query optimizer too? Both Spark and PyArrow can write Parquet files into partitioned directories.

Basically, instead of writing “where dir0 = ‘year=2007’”, can we substitute “where year = ‘2007’” in the example below?

https://arrow.apache.org/docs/python/parquet.html

Partitioned Datasets (Multiple Files)

Multiple Parquet files constitute a Parquet dataset. These may present in a number of ways:

- A list of Parquet absolute file paths
- A directory name containing nested directories defining a partitioned dataset

A dataset partitioned by year and month may look like this on disk:

dataset_name/
  year=2007/
    month=01/
      0.parq
      1.parq
    month=02/
      0.parq
      1.parq
    month=03/
  year=2008/
    month=01/
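
For illustration, a minimal PyArrow sketch (the table contents are made up; the column names match the layout above) that writes this kind of directory structure:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Made-up example data; year and month become the partition keys.
table = pa.table({
    "year":  [2007, 2007, 2008],
    "month": [1, 2, 1],
    "value": [10.0, 20.0, 30.0],
})

# Writes Hive-style year=.../month=.../ subdirectories under dataset_name/.
pq.write_to_dataset(table, root_path="dataset_name",
                    partition_cols=["year", "month"])
```

Spark’s df.write.partitionBy("year", "month").parquet(...) produces the same layout.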

@david.lee do you also include the partition field values in the files?
If this is the case:

When running queries with filters on Parquet-based datasets, if files each include only a single value for a field in the filter condition, Dremio will access and scan only the relevant files – even if there isn’t any explicit directory structure for partitioning. This is achieved by inspecting and caching Parquet file footers and using this information for partition pruning at query time.
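
As a hypothetical illustration of that case (not Dremio internals; file names and columns are made up): if each file keeps the partition field as a regular column and contains a single value for it, the footer’s min/max statistics identify the partition even without year=... directories:

```python
import os
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

# Made-up data; "year" stays inside each file as a regular column.
table = pa.table({
    "year": [2007, 2007, 2008],
    "value": [10.0, 20.0, 30.0],
})

os.makedirs("dataset_name", exist_ok=True)

# One file per year: each footer then records min(year) == max(year),
# so an engine that prunes on footer statistics can skip irrelevant
# files for a filter like "WHERE year = 2007".
for year in table.column("year").unique().to_pylist():
    subset = table.filter(pc.equal(table.column("year"), year))
    pq.write_table(subset, f"dataset_name/year_{year}.parquet")
```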

Otherwise, we do have some thoughts about adding a way for users to define partition data types and for Dremio to automatically detect partition names (given the x=y format). We’ll post an update here when we have more formal plans.

Both. PyArrow-generated datasets include the partition columns as subdirectories. Spark-generated Parquet datasets add the subdirectory names to the schema.

https://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
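
As a quick sketch of the read side (assuming the dataset_name/ layout from earlier in the thread), PyArrow likewise reattaches the directory names to the schema:

```python
import pyarrow.parquet as pq

# The year=.../month=.../ directory names are recovered from the file
# paths and exposed as ordinary columns in the resulting table.
dataset = pq.ParquetDataset("dataset_name")
table = dataset.read()
print(table.schema)  # includes "year" and "month" partition columns
```

In Spark, spark.read.parquet("dataset_name") does the same through the partition discovery described at the link above.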

Any news on this? PyArrow and Spark work seamlessly with columns as subfolders, and it’s pretty awkward to write things like dir0 = ‘year=2019’ in Dremio SQL statements.