Dir0, dir1 and partitioned datasets, etc

Would it be possible to swap dir0, dir1, etc. for partitioned directory names and also make it work with the query optimizer? Both Spark and PyArrow can write parquet into partitioned directories

Basically instead of writing “where dir0 = ‘year=2007’” can we substitute “where year = ‘2017’” in the example below…

https://arrow.apache.org/docs/python/parquet.html

Partitioned Datasets (Multiple Files)
Multiple Parquet files constitute a Parquet dataset. These may present in a number of ways:

A list of Parquet absolute file paths
A directory name containing nested directories defining a partitioned dataset
A dataset partitioned by year and month may look like on disk:

dataset_name/
year=2007/
month=01/
0.parq
1.parq

month=02/
0.parq
1.parq

month=03/

year=2008/
month=01/

1 Like

@david.lee do you also include the partition field values in the files?
If this is the case:

When running queries with filters on Parquet based datasets, if there are files that only include a single value for a field included in the filter condition, Dremio will access and scan only relevant files – even if there isn’t any explicit directory structure for partitioning. This is achieved by inspecting and caching Parquet file footers and using this information for partition pruning at query time.

Otherwise, we do have some thoughts around including a way for users to define partition data types and for Dremio to automatically detect partition names (given x=y format). We’ll let you know here when we have more formal plans around this.

Both. Pyarrow generated datasets include columns as subdirectories. Spark generated parquet files adds subdirectory names to the schema.

https://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery

Any news on this? Pyarrow and Spark work seamlessly with columns as subfolders, and it’s pretty awkward to use things like dir0=‘year=2019’ in Dremio Sql sentences

:+1: - very interested in improvement in this area as well

Is there any update on this? I’m very interested with an improvement on the query syntax.

@mg66

Currently the partition behavior has not changed

Thanks
@balaji.ramaswamy

Also interested in an update on this, currently I am restorting to syntax like

WHERE RIGHT(dir0, 6) > 201834 :sweat_smile:

instead of

WHERE yearweek > 201834 :smiling_face_with_three_hearts:

Also a problem that the partition column is excluded from the dataset, this makes connections to other software problematic, when we need filters on this column

Any solution to rename the partition column dir0 with actual partition column? If I substring the column extract the value and crate the virtual column what is the performance impact? Does it apply predicate filter, if I use the virtual column in my where clause?

@rammi16 Would you be able to map an external Hive table or attach to a Glue catalog? This way you can directly use the partition column

Sorry for the late reply, I did not try. we were trying to create the view on top of the actual table and mapped it accordingly.

@rammi16 For file system sources the format would be that, column name will be dir0, dir1 and so on and value would be something like “deptno=20”, but if you map to a Hive external table you can simply use where deptno=20

1 Like

@balaji.ramaswamy Thanks for the info. Are there any plans to align with Spark/Hive/Pandas/etc. ? What about enterprise edition?

@shay , For file system sources, there are currently no plans