Would it be possible to swap dir0, dir1, etc. for the partitioned directory names and also make that work with the query optimizer? Both Spark and PyArrow can write Parquet into partitioned directories.
Basically, instead of writing "where dir0 = 'year=2017'", can we substitute "where year = '2017'" in the example below…
Partitioned Datasets (Multiple Files)
A Parquet dataset consists of multiple Parquet files. These can be presented in a number of ways:
A list of Parquet absolute file paths
A directory name containing nested directories defining a partitioned dataset
A dataset partitioned by year and month may look like this on disk:
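For illustration (the original example tree is not shown here), a typical Hive-style year/month layout looks like this; the file and partition names are made up:

```
dataset/
├── year=2017/
│   ├── month=01/
│   │   └── part-0.parquet
│   └── month=02/
│       └── part-0.parquet
└── year=2018/
    └── month=01/
        └── part-0.parquet
```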
@david.lee do you also include the partition field values in the files?
If this is the case:
When running queries with filters on Parquet-based datasets, if some files contain only a single value for a field used in the filter condition, Dremio will access and scan only the relevant files, even without an explicit partitioned directory structure. This works by inspecting and caching Parquet file footers and using that information for partition pruning at query time.
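The idea behind footer-based pruning can be sketched in a few lines of Python. This is illustrative only, not Dremio's implementation; the file names and cached min/max statistics below are made up for the example:

```python
# Assume we have cached per-file column statistics (min/max) that were
# read once from each Parquet file's footer.
footer_stats = {
    "part-00.parquet": {"year": (2017, 2017)},  # file holds only year=2017
    "part-01.parquet": {"year": (2018, 2018)},  # file holds only year=2018
    "part-02.parquet": {"year": (2017, 2018)},  # mixed years: cannot prune
}

def prune(files, column, value):
    """Return only the files whose min/max range can contain `value`."""
    kept = []
    for name, stats in files.items():
        lo, hi = stats[column]
        if lo <= value <= hi:
            kept.append(name)
    return kept

# For `WHERE year = 2017`, part-01.parquet is skipped entirely because
# its footer says it contains only year=2018.
print(prune(footer_stats, "year", 2017))
```

A query engine only has to read the small footer of each file once, after which filters can eliminate whole files before any data pages are scanned.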
Otherwise, we do have some thoughts around including a way for users to define partition data types and for Dremio to automatically detect partition names (given x=y format). We’ll let you know here when we have more formal plans around this.
Any news on this? PyArrow and Spark work seamlessly with columns as subfolders, and it's pretty awkward to use things like dir0 = 'year=2019' in Dremio SQL statements.
Also interested in an update on this; currently I am resorting to syntax like
WHERE RIGHT(dir0, 6) > 201834
instead of
WHERE yearweek > 201834
It's also a problem that the partition column itself is excluded from the dataset; this makes connecting other software awkward when we need to filter on that column.
Is there any way to rename the partition column dir0 to the actual partition column name? If I substring the column to extract the value and create a virtual column, what is the performance impact? Does predicate pushdown still apply if I use the virtual column in my WHERE clause?
@rammi16 For file system sources, the column names will be dir0, dir1, and so on, and the values will be strings like "deptno=20". But if you map the data to a Hive external table, you can simply write where deptno = 20.
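Until partition-name detection lands, the "deptno=20" strings that come back in dir0 can at least be split client-side. A minimal Python sketch (the column name deptno and its values are hypothetical, matching the example above):

```python
# Hive-style partition directories encode both the column name and its
# value in a single string, e.g. 'deptno=20'. When Dremio surfaces these
# as dir0 values, the name/value pair can be recovered by splitting on
# the first '='.

def parse_partition(dir_value):
    """Split 'deptno=20' into ('deptno', '20')."""
    name, _, value = dir_value.partition("=")
    return name, value

dir0_values = ["deptno=10", "deptno=20", "deptno=30"]
pairs = [parse_partition(v) for v in dir0_values]
print(pairs)  # [('deptno', '10'), ('deptno', '20'), ('deptno', '30')]
```

The same split is what a `SUBSTR`-based virtual column does in SQL; note the extracted value comes back as a string, so it still needs a cast before numeric comparisons.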