Dir0, dir1 and partitioned datasets, etc

david.lee · June 19, 2018, 3:17pm

Would it be possible to swap dir0, dir1, etc. for partitioned directory names and also make it work with the query optimizer? Both Spark and PyArrow can write parquet into partitioned directories

Basically instead of writing “where dir0 = ‘year=2007’” can we substitute “where year = ‘2017’” in the example below…

https://arrow.apache.org/docs/python/parquet.html

Partitioned Datasets (Multiple Files)
Multiple Parquet files constitute a Parquet dataset. These may present in a number of ways:

A list of Parquet absolute file paths
A directory name containing nested directories defining a partitioned dataset
A dataset partitioned by year and month may look like on disk:

dataset_name/
year=2007/
month=01/
0.parq
1.parq
…
month=02/
0.parq
1.parq
…
month=03/
…
year=2008/
month=01/

can · June 19, 2018, 11:02pm

@david.lee do you also include the partition field values in the files?
If this is the case:

When running queries with filters on Parquet based datasets, if there are files that only include a single value for a field included in the filter condition, Dremio will access and scan only relevant files – even if there isn’t any explicit directory structure for partitioning. This is achieved by inspecting and caching Parquet file footers and using this information for partition pruning at query time.

Otherwise, we do have some thoughts around including a way for users to define partition data types and for Dremio to automatically detect partition names (given x=y format). We’ll let you know here when we have more formal plans around this.

david.lee · June 21, 2018, 10:37pm

Both. Pyarrow generated datasets include columns as subdirectories. Spark generated parquet files adds subdirectory names to the schema.

https://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery

forwardkeys_it · October 8, 2019, 11:38pm

Any news on this? Pyarrow and Spark work seamlessly with columns as subfolders, and it’s pretty awkward to use things like dir0=‘year=2019’ in Dremio Sql sentences

ddd · December 5, 2019, 7:36am

- very interested in improvement in this area as well

mg66 · July 15, 2020, 6:35pm

Is there any update on this? I’m very interested with an improvement on the query syntax.

balaji.ramaswamy · July 25, 2020, 9:22pm

@mg66

Currently the partition behavior has not changed

Thanks
@balaji.ramaswamy

nrlund · August 27, 2020, 1:44pm

Also interested in an update on this, currently I am restorting to syntax like

WHERE RIGHT(dir0, 6) > 201834

instead of

WHERE yearweek > 201834

Also a problem that the partition column is excluded from the dataset, this makes connections to other software problematic, when we need filters on this column

rammi16 · August 23, 2021, 10:49am

Any solution to rename the partition column dir0 with actual partition column? If I substring the column extract the value and crate the virtual column what is the performance impact? Does it apply predicate filter, if I use the virtual column in my where clause?

balaji.ramaswamy · August 24, 2021, 8:00am

@rammi16 Would you be able to map an external Hive table or attach to a Glue catalog? This way you can directly use the partition column

rammi16 · September 23, 2021, 7:59am

Sorry for the late reply, I did not try. we were trying to create the view on top of the actual table and mapped it accordingly.

balaji.ramaswamy · September 25, 2021, 10:50pm

@rammi16 For file system sources the format would be that, column name will be dir0, dir1 and so on and value would be something like “deptno=20”, but if you map to a Hive external table you can simply use where deptno=20

shay · March 14, 2022, 12:04pm

@balaji.ramaswamy Thanks for the info. Are there any plans to align with Spark/Hive/Pandas/etc. ? What about enterprise edition?

balaji.ramaswamy · March 25, 2022, 3:58am

@shay , For file system sources, there are currently no plans

Topic		Replies	Views
Dremio not using partition filter	3	164	May 19, 2024
Strange planner behavior (and performance consequences)	3	1352	September 3, 2019
Is it possible to access filename from query?	6	2585	December 18, 2023
Create table from custom partitions Dremio University	5	1900	October 2, 2020
Dynamic Partitioning and Parameterized Queries in Dremio: Capabilities and Functionality	3	140	September 25, 2024

Dir0, dir1 and partitioned datasets, etc

Related topics