Dynamic partition pruning

igreg · February 8, 2019, 7:06pm

Does Dremio support dynamic partition pruning on reflections?

I have a virtual dataset called “my_detail_view” which has a raw reflection on it that is partitioned by effective_date.

In the below query the values for effective_date are supplied explicitly and judging by the Input Record counts on the Job overview, dremio loads only the data for the partitions with the effective_date ‘20190118’, ‘20190117’ which is an expected number of records.

	select distinct r.effective_date, r.source
	from "NAS_POC".my_detail_view r
	where r.effective_date in ('20190118', '20190117')

Input Bytes: 573.60 MB
Input Records: 25,934,501

However, in the the below query, the value of the effective_date is retrieved dynamically from a subquery (which returns values ‘20190118’, ‘20190117’). In this case however, dremio loads all the data into memory rather than just for the partitions which utilizes significantly more memory and takes longer to run.

	select distinct r.effective_date, r.source
	from "NAS_POC".my_detail_view r
	where r.effective_date in (select effective_date from NAS_POC."dates")

Input Bytes:	4.78 GB
Input Records:	221,172,150

laurent · February 8, 2019, 7:54pm

I’m afraid not at the moment. As partition pruning is done during planning, partition values have to be known before execution.

igreg · February 13, 2019, 7:20pm

hi @laurent

Is this a feature that is planned to be implemented in dremio at some point? Is there an alternative way you would suggest?

Our use case is running a report from tableau which needs data from a dataset for a dynamic set of dates based on another query. Even though we created a reflection with a partition on the effective_date, dremio will not take advantage of it in this case and loads all records into memory. The effect is that more memory is needed for the query to run and sometimes hits memory limits we have.

laurent · February 21, 2019, 7:23pm

It might be implemented at some point, although it’s not on our immediate roadmap at the moment. Unfortunately I don’t have an alternative to offer you.

sbyte · May 6, 2024, 2:55pm

I have very similar use case, where most of my users are doing time-series analysis. My data is stored in ADLS, partitioned (sub-folders) based on YYYY, MM, DD. The date range for specific analysis is stored in a separate PDS/VDS, which I wanted to use as a sub-query to pick the relevant time-window (hoping the partition pruning will work). When I use literal values, partition pruning works great, but same values through a sub-query goes for a full-scan.

In such scenarios, if I go with simple sub-query or a join, not being able to use partition pruning is part of the approach. But, if I use Common Table Expression (CTE) to determine temporary named result, I was hoping if the query execution plan can be broken into 2 parts; part 1 evaluates every CTE (wherever there are CTEs) and then part 2 receives input from part 1 and plans main SQL.

balaji.ramaswamy · May 13, 2024, 6:52am

@sbyte When you use sub query, partition pruning is achieved by using run time filtering. Are you able to send the profile as we can look to see how effective runtime filtering was?

Topic		Replies	Views
Dynamic Partitioning and Parameterized Queries in Dremio: Capabilities and Functionality	3	117	September 25, 2024
Dynamic Partition Handling for Event-Date Queries in Dremio	6	67	October 2, 2024
Advice on downstream reflections from INCREMENTAL physical	0	956	March 11, 2021
Can't setup a raw reflection on a virtual dataset out of memory	1	1149	March 25, 2019
Puzzling behavior with RAW Reflections and VDS	6	1289	August 31, 2021

Dynamic partition pruning

Related topics