Dynamic partition pruning

Does Dremio support dynamic partition pruning on reflections?

I have a virtual dataset called “my_detail_view” which has a raw reflection on it that is partitioned by effective_date.

In the below query the values for effective_date are supplied explicitly and judging by the Input Record counts on the Job overview, dremio loads only the data for the partitions with the effective_date ‘20190118’, ‘20190117’ which is an expected number of records.

	select distinct r.effective_date, r.source
	from "NAS_POC".my_detail_view r
	where r.effective_date in ('20190118', '20190117')

Input Bytes: 573.60 MB
Input Records: 25,934,501

However, in the the below query, the value of the effective_date is retrieved dynamically from a subquery (which returns values ‘20190118’, ‘20190117’). In this case however, dremio loads all the data into memory rather than just for the partitions which utilizes significantly more memory and takes longer to run.

	select distinct r.effective_date, r.source
	from "NAS_POC".my_detail_view r
	where r.effective_date in (select effective_date from NAS_POC."dates") 
Input Bytes: 4.78 GB
Input Records: 221,172,150
1 Like

I’m afraid not at the moment. As partition pruning is done during planning, partition values have to be known before execution.

hi @laurent

Is this a feature that is planned to be implemented in dremio at some point? Is there an alternative way you would suggest?

Our use case is running a report from tableau which needs data from a dataset for a dynamic set of dates based on another query. Even though we created a reflection with a partition on the effective_date, dremio will not take advantage of it in this case and loads all records into memory. The effect is that more memory is needed for the query to run and sometimes hits memory limits we have.

1 Like

It might be implemented at some point, although it’s not on our immediate roadmap at the moment. Unfortunately I don’t have an alternative to offer you.

I have very similar use case, where most of my users are doing time-series analysis. My data is stored in ADLS, partitioned (sub-folders) based on YYYY, MM, DD. The date range for specific analysis is stored in a separate PDS/VDS, which I wanted to use as a sub-query to pick the relevant time-window (hoping the partition pruning will work). When I use literal values, partition pruning works great, but same values through a sub-query goes for a full-scan.

In such scenarios, if I go with simple sub-query or a join, not being able to use partition pruning is part of the approach. But, if I use Common Table Expression (CTE) to determine temporary named result, I was hoping if the query execution plan can be broken into 2 parts; part 1 evaluates every CTE (wherever there are CTEs) and then part 2 receives input from part 1 and plans main SQL.

@sbyte When you use sub query, partition pruning is achieved by using run time filtering. Are you able to send the profile as we can look to see how effective runtime filtering was?