I do see ManifestList Filter
and ManifestFile Filter
in the execution profile, when a query filters on a field used for partitioning - as long as the transform function is the identity function. Specifically, given the table (created using Spark):
CREATE TABLE iceberg_identity_date (timestamp timestamp, partition_day date) USING iceberg partitioned by (days(timestamp), partition_day)
-and data (inserted using Spark):
INSERT INTO iceberg_identity_date (timestamp, partition_day) VALUES (timestamp ‘now’, current_date)
INSERT INTO iceberg_identity_date (timestamp, partition_day) VALUES (timestamp ‘yesterday’, date_add(current_date, -1))
The following query in Dremio query filters using the manifest list and manifest files, because the partition transformation on partition_day
is the identity function:
SELECT * FROM iceberg_identity_date
where partition_day = current_date
-as seen in the final transformation in the execution profile:
00-07 TableFunction(columns=[splitsIdentity
, splits
, colIds
], ManifestFile Filter AnyColExpression=[(ref(name=“partition_day”) >= 19040 and ref(name=“partition_day”) <= 19040)]) : rowType = …
00-08 IcebergManifestList(table=[minio.“steen-test”.“default”.iceberg_identity_date], columns=[splitsIdentity
, splits
, colIds
], splits=[1], ManifestList Filter Expression =[(ref(name=“partition_day”) >= 19040 and ref(name=“partition_day”) <= 19040)]) : rowType = …
However, if we query by timestamp, filtering using manifest list and files doesn’t happen, as the partition transformation is days(timestamp)
:
SELECT * FROM iceberg_identity_date
where “timestamp” = current_timestamp
-as seen in the execution profile:
00-09 TableFunction(columns=[splitsIdentity
, splits
, colIds
]) : rowType = …
00-10 IcebergManifestList(table=[minio.“steen-test”.“default”.iceberg_identity_date], columns=[splitsIdentity
, splits
, colIds
], splits=[1]) : rowType =…
As simpler way is to look at the input records reported by the query, where the former has none and the latter has 2, but both return 0 records.
Looking in the Dremio source code, com.dremio.exec.store.iceberg.IcebergUtils:getInvalidColumnsForPruning(...)
does the filtering of partitioning columns used for pruning, where non-identity transformations are one of the filter criteria:
…
if (entry.getKey() != 0 || !partitionField.transform().isIdentity()) {
…
-which is used when generating splits:
…
/**
* we can not send partition column value for 1. nonIdentity columns or 2. columns which was not partition but later added as partition columns.
* because in case 1. information will be partial and scan will get incorrect values
* in case2. initially when column was not partition we don’t have value.
*/
if(invalidColumnsForPruning != null && invalidColumnsForPruning.contains(fileSchema.findField(field.sourceId()).name())) {
continue;
}
…
The code here documents that non-identity columns won’t be used for pruning. I am sure pruning is working as intended, for the moment, because supporting pruning on non-identity transforms seems to require some more work than on simple identity ones.
That said, coming back to my original question - would non-identity transform support “just” need some more work and come sometime in a future release, or is something actually hindering this (either technically or simply due to priority)?
I appreciate if you aren’t able to comment on roadmap or make any promises - it would just be great input when deciding how to optimally design our data lake, while still being able to use Dremio properly on top.