I am running Dremio Cloud (not Dremio Software) with S3 as a source.
I formatted a folder of Parquet files that is laid out like this:
└── year=2023
    ├── month=08
    │   ├── day=19
    │   │   └── userdata.parquet
    │   ├── day=20
    │   │   └── userdata.parquet
    │   └── day=21
    │       └── userdata.parquet
    └── month=09  (added later, after the first ingestion; this is the partition I am trying to refresh)
        └── day=15
            └── userdata.parquet
I then queried the table and saw all my data correctly.
After that, I added new data under month=09/day=15.
Running the following statement gives me the error in the screenshot (quoted after the statement):
ALTER TABLE "xyz"."standalone-test_1" REFRESH METADATA FOR PARTITIONS (
    dir0 = 'year=2023',
    dir1 = 'month=09',
    dir2 = 'day=15',
    "year" = '2023',
    "month" = '09',
    "day" = '15'
);
“Input error. Expected partition dir0”
Another screenshot, which I got by accidentally misnaming one of the partitions (I cannot attach it because of new-signup restrictions), shows the list of expected partitions. I can confirm that I have included all of them, but I am still unable to refresh the metadata for the partition.
REFRESH METADATA without “FOR PARTITIONS” works as intended, though.
What am I doing wrong? Please help.
I am attaching the folder for reference.
test_parquet.zip (154.9 KB)
@rohitshetty, Welcome to Dremio Community!
Try the following command; it works for me on your sample dataset:
ALTER TABLE path.to.dataset REFRESH METADATA FOR PARTITIONS (
    dir0 = 'year=2023',
    dir1 = 'month=08',
    dir2 = 'day=21'
);
Hi @lenoyjacob,
One thing I think @rohitshetty forgot to mention (we work together) is that our source has “Enable partition column inference” turned on, which creates partition columns for the inferred columns too. If we do not include those partitions in the REFRESH command, we get the following error:
When we do include them we get the original error “Input error. Expected partition dir0”.
Do you think you can run your test again with the partition inference enabled?
All the help is very much appreciated.
Cheers,
Jonathan
Yup, looks like a bug. I’ve raised a ticket internally. As a workaround, disable partition inference and use the metadata refresh query I posted above.
For regular queries, you can create a View that strips the “year=”, “month=” and “day=” prefixes from the dir0, dir1 and dir2 values using something like split_part().
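A minimal sketch of such a view, assuming partition inference is disabled so that the table exposes only dir0, dir1 and dir2. The view path "my_space"."userdata_by_date" is a hypothetical name, and the table path is the one from the original post:

```sql
-- Hypothetical view name; the source table is the one from the original post.
-- SPLIT_PART(value, '=', 2) keeps the part after the '=' sign,
-- e.g. 'year=2023' becomes '2023'. The results are strings.
CREATE OR REPLACE VIEW "my_space"."userdata_by_date" AS
SELECT
    t.*,
    SPLIT_PART(t.dir0, '=', 2) AS "year",
    SPLIT_PART(t.dir1, '=', 2) AS "month",
    SPLIT_PART(t.dir2, '=', 2) AS "day"
FROM "xyz"."standalone-test_1" AS t;
```

Queries against the view can then filter on "year", "month" and "day" directly. Wrap the SPLIT_PART calls in CAST if numeric columns are preferred.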
Thank you @lenoyjacob , and thank you for raising the ticket too.
I disabled partition inference, and used metadata refresh just as you did, and can confirm it works.
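For the partition that was added later, the workaround would look roughly like this. It is a sketch that assumes partition inference is disabled on the source, so only the dirN columns are listed:

```sql
-- Refresh only the newly added month=09/day=15 partition.
-- With partition inference disabled, the inferred "year"/"month"/"day"
-- columns do not exist and must not be listed here.
ALTER TABLE "xyz"."standalone-test_1" REFRESH METADATA FOR PARTITIONS (
    dir0 = 'year=2023',
    dir1 = 'month=09',
    dir2 = 'day=15'
);
```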
I am wondering what the implications of not using partition inference are. Would there be performance penalties?
@rohitshetty @jdwills Quick update. Turns out this has been fixed in 24.2.3 and 23.2.3. And should be fixed in the next release cycle for Dremio Cloud.
There shouldn’t be a performance implication using the workaround. You should be able to see partition pruning happening in the raw profile of the query. IMO, partition inference is more of a convenience feature.
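One way to check this, sketched under the assumption that partition inference is disabled and the table path is the one from the original post, is to run a query that filters on the dirN columns and then inspect its raw profile for pruning:

```sql
-- Filter on the directory columns so the planner can prune
-- partitions other than year=2023/month=09.
SELECT COUNT(*) AS row_count
FROM "xyz"."standalone-test_1"
WHERE dir0 = 'year=2023'
  AND dir1 = 'month=09';
```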
Thanks!
And should be fixed in the next release cycle for Dremio Cloud.
That is great to hear! Do you know the approximate time when that would be?
Thank you again for all your guidance so far!
This should now be fixed. It was part of the November 16th, 2023 update of Dremio Cloud; see the Changelog in the Dremio Documentation.