I’m testing manual metadata refresh for specific partitions. My dataset has four partition levels (dir0, dir1, dir2, dir3 - year, month, day, hour). Is there any way to refresh metadata for a specific month or day? For example, like this:
ALTER TABLE "mydataset" REFRESH METADATA FOR PARTITIONS ( "dir0" = '2021' AND "dir1" = 'November' AND "dir2" = '8');
or
ALTER TABLE "mydataset" REFRESH METADATA FOR PARTITIONS ( "dir0" = '2021' AND "dir1" = 'November' AND "dir1" = 'October');
I have tested this, but it is not the correct syntax, and I haven’t found any example of usage with more than one partition level.
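For reference, the Dremio SQL documentation appears to describe the partition filter as a comma-separated list of name/value pairs rather than `AND`-joined conditions, so the general shape would presumably be (this is my reading of the docs, not verified behavior):

```sql
-- Assumed general form per the Dremio SQL reference:
-- partition filters are a comma-separated list of "name" = 'value' pairs
ALTER TABLE "mydataset"
REFRESH METADATA FOR PARTITIONS ("dir0" = '2021', "dir1" = 'November');
```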
This syntax did not give an execution error, but it seems to refresh metadata for the whole dataset. For example, I have run this command:
ALTER TABLE "mydataset" REFRESH METADATA FOR PARTITIONS ( "dir0" = '2021', "dir1" = 'November', "dir2" = '9', "dir3" = '11');
But the metadata for partition 2021/November/9/12 was also refreshed. Furthermore, the execution time for this refresh was very similar to a common refresh (without specifying the partition). This behavior repeated three times with other partitions, so it does not seem to be the expected behavior.
@eduardoslopes The background refresh would refresh all new partitions and all changed partitions. Is it possible the background metadata refresh would have completed before your ALTER PDS?
@balaji.ramaswamy I don’t think that is happening, because I changed the metadata refresh interval to 24 hours to rule out this possibility. Since I have tried the manual refresh several times, I don’t think that’s what is going on.
I had similar issues. For me, the root cause was that I was testing the functionality in a local container with distStorage set to “local”. Incremental metadata refresh only works if you’re using a non-local distStorage location (Dremio stores the Iceberg metadata there).
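For anyone hitting the same thing: the distributed store is configured via `paths.dist` in `dremio.conf`. A minimal sketch of a non-local setup (the bucket name and folder below are placeholders):

```hocon
paths: {
  # Distributed storage for metadata and accelerator data.
  # Must NOT be a local path for incremental (Iceberg-based)
  # metadata refresh to take effect.
  dist: "dremioS3:///my-bucket/dremio-dist"   # placeholder bucket/path
}
```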
Unfortunately, the syntax parses fine and no error is thrown, but the behavior is just not what you’d expect.
@eduardoslopes That is correct. To benefit from all the features of unlimited splits and Iceberg metadata, the metadata (dist) store needs to go to distributed storage like S3/HDFS/Azure Storage.
My test environment is already using distStorage pointed at an S3 bucket. Maybe my issue is that my data is JSON formatted. Does that make sense? Are these features available only for Iceberg-formatted datasets?
This is probably the problem, but the release notes did not connect the Unlimited Splits section with the section that explains manual metadata refresh, so I thought there was no dependency between the features.
@balaji.ramaswamy @tid @lenoyjacob Is there any plan to implement these metadata refresh improvements for JSON-formatted datasets in future versions?