Syntax of Refresh Metadata for partitions

Hello,

I’m testing manual metadata refresh for specific partitions. My dataset has four partition levels (dir0, dir1, dir2, dir3, corresponding to year, month, day, hour). Is there any way to refresh metadata for a specific month or day? For example, like this:

ALTER TABLE "mydataset" REFRESH METADATA FOR PARTITIONS ( "dir0" = '2021' AND "dir1" = 'November' AND "dir2" = '8');

or

ALTER TABLE "mydataset" REFRESH METADATA FOR PARTITIONS ( "dir0" = '2021' AND "dir1" = 'November' AND "dir1" = 'October');

I have tested both, but neither is the correct syntax, and I haven’t found any example that uses more than one partition level.

Try comma-separated values, like:

ALTER TABLE "mydataset" REFRESH METADATA FOR PARTITIONS ("dir0" = '2021', "dir1" = 'November', "dir2" = '01', "dir3" = '01');
ALTER TABLE "mydataset" REFRESH METADATA FOR PARTITIONS ("dir0" = '2021', "dir1" = 'October', "dir2" = '01', "dir3" = '01');

Note that a command must include all partition levels.
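Since each command has to name every partition level, covering a whole day or month means issuing one statement per leaf partition. A minimal Python sketch of generating those statements follows. The helper name `refresh_statements` is hypothetical, and the hour directories are assumed to be named 0 to 23, which may not match your layout. The SQL text itself follows the form shown above.

```python
from itertools import product


def refresh_statements(table, levels):
    """Build one REFRESH METADATA statement per fully specified partition.

    `levels` maps every partition column, in order (dir0, dir1, ...), to the
    list of values to cover at that level. Values are inserted as-is, so
    only pass trusted strings.
    """
    columns = list(levels)
    statements = []
    # product() enumerates every combination of values across the levels.
    for values in product(*(levels[c] for c in columns)):
        pairs = ", ".join(
            f'"{c}" = ' + f"'{v}'" for c, v in zip(columns, values)
        )
        statements.append(
            f'ALTER TABLE "{table}" REFRESH METADATA FOR PARTITIONS ({pairs});'
        )
    return statements


# Assumption: hour directories under 2021/November/8 are named 0..23.
stmts = refresh_statements(
    "mydataset",
    {
        "dir0": ["2021"],
        "dir1": ["November"],
        "dir2": ["8"],
        "dir3": [str(hour) for hour in range(24)],
    },
)
for statement in stmts:
    print(statement)
```

Each printed statement can then be run on its own. Listing only the directories that actually exist avoids issuing refreshes for partitions that are not there.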

Thanks, @lenoyjacob!

This syntax did not give an execution error, but it seems to refresh metadata for the whole dataset. For example, I ran this command:

ALTER TABLE "mydataset" REFRESH METADATA FOR PARTITIONS ( "dir0" = '2021',  "dir1" = 'November', "dir2" = '9', "dir3" = '11');

But the metadata for partition 2021/November/9/12 was also refreshed. Furthermore, the execution time for this refresh was very similar to that of a regular refresh (without specifying a partition). I saw the same behavior 3 times with other partitions, so this does not seem to be the expected behavior.

@eduardoslopes The background refresh would refresh all new partitions and all changed partitions. Is it possible the background metadata refresh would have completed before your ALTER PDS?

@balaji.ramaswamy I don’t think that is what happened, because I changed the metadata refresh interval to 24 hours to rule out this possibility. Since I have also run the manual refresh several times, I think this is not possible.

@lenoyjacob @balaji.ramaswamy Do you have any idea how to solve this?

Hey @eduardoslopes. I’m unable to reproduce what you are seeing. I’ve got a dataset with two sets of partitions 2021/09/29/14 and 2021/09/30/14.

Check count of existing records:

It’s got 5000 records per partition. Let’s add a new parquet file to both partitions via AWS S3.

Let’s refresh metadata only for 2021/09/29/14. The expectation is that only the count for this partition increases.

Check count again.

As you can see only the count in 2021/09/29/14 changed. The other partition count remained the same despite adding a new file.
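The queries and results for the steps above are not included in the thread. As a rough sketch of the check, the per-partition counts can be captured before and after the refresh and compared. The query text, the helper name `changed_partitions`, and the "after" count of 6000 are illustrative assumptions; only the 5000 records per partition comes from the post.

```python
# Assumed count query; run it in Dremio before and after the refresh.
COUNT_QUERY = (
    'SELECT "dir0", "dir1", "dir2", "dir3", COUNT(*) AS "records" '
    'FROM "mydataset" '
    'GROUP BY "dir0", "dir1", "dir2", "dir3"'
)


def changed_partitions(before, after):
    """Return {partition: (old_count, new_count)} for counts that differ."""
    changed = {}
    for partition in sorted(set(before) | set(after)):
        old = before.get(partition, 0)
        new = after.get(partition, 0)
        if old != new:
            changed[partition] = (old, new)
    return changed


# Counts keyed by (dir0, dir1, dir2, dir3).
before = {
    ("2021", "09", "29", "14"): 5000,
    ("2021", "09", "30", "14"): 5000,
}
# Hypothetical result after refreshing only 2021/09/29/14.
after = {
    ("2021", "09", "29", "14"): 6000,
    ("2021", "09", "30", "14"): 5000,
}

print(changed_partitions(before, after))
```

If the partition-level refresh works as intended, only the refreshed partition should appear in the result, even though new files were added to both.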

Hello @eduardoslopes,

I had similar issues. For me, the root cause was that I was testing the functionality in a local container with distStorage set to “local”. Incremental metadata refresh only works if you’re using a non-local distStorage location (Dremio stores the Iceberg metadata there).
Unfortunately, the syntax is accepted and no error is thrown, but the behavior is just not what you’d expect.

Thanks, Tim

@tid Thanks a lot for catching that,

@eduardoslopes That is correct. To benefit from all the features of unlimited splits and Iceberg data, the metadata or dist store needs to go to distributed storage like S3/HDFS/Azure Storage.

Thanks @lenoyjacob @tid @balaji.ramaswamy,

My test environment already uses an S3 bucket for distStorage. Maybe my issue is that my data is JSON formatted. Does that make sense? Are these features available only for Iceberg-formatted datasets?

Hi @eduardoslopes,

That is most likely the root cause.
The release notes (Dremio) say:

(quoting)
When preview access is activated, the split limitation is removed for the following types of datasets:

  • FileSystem sources (S3, ADLS, HDFS) using:
    • Parquet formatted tables
    • Iceberg formatted tables
    • Delta Lake formatted tables
  • Hive sources (Hive 2 and Hive 3) using:
    • Parquet formatted tables
    • Avro formatted tables
    • ORC formatted tables (non-transactional only)

(end of quote)

Best regards, Tim
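As a quick way to check a dataset against the list quoted above, the combinations can be written down as a small lookup. This is only a sketch: the names `UNLIMITED_SPLITS_FORMATS` and `is_listed` are hypothetical, the table just mirrors the quoted release notes, and it does not query Dremio in any way.

```python
# Source kind -> table formats named in the quoted release notes.
UNLIMITED_SPLITS_FORMATS = {
    "filesystem": {"parquet", "iceberg", "delta lake"},  # S3, ADLS, HDFS
    "hive": {"parquet", "avro", "orc"},  # ORC: non-transactional only
}


def is_listed(source_kind, table_format):
    """Return True if the combination appears in the quoted list."""
    formats = UNLIMITED_SPLITS_FORMATS.get(source_kind.lower(), set())
    return table_format.lower() in formats


print(is_listed("filesystem", "Parquet"))
print(is_listed("filesystem", "JSON"))
```

JSON on a FileSystem source is not in the list, which is consistent with the behavior described earlier in the thread.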

Thanks @tid,

This is probably the problem, but the release notes do not connect the Unlimited Splits section with the section that explains manual metadata refresh, so I thought there was no dependency between the two features.

@balaji.ramaswamy @tid @lenoyjacob Is there any plan to also implement these metadata refresh improvements for JSON-formatted datasets in future versions?