Syntax of Refresh Metadata for partitions

eduardoslopes · November 8, 2021, 8:00pm

Hello,

I’m testing manual metadata refresh for specific partitions. My dataset has four partition levels (dir0, dir1, dir2, dir3, corresponding to year, month, day, hour). Is there any way to refresh metadata for a specific month or day? For example, like this:

ALTER TABLE "mydataset" REFRESH METADATA FOR PARTITIONS ( "dir0" = '2021' AND "dir1" = 'November' AND "dir2" = '8');

or

ALTER TABLE "mydataset" REFRESH METADATA FOR PARTITIONS ( "dir0" = '2021' AND "dir1" = 'November' AND "dir1" = 'October');

I have tested these, but they are not the correct syntax, and I haven’t found any example of usage with more than one partition level.

lenoyjacob · November 8, 2021, 11:28pm

Try comma separated values, like:

ALTER TABLE "mydataset" REFRESH METADATA FOR PARTITIONS ("dir0" = '2021', "dir1" = 'November', "dir2" = '01', "dir3" = '01');
ALTER TABLE "mydataset" REFRESH METADATA FOR PARTITIONS ("dir0" = '2021', "dir1" = 'October', "dir2" = '01', "dir3" = '01');

Note that a command must include all partition levels.
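
Since every partition level has to be listed, refreshing a whole day presumably takes one statement per hour directory under that day. A minimal sketch, assuming the hour directories under 2021/November/8 are named '10' and '11' (hypothetical values; use whatever directory names actually exist in your source):

```sql
-- One statement per hour directory; the hour values below are illustrative only.
ALTER TABLE "mydataset" REFRESH METADATA FOR PARTITIONS ("dir0" = '2021', "dir1" = 'November', "dir2" = '8', "dir3" = '10');
ALTER TABLE "mydataset" REFRESH METADATA FOR PARTITIONS ("dir0" = '2021', "dir1" = 'November', "dir2" = '8', "dir3" = '11');
```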

eduardoslopes · November 9, 2021, 2:09pm

Thanks, @lenoyjacob!

This syntax did not give an execution error, but it seems to be refreshing metadata for the whole dataset. For example, I ran this command:

ALTER TABLE "mydataset" REFRESH METADATA FOR PARTITIONS ( "dir0" = '2021',  "dir1" = 'November', "dir2" = '9', "dir3" = '11');

But the metadata for partition 2021/November/9/12 was also refreshed. Furthermore, the execution time for this refresh was very similar to that of a common refresh (without specifying the partition). This behavior was repeated three times with other partitions, so it does not seem to be the expected behavior.

balaji.ramaswamy · November 10, 2021, 6:43am

@eduardoslopes The background refresh would refresh all new partitions and all changed partitions. Is it possible the background metadata refresh would have completed before your ALTER PDS?

eduardoslopes · November 10, 2021, 11:48am

@balaji.ramaswamy I don’t think this is happening, because I changed the metadata refresh interval to 24 hours to avoid this possibility. Since I have tried the manual refresh several times, I don’t think this is possible.

eduardoslopes · November 12, 2021, 6:33pm

@lenoyjacob @balaji.ramaswamy Do you have any idea how to solve this?

lenoyjacob · November 12, 2021, 10:07pm

Hey @eduardoslopes. I’m unable to reproduce what you are seeing. I’ve got a dataset with two sets of partitions 2021/09/29/14 and 2021/09/30/14.

Check count of existing records:

It’s got 5000 records per partition. Let’s add a new parquet file to both partitions via AWS S3.

Let’s refresh metadata only for 2021/09/29/14. The expectation is that only the count for this partition increases.

Check count again.

As you can see, only the count in 2021/09/29/14 changed. The other partition’s count remained the same despite the new file.
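
The queries and screenshots from this walkthrough are not reproduced in the thread text. A rough sketch of the steps, assuming the dataset is called "mydataset" (a hypothetical name) and is partitioned as dir0/dir1/dir2/dir3:

```sql
-- 1. Record count per partition before the refresh.
SELECT dir0, dir1, dir2, dir3, COUNT(*) AS record_count
FROM "mydataset"
GROUP BY dir0, dir1, dir2, dir3
ORDER BY dir0, dir1, dir2, dir3;

-- 2. After adding a new file to both partitions, refresh only 2021/09/29/14.
ALTER TABLE "mydataset" REFRESH METADATA FOR PARTITIONS ("dir0" = '2021', "dir1" = '09', "dir2" = '29', "dir3" = '14');

-- 3. Re-run the count query from step 1; only 2021/09/29/14 is expected to grow.
```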

tid · November 15, 2021, 9:07am

Hello @eduardoslopes,

I had similar issues. For me, the root cause was that I was testing the functionality in a local container with distStorage set to “local”. Incremental metadata refresh only works if you’re using a non-local distStorage location (Dremio stores the Iceberg metadata there).
Unfortunately, the syntax is accepted and no error is thrown, but the behavior is just not what you’d expect.

Thanks, Tim

balaji.ramaswamy · November 15, 2021, 2:01pm

@tid Thanks a lot for catching that.

@eduardoslopes That is correct. To benefit from all features of unlimited splits and Iceberg data, the metadata or dist store needs to go to distributed storage like S3/HDFS/Azure Storage.

eduardoslopes · November 16, 2021, 2:40pm

Thanks @lenoyjacob @tid @balaji.ramaswamy,

My test environment is already using distStorage on an S3 bucket. Maybe my issue is that my data is JSON formatted. Does that make sense? Are these features available only for Iceberg formatted datasets?

tid · November 16, 2021, 9:41pm

Hi @eduardoslopes,

that is most likely the root cause.
The release notes (Dremio) say:

(quoting)
When preview access is activated, the split limitation is removed for the following types of datasets:

FileSystem sources (S3, ADLS, HDFS) using:
- Parquet formatted tables
- Iceberg formatted tables
- Delta Lake formatted tables
Hive sources (Hive 2 and Hive 3) using:
- Parquet formatted tables
- Avro formatted tables
- ORC formatted tables (non-transactional only)

(end of quote)

Best regards, Tim

eduardoslopes · November 17, 2021, 12:07pm

Thanks @tid,

This is probably the problem, but the release notes did not connect the Unlimited Splits section with the section that explains manual metadata refresh, so I thought there was no dependency between the features.

@balaji.ramaswamy @tid @lenoyjacob Is there any plan to implement these metadata refresh improvements for JSON formatted datasets as well in future versions?
