Hi, we are experiencing a duplicate data issue since we started refreshing metadata per partition instead of doing a full metadata refresh.
Data source: an AWS Glue partitioned table backed by Parquet files on S3, with around 100 partitions.
Our ETL-like service updates partition data by writing new S3 files, doing a partition switch to a sibling subfolder, and refreshing metadata per partition:
- The S3 folder
data/partition_x=1/partition_y=1/100001/..files..
is mapped to partition (x=1, y=1).
- A new folder
data/partition_x=1/partition_y=1/100002/..files..
is created with the new version of the files.
- Partition (x=1, y=1) is deleted using the AWS Glue API.
- Partition (x=1, y=1) is re-created with its S3 location pointing to the new folder 100002. The previous folder and its files are preserved to avoid a Dremio "file not found" error in case the metadata has not been updated yet.
- ALTER TABLE table_name REFRESH METADATA FOR PARTITIONS (x = '1', y = '1')
is called (a boto3 sketch of the whole switch is shown after this list).
- Querying Dremio then shows duplicated data (I suspect both the old and the new rows), while only the data from the new folder is expected.
- Running the same query in Athena shows only the new data, as expected.
- Running
ALTER TABLE table_name REFRESH METADATA FOR PARTITIONS
multiple times, or with a short delay in between, does not help.
- Running a full metadata refresh
ALTER TABLE table_name REFRESH METADATA
helps, and only the new data is shown.
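
For reference, here is roughly how our service performs the partition switch. This is a minimal boto3 sketch rather than our exact code, and the database, table, and bucket names are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Placeholder identifiers, not our real names.
DATABASE = "my_database"
TABLE = "table_name"
PARTITION_VALUES = ["1", "1"]  # (x=1, y=1)
NEW_LOCATION = "s3://my-bucket/data/partition_x=1/partition_y=1/100002/"

# Fetch the current partition so the new entry keeps the same storage
# descriptor (format, SerDe, columns) and only the location changes.
current = glue.get_partition(
    DatabaseName=DATABASE,
    TableName=TABLE,
    PartitionValues=PARTITION_VALUES,
)["Partition"]

storage_descriptor = dict(current["StorageDescriptor"])
storage_descriptor["Location"] = NEW_LOCATION

# Drop the old partition entry...
glue.delete_partition(
    DatabaseName=DATABASE,
    TableName=TABLE,
    PartitionValues=PARTITION_VALUES,
)

# ...and re-create it pointing at the new sibling folder (100002).
# The old folder (100001) and its files stay on S3 so that readers with
# stale metadata do not hit "file not found".
glue.create_partition(
    DatabaseName=DATABASE,
    TableName=TABLE,
    PartitionInput={
        "Values": PARTITION_VALUES,
        "StorageDescriptor": storage_descriptor,
    },
)

# After this, the per-partition refresh is issued against Dremio:
# ALTER TABLE table_name REFRESH METADATA FOR PARTITIONS (x = '1', y = '1')
```
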
As I understand it, extra options like FORCE UPDATE etc. don't make any difference in the case of AWS Glue / S3 storage?
What are we doing wrong?
Side question: how expensive is the full metadata refresh we now use as a hotfix, for a table with around 100 S3 files? Does it force around 100 S3 GetObject requests, or a mix of HeadObject/GetObject requests? Does it read the Parquet files fully?
Dremio (AWS Edition) 24.0.1-202303312317430103-9ff7aeed