Hi, we are experiencing a duplicate data issue since we started refreshing metadata per partition instead of doing a full metadata refresh.
Data source: an AWS Glue partitioned table backed by Parquet files on S3, with around 100 partitions.
Our ETL-like service updates partition data by writing new S3 files, doing a partition switch to a sibling subfolder, and refreshing metadata per partition:
- The S3 folder
data/partition_x=1/partition_y=1/100001/..files..
is mapped to partition (x=1, y=1).
- A new folder
data/partition_x=1/partition_y=1/100002/..files..
is created with the new version of the files.
- Partition (x=1, y=1) is deleted using the AWS Glue API.
- Partition (x=1, y=1) is re-created with its S3 location pointing to the new folder 100002. The previous folder and its files are preserved to avoid a Dremio "file not found" error in case the metadata has not been updated yet.
- ALTER TABLE table_name REFRESH METADATA FOR PARTITIONS (x = '1', y = '1')
is called (a boto3 sketch of the whole switch is shown after this list).
- Querying Dremio then shows duplicated data (I suspect both the old and the new rows), while only the data from the new folder is expected.
- Running the same query in Athena shows only the new data, as expected.
- Running
ALTER TABLE table_name REFRESH METADATA FOR PARTITIONS
multiple times, or with a short delay in between, does not help.
- Running a full metadata refresh
ALTER TABLE table_name REFRESH METADATA
helps, and only the new data is shown.
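
For reference, here is roughly how our service performs the partition switch. This is a minimal boto3 sketch rather than our exact code, and the database, table, and bucket names are placeholders:

```python
import boto3

glue = boto3.client("glue")

# Placeholder identifiers, not our real names.
DATABASE = "my_database"
TABLE = "table_name"
PARTITION_VALUES = ["1", "1"]  # (x=1, y=1)
NEW_LOCATION = "s3://my-bucket/data/partition_x=1/partition_y=1/100002/"

# Fetch the current partition so the new entry keeps the same storage
# descriptor (format, SerDe, columns) and only the location changes.
current = glue.get_partition(
    DatabaseName=DATABASE,
    TableName=TABLE,
    PartitionValues=PARTITION_VALUES,
)["Partition"]

storage_descriptor = dict(current["StorageDescriptor"])
storage_descriptor["Location"] = NEW_LOCATION

# Drop the old partition entry...
glue.delete_partition(
    DatabaseName=DATABASE,
    TableName=TABLE,
    PartitionValues=PARTITION_VALUES,
)

# ...and re-create it pointing at the new sibling folder (100002).
# The old folder (100001) and its files stay on S3 so that readers with
# stale metadata do not hit "file not found".
glue.create_partition(
    DatabaseName=DATABASE,
    TableName=TABLE,
    PartitionInput={
        "Values": PARTITION_VALUES,
        "StorageDescriptor": storage_descriptor,
    },
)

# After this, the per-partition refresh is issued against Dremio:
# ALTER TABLE table_name REFRESH METADATA FOR PARTITIONS (x = '1', y = '1')
```
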
As I understand it, extra options like FORCE UPDATE etc. don't make any difference in the case of AWS Glue / S3 storage?
What are we doing wrong?
Side question: how expensive is the full metadata refresh we now use as a hotfix, for a table with around 100 S3 files? Does it force around 100 S3 GetObject requests, or a mix of HeadObject/GetObject requests? Does it read the Parquet files fully?
Dremio (AWS Edition) 24.0.1-202303312317430103-9ff7aeed