Hi, we started seeing duplicate data when we switched from a full metadata refresh to refreshing metadata per partition.
Data source: an AWS Glue partitioned table backed by Parquet files on S3, with around 100 partitions.

Our ETL-like service updates partition data by writing new S3 files, switching the partition to a sibling subfolder, and then refreshing metadata for that partition:

  1. The S3 folder data/partition_x=1/partition_y=1/100001/..files.. is mapped to partition (x=1, y=1).
  2. A new folder data/partition_x=1/partition_y=1/100002/..files.. is created with the new version of the files.
  3. Partition (x=1, y=1) is deleted using the AWS Glue API.
  4. Partition (x=1, y=1) is recreated with its S3 location pointing to the new folder 100002. The previous folder and its files are kept to avoid a Dremio "file not found" error in case the metadata has not been updated yet.
  5. ALTER TABLE table_name REFRESH METADATA FOR PARTITIONS (x= '1', y='1') is called.
  6. Querying Dremio shows duplicated data (I suspect old and new), while only data from the new folder is expected.
  7. Running the same query in Athena shows only the new data, as expected.
  8. Running ALTER TABLE table_name REFRESH METADATA FOR PARTITIONS multiple times, or after a short delay, doesn't help.
  9. Running a full metadata update, ALTER TABLE table_name REFRESH METADATA, helps: only the new data is shown.
    As I understand it, extra options such as FORCE UPDATE don't make any difference for AWS Glue / S3 storage. Is that correct?
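
As a minimal sketch, steps 3 to 5 could look like the following in Python. It assumes a boto3 Glue client is passed in as `glue`; the function names `switch_partition` and `refresh_partition_sql`, and any database or bucket names used with them, are hypothetical and only illustrate the sequence described above. How the resulting SQL is submitted to Dremio is left out.

```python
from typing import Any, Dict, List


def switch_partition(glue: Any, database: str, table: str,
                     values: List[str], new_location: str) -> None:
    """Re-point one Glue partition at a new S3 folder (steps 3 and 4).

    `glue` is assumed to behave like boto3.client("glue").
    """
    # Reuse the existing storage descriptor (format, SerDe, columns)
    # and change only the S3 location.
    current = glue.get_partition(
        DatabaseName=database, TableName=table, PartitionValues=values
    )["Partition"]
    descriptor = dict(current["StorageDescriptor"])
    descriptor["Location"] = new_location

    # Step 3: delete the partition that points at the old folder.
    glue.delete_partition(
        DatabaseName=database, TableName=table, PartitionValues=values
    )
    # Step 4: recreate it, mapped to the new folder.
    glue.create_partition(
        DatabaseName=database,
        TableName=table,
        PartitionInput={"Values": values, "StorageDescriptor": descriptor},
    )


def refresh_partition_sql(table: str, partition: Dict[str, str]) -> str:
    """Build the per-partition refresh statement from step 5.

    No quoting or escaping is applied; names and values are assumed trusted.
    """
    clause = ", ".join(f"{name} = '{value}'" for name, value in partition.items())
    return f"ALTER TABLE {table} REFRESH METADATA FOR PARTITIONS ({clause})"
```

For example, `refresh_partition_sql("table_name", {"x": "1", "y": "1"})` returns `ALTER TABLE table_name REFRESH METADATA FOR PARTITIONS (x = '1', y = '1')`.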

What are we doing wrong?

Side question: how expensive is the full metadata refresh that we now use as a hotfix, for a table with 100 S3 files? Does it force 100 S3 GetObject requests, or a mix of HEAD and GET requests? Does it load the Parquet files fully?

Dremio (aws edition) 24.0.1-202303312317430103-9ff7aeed

@vladislav-stolyarov If you do not add the partition clause, only the first ever refresh of a dataset is full; every refresh after that is incremental. For example, with 100 partitions the first refresh processes all 100, and if 5 are added later, the next refresh processes only the 5 new ones. Meanwhile, we will look into the partition refresh issue.
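
To make the 100-then-5 arithmetic concrete, here is a toy model in Python. It is only an illustration of the behavior described above, not Dremio's actual implementation; the function name `partitions_to_scan` and the partition values are made up, and the model covers added partitions only.

```python
from typing import Iterable, Set, Tuple

Partition = Tuple[str, str]  # (x, y) partition values


def partitions_to_scan(known: Set[Partition],
                       listed: Iterable[Partition]) -> Set[Partition]:
    """Toy model: a refresh reads only partitions not already known.

    On the first refresh `known` is empty, so every listed partition is read.
    """
    return set(listed) - known


# 100 partitions exist at the first refresh.
first = {(str(x), str(y)) for x in range(10) for y in range(10)}
# 5 partitions are added afterwards.
added = {("10", str(y)) for y in range(5)}

print(len(partitions_to_scan(set(), first)))          # 100
print(len(partitions_to_scan(first, first | added)))  # 5
```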