How to make COPY INTO idempotent? (to prevent duplicates)

Hello community,

Is there an architectural design pattern for making ingestion of JSON data from folders idempotent? Consider a use case where we have a pipeline ingesting JSON data that lands daily in a folder; let's say we are using dbt-dremio with Airflow for orchestration.
If we run COPY INTO on the same source and folder twice (for example, if someone re-triggers the DAG), duplicate data is loaded into the Iceberg table. Is there a way to avoid this and build idempotent pipelines?
What's the best practice for writing data into Iceberg tables, considering PyIceberg doesn't yet support merges?

Ref - COPY INTO | Dremio Documentation

@robathija Have you looked at this?

Thanks for getting back, @balaji.ramaswamy. I just took a look, and this would work, but is it available on Dremio Software OSS, or are there plans to roll it out in the near future?
Appreciate your response.

Another reason I ask about Dremio Software is that it looks like Dremio Cloud doesn't support GCS (only Azure Data Lake Gen2 and AWS S3), and we are on GCP.

Hi Robathija,
Pipes is a Dremio Enterprise feature that is currently available on Dremio Cloud and will be available on Dremio Software in the future for S3 sources. GCP support is something we have on the roadmap.
Regarding how to implement this yourself on Dremio OSS:

  • Create an Iceberg table to track loaded files. It needs to include the file path and the source table.
  • Before each load, list all files in the path with the GCS list-objects API (List objects  |  Cloud Storage  |  Google Cloud).
  • Join the returned candidate files against the Iceberg table of previously loaded files and select only the files that have not been loaded before.
  • Construct a COPY INTO statement with an explicit list of the accepted files (see the sketch after this list).
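To make that concrete, here is a minimal Python sketch of those four steps. It assumes a GCS landing bucket and a placeholder `run_sql()` helper for executing SQL against Dremio (e.g. over Arrow Flight SQL or the REST API); the bucket, table names, the `@gcs_source` source name, and the helper are all illustrative, and the exact COPY INTO clauses should be verified against the Dremio docs for your version.

```python
# Manifest pattern for idempotent COPY INTO: load only files not seen before.
# All names (bucket, prefix, tables, @gcs_source) are hypothetical examples.
from google.cloud import storage

BUCKET = "my-landing-bucket"           # hypothetical landing bucket
PREFIX = "daily/json/"                 # hypothetical landing folder
TARGET_TABLE = "lake.events"           # hypothetical Iceberg target table
MANIFEST_TABLE = "lake.load_manifest"  # hypothetical tracking table (step 1)


def run_sql(query: str) -> list[tuple]:
    """Placeholder: execute SQL against Dremio and return result rows.

    Implement with Dremio's Arrow Flight SQL endpoint or REST API;
    omitted here to keep the sketch focused on the pattern.
    """
    raise NotImplementedError


def list_landed_files() -> set[str]:
    # Step 2: list every object currently sitting in the landing path.
    client = storage.Client()
    return {blob.name for blob in client.list_blobs(BUCKET, prefix=PREFIX)}


def list_loaded_files() -> set[str]:
    # Read the manifest of files already loaded into this target table.
    rows = run_sql(
        f"SELECT file_path FROM {MANIFEST_TABLE} "
        f"WHERE source_table = '{TARGET_TABLE}'"
    )
    return {row[0] for row in rows}


def load_new_files() -> None:
    # Step 3: keep only files that have never been loaded before.
    new_files = sorted(list_landed_files() - list_loaded_files())
    if not new_files:
        return  # re-triggering the DAG becomes a no-op

    # Step 4: COPY INTO with an explicit FILES list of the accepted files.
    files_clause = ", ".join(f"'{f}'" for f in new_files)
    run_sql(
        f"COPY INTO {TARGET_TABLE} "
        f"FROM '@gcs_source/{PREFIX}' "
        f"FILES ({files_clause}) "
        f"FILE_FORMAT 'json'"
    )

    # Record what was just loaded so the next run skips these files.
    values = ", ".join(f"('{f}', '{TARGET_TABLE}')" for f in new_files)
    run_sql(
        f"INSERT INTO {MANIFEST_TABLE} (file_path, source_table) "
        f"VALUES {values}"
    )
```

One caveat with this design: the COPY INTO and the manifest INSERT are two separate statements, so a failure between them leaves a window where a retry could reload the same files. If that matters for your pipeline, record the manifest rows in the same orchestration step with retry-safe logic, or deduplicate downstream as a backstop.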

Hope that helps!
Casey
