How to make COPY INTO idempotent? (to prevent duplicates)

Hello community,

Is there an architectural design pattern for making ingestion of JSON data from folders idempotent? Consider a use case where we have a pipeline ingesting JSON data that lands daily in a folder; let's say we are using dbt-dremio with Airflow for orchestration.
If we run COPY INTO on the same source and folder twice (for example, if someone re-triggers the DAG), duplicate data is loaded into the Iceberg table. Is there a way to avoid this and build idempotent pipelines?
What's the best practice for writing data into Iceberg tables, considering PyIceberg doesn't yet support merges?

Ref - COPY INTO | Dremio Documentation

@robathija Have you looked at this?

Thanks for getting back, @balaji.ramaswamy. I just took a look, and this would work, but is it available on Dremio Software OSS, or are there plans to roll it out in the near future?
Appreciate your response.

Another reason I ask about Dremio Software is that it looks like Dremio Cloud doesn't support GCS (only Azure Data Lake Gen2 and AWS S3), and we are on GCP.

Hi Robathija,
Pipes is a Dremio Enterprise feature that is currently available on Dremio Cloud and will be available on Dremio Software in the future for S3 sources. GCP support is something we have on the roadmap.
Regarding how to implement this yourself on Dremio OSS:

  • Create an Iceberg table to track loaded files. It needs to include the file path and the source table.
  • Before each load, list all files in the path with the GCS list-objects API (List objects  |  Cloud Storage  |  Google Cloud).
  • Join the returned candidate files against the Iceberg table of previously loaded files and select only the files that have not been loaded before.
  • Construct a COPY INTO statement with an explicit list of the accepted files (see the sketch after this list).
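To make that concrete, here is a minimal Python sketch of those four steps. It assumes a GCS landing bucket and a placeholder `run_sql()` helper for executing SQL against Dremio (e.g. over Arrow Flight SQL or the REST API); the bucket, table names, the `@gcs_source` source name, and the helper are all illustrative, and the exact COPY INTO clauses should be verified against the Dremio docs for your version.

```python
# Manifest pattern for idempotent COPY INTO: load only files not seen before.
# All names (bucket, prefix, tables, @gcs_source) are hypothetical examples.
from google.cloud import storage

BUCKET = "my-landing-bucket"           # hypothetical landing bucket
PREFIX = "daily/json/"                 # hypothetical landing folder
TARGET_TABLE = "lake.events"           # hypothetical Iceberg target table
MANIFEST_TABLE = "lake.load_manifest"  # hypothetical tracking table (step 1)


def run_sql(query: str) -> list[tuple]:
    """Placeholder: execute SQL against Dremio and return result rows.

    Implement with Dremio's Arrow Flight SQL endpoint or REST API;
    omitted here to keep the sketch focused on the pattern.
    """
    raise NotImplementedError


def list_landed_files() -> set[str]:
    # Step 2: list every object currently sitting in the landing path.
    client = storage.Client()
    return {blob.name for blob in client.list_blobs(BUCKET, prefix=PREFIX)}


def list_loaded_files() -> set[str]:
    # Read the manifest of files already loaded into this target table.
    rows = run_sql(
        f"SELECT file_path FROM {MANIFEST_TABLE} "
        f"WHERE source_table = '{TARGET_TABLE}'"
    )
    return {row[0] for row in rows}


def load_new_files() -> None:
    # Step 3: keep only files that have never been loaded before.
    new_files = sorted(list_landed_files() - list_loaded_files())
    if not new_files:
        return  # re-triggering the DAG becomes a no-op

    # Step 4: COPY INTO with an explicit FILES list of the accepted files.
    files_clause = ", ".join(f"'{f}'" for f in new_files)
    run_sql(
        f"COPY INTO {TARGET_TABLE} "
        f"FROM '@gcs_source/{PREFIX}' "
        f"FILES ({files_clause}) "
        f"FILE_FORMAT 'json'"
    )

    # Record what was just loaded so the next run skips these files.
    values = ", ".join(f"('{f}', '{TARGET_TABLE}')" for f in new_files)
    run_sql(
        f"INSERT INTO {MANIFEST_TABLE} (file_path, source_table) "
        f"VALUES {values}"
    )
```

One caveat with this design: the COPY INTO and the manifest INSERT are two separate statements, so a failure between them leaves a window where a retry could reload the same files. If that matters for your pipeline, record the manifest rows in the same orchestration step with retry-safe logic, or deduplicate downstream as a backstop.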

Hope that helps!
Casey
