If one is starting from scratch how is it that Iceberg / Nessie write to postgres - I would think I would need to create the table(s) and communicate that in a config file.
In the case of Dremio, is Dremio writing that catalog information to Postgres such that the Iceberg metadata files pointers are referenced in the Postgres tables?
It’s not clear to me who does what with Iceberg Nessie and Dremio. On the one hand, I get the impression that Nessie and Iceberg are code libraries from which Dremio can call certain functions. On the other hand, if Dremio has created its own SQL statement Dremio can use that directly without referencing those libraries. But then how does the Postgres catalog get updated? The answer could be “both”.
Thanks in advance for your patience and your kind response.
Dremio doesn’t write to the Nessie’s persistent store directly. I think Postgres in your case. Dremio talks to Nessie through Nessie’s APIs to retrieve Iceberg metadata locations and manage branching/multi-table transaction features.
Benny, Thanks for your reply! Your answer assumes I have Nessie and Iceberg deployed. I guess another way to ask my question, if I don’t have a catalog configured, what would I have to do? How do I ‘install’ Iceberg? I am assuming that would involve postgres but it’s not clear to me how I would implement Iceberg and or Nessie.
I think your main decision is which Iceberg catalog do you want to use? Once you decide on the catalog and setup that catalog as a source in Dremio, SELECT and DML queries on Iceberg tables just works. There’s nothing to ‘install’ for Iceberg table format specifically.
I think if you are just trying out Dremio, you can start with a filesystem source like S3 or NAS which uses the Hadoop Iceberg catalog behind the scenes. Once you are comfortable with Dremio and Iceberg tables, you can try out a more production ready catalog like Nessie.
Great question! Let me clarify how Iceberg, Nessie, Dremio, and Postgres interact.
How does Iceberg/Nessie write to Postgres?
In a typical setup with Nessie and Iceberg, Postgres is used as the metadata store to keep track of table versions, branches, commits, and other metadata related to Iceberg tables. Nessie, which acts like a Git for your data lake, stores these metadata pointers in Postgres, not the actual data itself. The data stays in your object store (like MinIO, S3, etc.). You don’t need to manually create tables in Postgres—Nessie manages that for you as it handles the metadata catalog.
What about Dremio?
Dremio integrates with Nessie to manage and query Iceberg tables. It doesn’t write directly to Postgres. Instead, Dremio interacts with Nessie for metadata and to read/write Iceberg tables, which are stored in your object store (e.g., MinIO). So, when Dremio runs a SQL query, it fetches metadata from Nessie (which might be in Postgres), but the actual data itself is handled by Iceberg in the object store.
SQL Example:
If you’re working in Dremio, you could run a query like this to create a table from CSV data stored in MinIO:
CREATE TABLE nessie.my_table
IN NESSIE.my_branch
AS SELECT * FROM my_minio_csv;
This SQL will create an Iceberg table with the data from MinIO, and Nessie will handle the versioning and metadata.
How is the Postgres catalog updated?
Nessie handles all updates to the Postgres catalog. Whenever you make changes (e.g., adding new data or branching your datasets), Nessie records the metadata in Postgres. Dremio interacts with Nessie for table operations, and Nessie ensures that the metadata stored in Postgres is always up-to-date.
Testing this Setup
The easiest way to try this setup is with Docker Compose, which can help you run Dremio, Nessie, Postgres, and MinIO locally. Here’s a link to the Docker Compose setup that will get you started:
GitHub repo for Docker Compose setup:
Specific tutorial for Spark & Dremio integration:
Key Points:
Postgres is used by Nessie to store metadata but isn’t required for every setup (you can run Nessie in-memory for local development).
Dremio doesn’t write directly to Postgres but interacts with Nessie to query and manage Iceberg tables.
Iceberg manages the actual data in your object store (like MinIO), while Nessie keeps track of the versions and metadata.
Let me know if this helps or if you have any other questions!