Iceberg currently supports a lot of catalog implementations for common engines like Spark and Flink. I was wondering what the best catalog would be when Iceberg and Dremio are used in a multi-engine environment.
This thread is inspired by the discussions in:
1. Can Dremio be co-used with other compute engine for modifying Iceberg table?
2. Incremental data reflection with Iceberg
I’d like to scope it to the “Dremio Software” edition only, since Arctic is available in Dremio Cloud and is an obvious choice there (unless the “Preview” disclaimer is a blocker for some). Regarding thread 1, I suggest also scoping the discussion, at least initially, to Dremio only reading data, in order to avoid discussions about concurrent writes.
The options known to me are:
1. Glue catalog
2. Hive catalog
3. Hadoop catalog
4. Nessie catalog (using an undocumented
5. Roll your own support for the JDBC catalog, REST catalog or similar
6. Wait for Arctic to come to the “Software” version
Short comments on the options above:
1. Glue may be an option on AWS, but it is not an option for non-AWS clouds or on-prem deployments
2. I would rather avoid all the complexities of Hive (unless you’re already deeply invested in Hive, but not everyone is, and my company isn’t)
3. The lowest common denominator. For one, engines doing writes need an explicit lock manager to coordinate writes, since the catalog doesn’t provide one.
4. This might not even be officially supported, and as far as I can test, Dremio seems to treat it very similarly to the Hadoop catalog
5. This seems like a daunting task. It may also be in vain, since Dremio may add more catalogs in the future (Dremio’s roadmap isn’t open, so no one knows)
6. One could hope. I haven’t heard of any ambitions to add it to the “Software” version (Dremio’s roadmap isn’t open, so no one knows)
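For readers comparing these options on the writer side, here is a rough sketch of how the catalogs differ in Spark configuration. It only builds the `spark.sql.catalog.*` entries as a plain dict; the catalog name `lake`, the warehouse path and the helper name `spark_catalog_conf` are made up for illustration, and none of this says anything about what Dremio itself can read.

```python
# Sketch only: builds Spark conf entries for an Iceberg catalog.
# The helper name, catalog name and warehouse path are hypothetical.

# Catalogs selected through an explicit implementation class.
CATALOG_IMPLS = {
    "glue": "org.apache.iceberg.aws.glue.GlueCatalog",
    "nessie": "org.apache.iceberg.nessie.NessieCatalog",
    "jdbc": "org.apache.iceberg.jdbc.JdbcCatalog",
}

# Catalogs selected through the short `type` property.
BUILTIN_TYPES = {"hive", "hadoop", "rest"}


def spark_catalog_conf(name, kind, props=None):
    """Return the Spark conf entries for an Iceberg catalog called `name`."""
    prefix = f"spark.sql.catalog.{name}"
    conf = {prefix: "org.apache.iceberg.spark.SparkCatalog"}
    if kind in BUILTIN_TYPES:
        conf[f"{prefix}.type"] = kind
    elif kind in CATALOG_IMPLS:
        conf[f"{prefix}.catalog-impl"] = CATALOG_IMPLS[kind]
    else:
        raise ValueError(f"unknown catalog kind: {kind!r}")
    # Extra catalog properties such as warehouse, uri or lock-impl.
    for key, value in (props or {}).items():
        conf[f"{prefix}.{key}"] = value
    return conf


if __name__ == "__main__":
    conf = spark_catalog_conf(
        "lake", "hadoop", {"warehouse": "s3a://my-bucket/warehouse"}
    )
    for key, value in sorted(conf.items()):
        print(f"{key}={value}")
```

Switching catalogs on the Spark or Flink side is mostly a matter of these few properties, which is why the limiting factor in this thread is what Dremio supports.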
For context, in my company we’re currently:
1. Running Dremio OSS on-prem
2. Using a per-table Hadoop catalog
3. Using an AWS S3 remote file source (MinIO) and manually configuring folders as Iceberg tables
4. Using a custom-built lock manager to coordinate commits by writers
I’d like to do better, and there are a lot of good reasons to use a proper catalog. Item 4 above is a particular issue for me, since it makes writing to Iceberg risky. You need to be certain you have nailed the configuration of the LockManager because, unlike with other catalogs, it is still possible to write to the table if you forget or misconfigure the LockManager. We have corrupted more than one table this way.
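To make the failure mode concrete, here is a minimal sketch of the kind of guard a writer job could run before committing. It assumes the writer's catalog properties are available as a plain dict and that the lock manager is set through Iceberg's `lock-impl` catalog property; the function name, the exception and the class name `com.example.MinioLockManager` are made up for illustration. It only catches a missing or unexpected setting, not a lock manager that is configured but behaves incorrectly.

```python
# Sketch only: fail fast when a Hadoop-catalog writer has no lock manager set.
# The helper, the exception and the example class name are hypothetical.

LOCK_IMPL_KEY = "lock-impl"  # Iceberg catalog property naming the LockManager class


class MissingLockManagerError(ValueError):
    """Raised when a writer is about to commit without the expected lock manager."""


def require_lock_manager(catalog_props, expected_impl=None):
    """Return the configured lock manager class, or raise if it is missing or wrong."""
    impl = (catalog_props.get(LOCK_IMPL_KEY) or "").strip()
    if not impl:
        raise MissingLockManagerError(
            f"'{LOCK_IMPL_KEY}' is not set; refusing to write to the table"
        )
    if expected_impl is not None and impl != expected_impl:
        raise MissingLockManagerError(
            f"'{LOCK_IMPL_KEY}' is {impl!r}, expected {expected_impl!r}"
        )
    return impl


if __name__ == "__main__":
    props = {
        "type": "hadoop",
        "warehouse": "s3a://my-bucket/warehouse",
        "lock-impl": "com.example.MinioLockManager",
    }
    print(require_lock_manager(props, "com.example.MinioLockManager"))
```

A check like this has to be repeated in every engine and job that writes, which is exactly the kind of burden a proper catalog removes.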
TL;DR - can we do better than the Hadoop catalog and still use Dremio for reading Iceberg tables - now or in the near future? @Benny_Chow may have some insights/recommendations, but what are others in the community doing?