Iceberg: Choosing a catalog when using "Dremio Software" with other compute engines

Iceberg currently supports a lot of catalog implementations for common engines like Spark and Flink. I was wondering what the best catalog would be when Iceberg and Dremio are used together in a multi-engine environment.

This thread is inspired by the discussions in:

  1. Can Dremio be co-used with other compute engine for modifying Iceberg table?
  2. Incremental data reflection with Iceberg

I’d like to scope it to the “Dremio Software” edition only, since Arctic is available in Dremio Cloud and an obvious choice there (unless the “Preview” disclaimer is a blocker for some). With regard to thread 1, I suggest at least initially also scoping to Dremio only reading data, in order to avoid discussions about concurrent writes.

The options known to me are:

  1. Glue catalog
  2. Hive catalog
  3. Hadoop catalog
  4. Nessie catalog (using an undocumented services.nessie.remote-uri config in dremio.conf)
  5. Roll your own support for the JDBC catalog, REST catalog or similar
  6. Wait for Arctic to come to the “Software” version

Short comments on the options above:

  1. Glue may be an option, but it isn’t one for non-AWS clouds or on-prem deployments
  2. I would rather avoid all the complexities of Hive (unless you’re deeply invested in Hive already, but not everyone is - my company isn’t either)
  3. The lowest common denominator. For one, engines doing writes need an explicit lock manager to coordinate writes, since the catalog doesn’t provide one
  4. This might not even be officially supported, and as far as I can test, Dremio-wise it seems to be treated very similarly to the Hadoop catalog
  5. This seems like a daunting task. It may also be in vain, since Dremio may add more catalogs in the future (Dremio’s roadmap isn’t open, so no one knows)
  6. One could hope. I haven’t heard of any ambitions of it being added to the “Software” version (again, the roadmap isn’t open, so no one knows)
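To make the trade-offs concrete, here is a rough sketch of how the options above differ from an engine’s point of view, expressed as illustrative catalog configuration maps (loosely modeled on Spark/pyiceberg-style Iceberg catalog properties; the hosts, ports, and warehouse paths are placeholders, not real endpoints):

```python
# Illustrative catalog configs -- property names are in the style of Iceberg
# engine configuration, but treat them as a sketch, not copy-paste values.
CATALOG_CONFIGS = {
    # 1. Glue: AWS-only, catalog state lives in the Glue Data Catalog
    "glue": {"type": "glue", "warehouse": "s3://my-bucket/warehouse"},
    # 2. Hive: needs a running Hive Metastore service (thrift)
    "hive": {"type": "hive", "uri": "thrift://metastore:9083"},
    # 3. Hadoop: no service at all, just a warehouse path -- which is exactly
    #    why concurrent writers need an external lock manager
    "hadoop": {"type": "hadoop", "warehouse": "s3a://my-bucket/warehouse"},
    # 4./5. Nessie or a REST catalog: an HTTP service that owns commits
    "rest": {"type": "rest", "uri": "http://catalog-host:8181"},
}

def needs_external_locking(name: str) -> bool:
    """Only the Hadoop catalog delegates write coordination to the clients."""
    return CATALOG_CONFIGS[name]["type"] == "hadoop"
```

The structural difference is the last one: every option except the Hadoop catalog puts a service in charge of the commit, so writers cannot bypass coordination.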

For context, in my company we’re currently:

  1. Running Dremio OSS on-prem
  2. Using a per-table Hadoop catalog
  3. Using an AWS S3 remote file source (MinIO) and manually configuring folders as Iceberg tables
  4. Using a custom-built lock manager to coordinate commits by writers

I’d like to do better, and there are a lot of good reasons to use a proper catalog. Especially item 4 above is an issue for me, since it makes writing to Iceberg risky. You need to be certain you have nailed the LockManager configuration, because, unlike with other catalogs, if you forget or misconfigure the LockManager, writing to the table still succeeds. We have corrupted more than one table this way.
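To illustrate why a forgotten or misconfigured lock manager is so dangerous: the lock is purely advisory, so a writer that never consults it still succeeds. A minimal sketch, using a hypothetical file-based lock (our real setup is DynamoDB-lock-manager-style over Zookeeper; nothing below is that implementation):

```python
# Sketch: an advisory lock only protects writers that actually use it.
import os
import tempfile

class AdvisoryLock:
    """Hypothetical file-based advisory lock (create-exclusive as the lock)."""
    def __init__(self, path: str):
        self.path = path

    def acquire(self) -> bool:
        try:
            fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True
        except FileExistsError:
            return False  # someone else holds the lock

    def release(self) -> None:
        os.remove(self.path)

def careless_writer_can_write(table_dir: str) -> bool:
    # A writer that never asks the lock manager still succeeds -- nothing in
    # the Hadoop catalog itself stops it. This is the corruption window.
    with open(os.path.join(table_dir, "v2.metadata.json"), "w") as f:
        f.write("{}")
    return True

table_dir = tempfile.mkdtemp()
lock = AdvisoryLock(os.path.join(table_dir, ".lock"))
assert lock.acquire()                         # well-behaved writer holds lock
assert careless_writer_can_write(table_dir)   # bad actor writes anyway
lock.release()
```

The point of the sketch: correctness depends entirely on every writer cooperating, which is exactly the property a real catalog service removes.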

TL;DR - can we do better than the Hadoop catalog and still use Dremio for reading Iceberg tables - now or in the near future? @Benny_Chow may have some insights/recommendations, but what are others in the community doing?

Hi @wundi,

Thanks for your great post describing your line of thinking very clearly. I have very similar thoughts and questions. Did you ever receive an answer, or how did you proceed?

Many thanks, Tim

@wundi and Dr Tim,

Let me discuss this internally and get back here

Thanks
Bali

Hi Tim,

Frankly, we are still running in exactly the way I described. However, a lot has changed over the last couple of years, so we are hoping to make the switch soon. We have narrowed our catalog options to a few:

  1. REST catalog
    1.1 The Tabular rest catalog on top of a JDBC catalog
    1.2 Polaris
  2. Nessie

The common thread across all the technologies we are using is that they all support the REST catalog, or soon will, so that is where we’re heading. Based on Dremio announcing support for the REST catalog, we had hoped that support would have landed in the recently released 25.1, but that is not the case. And since the roadmap isn’t open, we’re not sure when that’ll happen.
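For reference, the attraction of the REST catalog is that every engine speaks the same small HTTP surface. A sketch of the main endpoints per the public Iceberg REST catalog OpenAPI spec (the host and any prefix are placeholders):

```python
# Endpoint sketch for the Iceberg REST catalog spec; "catalog-host" is a
# placeholder, and real clients discover a prefix via the /v1/config call.
BASE = "http://catalog-host:8181/v1"

def config_url() -> str:
    # Engines call this first to discover defaults/overrides and any prefix.
    return f"{BASE}/config"

def list_namespaces_url(prefix: str = "") -> str:
    p = f"{prefix}/" if prefix else ""
    return f"{BASE}/{p}namespaces"

def load_table_url(namespace: str, table: str, prefix: str = "") -> str:
    # Multi-level namespaces are joined with the %1F unit separator per spec.
    p = f"{prefix}/" if prefix else ""
    ns = "%1F".join(namespace.split("."))
    return f"{BASE}/{p}namespaces/{ns}/tables/{table}"
```

Any engine that can build and call these paths can share the catalog, which is what makes it the obvious multi-engine convergence point.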

If we want to move now and have Dremio support, the only option is to go for Nessie as the catalog. Dremio does support Nessie, but the REST catalog API support in Nessie is still experimental.
So we’d either have to test Nessie’s REST API support extensively, or do a hybrid of options 1 and 2 above, using Nessie as the underlying catalog for the Tabular REST catalog implementation. However, that would also need to be tested rather well before I’d be comfortable running it in production.

The reason we haven’t adopted Nessie already is that not all of our tech stack supports it, and we are reluctant to take on the additional GC maintenance required to run Nessie properly. It might not be a big deal, but it is additional complexity I’d like to avoid unless strictly necessary.

The last and rather obvious option is to wait until REST catalog support is released by Dremio. Currently, we do have some non-production technologies that don’t support the Hadoop catalog. In order to use those, someone hacked a bit on the Tabular catalog to support exposing a Hadoop catalog as a REST catalog - one REST catalog per Hadoop catalog. This obviously doesn’t scale to many tables and is a pain to manage, but it does serve as a workaround for us for now.
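For anyone curious what such a bridge involves, here is a hedged sketch of just the table-listing part: scanning a Hadoop-catalog warehouse directory and answering in the Iceberg REST ListTablesResponse shape. The directory-layout assumptions are ours, and a real bridge would also have to serve table metadata, config, and commits:

```python
# Sketch: treat a file-based Hadoop catalog warehouse as the backing store
# for a read-only, REST-shaped table listing. Layout assumption: a directory
# is an Iceberg table if it contains a metadata/ subdirectory.
import os

def list_tables(warehouse: str, namespace: str) -> dict:
    """Return an Iceberg-REST-shaped ListTablesResponse for one namespace."""
    ns_dir = os.path.join(warehouse, namespace)
    identifiers = [
        {"namespace": [namespace], "name": d}
        for d in sorted(os.listdir(ns_dir))
        if os.path.isdir(os.path.join(ns_dir, d, "metadata"))
    ]
    return {"identifiers": identifiers}
```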

Hope that helps @tid


@balaji.ramaswamy, were you able to check on this?

@wundi, thanks for your detailed opinion on various options and this helped me explore a lot of options and these are my thoughts.

The Tabular REST catalog doesn’t seem to be actively maintained, has limited docs, and there are high-severity CVEs open against its Docker images, which hampers adoption in an enterprise.

To adopt Nessie, I don’t see a use case right away other than the REST Iceberg backend, and I feel the same as you about the maintenance - not to mention that maintenance of Iceberg tables is complicated with Nessie.

Nessie - Apache Iceberg™.

Polaris seems to be promising, but it is yet to have its first release, and with no UI support it will be cumbersome to manage access control.

A plain JDBC catalog would have been easy if Dremio Software had support for it.

For what it’s worth, we may end up using Iceberg on S3 with a lock manager, but I would like to know your implementation pitfalls or any open source lock manager you can recommend that’s worth trying out!

Many thanks, Steen! @wundi

I lost track of the discussion because of other topics. There doesn’t seem to be much official communication about the catalog topic from Dremio, unfortunately. I heard some rumours at the first “Data Lakehouse Bytes” meetup in Germany recently that something is coming with 25.2, and that Dremio is leaning towards Polaris, with generic support for Iceberg REST later.

Would love to hear something official though. @bali maybe? :wink:

Many thanks, Tim

@tid

Dr Tim, I am discussing this internally and will get back on this thread

Thanks
Bali

I just tested the newly released 25.2.0, and it seems to work with our custom version of the Tabular REST catalog (only custom because we added support for Hadoop catalog tables).

It looks like generic REST catalog support to me, even though the release notes specifically mention Polaris and Unity.

Support for the REST catalog seems to be there now in the new 25.2.0 release (if you set the plugins.restcatalog.enabled support key to true).
We’ll explore that to confirm it works the way we expect. We’re able to list our tables by exposing our file-based Hadoop catalogs via a modified Tabular REST catalog, so initial trials are positive.

Polaris seems to be promising but it is yet to have its first release and with no UI support it will be cumbersome to manage Access Control.
Polaris seems promising, but there are some blockers and concerns for us:

  1. We run on-prem and use MinIO, which Polaris doesn’t currently support

  2. I am a bit concerned about the potential performance overhead of credential vending

  3. Initial bootstrapping seems a bit on the complicated side (principals, catalog/principal roles, RBAC etc.)

  1. is obviously a showstopper for us. 2. we’d have to test and estimate the impact on performance, if any. 3. is only based on skimming the documentation, so it may turn out to be rather trivial in practice.
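On concern 2, my understanding of credential vending (hedged, based on the REST spec’s LoadTableResult shape): the catalog can return short-lived storage credentials in the response’s config map, which means extra work per table load and credential refreshes mid-query. A sketch, with S3 property names taken from Iceberg’s S3 FileIO conventions but best treated as illustrative:

```python
# Sketch: extract vended S3 credentials from a LoadTableResult-shaped dict.
# Key names follow Iceberg S3 FileIO properties; treat them as illustrative.
from typing import Optional

def vended_s3_credentials(load_table_result: dict) -> Optional[dict]:
    cfg = load_table_result.get("config", {})
    keys = ("s3.access-key-id", "s3.secret-access-key", "s3.session-token")
    if all(k in cfg for k in keys):
        return {k: cfg[k] for k in keys}
    return None  # catalog did not vend credentials; client uses its own

# The overhead concern in one sentence: every table load may now involve the
# catalog minting short-lived credentials (e.g. via STS or an equivalent),
# and clients must refresh them if they expire mid-query.
```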

For what it’s worth, we may end up using Iceberg on S3 with a lock manager but would like to know your implementation pitfalls or any open source lock manager you can recommend that’s worth trying out!!

I would advise against it, if at all possible. We implemented our own custom locking, based on the AWS DynamoDB lock manager and backed by Zookeeper (because we’re already using that for other purposes), but it took some time before we were confident that the issues we were having weren’t with our locking implementation. And even when it works properly, every client has to “opt in” and do the right thing for the locking to be correct. One bad actor, willingly or not, is enough to corrupt the table. You can reasonably control that in a super small organization, but it becomes difficult rather quickly.

We decided to do the above when the catalog landscape (and support) was vastly different. I’d go for any catalog if starting from scratch today. One new candidate to the list would be this one:

It still seems like early days, but they do seem to suffer from some of the same issues we have running on-prem. I haven’t tested their catalog in practice, but I thought I’d mention it.