Hello!
Currently, we are exploring the use of Dremio v1.24 in Kubernetes, and we are using the official Helm chart (dremio-cloud-tools/charts/dremio_v2 at master · dremio/dremio-cloud-tools · GitHub) to deploy it. Our main focus right now is to use dbt to transform the data in our data lake and store it in Iceberg tables, which will be accessible to data analysts.
To do so, we have explored dbt-dremio and setting up Iceberg tables in Dremio, but we have encountered some issues:
While working with Iceberg tables, we followed this documentation as a starting point: Dremio, focusing specifically on Amazon S3 data sources. We have an S3 bucket that contains the data and metadata folders for an Iceberg table. When we create an Iceberg table in Dremio from this location, however, the table appears to be empty and we cannot see the underlying data. We have observed the following, though: if we create another Iceberg table in Dremio using the syntax CREATE TABLE x AS SELECT * FROM y, where y is a Parquet dataset located in the data folder of the Iceberg table in the S3 bucket, we do see the data.
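For reference, here is a minimal sketch of the workaround just described, with made-up source and table names:

```sql
-- Hypothetical names: "s3_lake" is the S3 source, and warehouse/orders is the
-- folder holding the existing Iceberg table's data/ and metadata/ folders.
-- Reading the Parquet files under data/ and materializing them into a new
-- table does show the rows:
CREATE TABLE "s3_lake"."analytics"."orders_copy" AS
SELECT * FROM "s3_lake"."warehouse"."orders"."data";
```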
We were wondering if you could provide us with some guidance to solve the issues mentioned above and help us achieve our goal of transforming the data through dbt, storing it in Iceberg tables, and exposing it through Dremio.
Could you confirm the Dremio version you’re testing? v1.2.x is pretty old. For DML and Iceberg capabilities in Dremio, please try a newer version, at minimum v23.x or above.
@mirelagrigoras My team and I have been working with dbt-dremio for a while now, and I’m also having quite a bit of trouble with Iceberg in Dremio 24 and 24.1 releases.
We have a tooling stack we’ve come to rely upon, consisting of the Elyra Jupyterlab image, dbt-ipy, and dbt-dremio. It lets us be very fast and collaborative with our pipeline and data model work and directly submit things into Kubeflow for automation.
The current issue for me is that, at seemingly random times, dbt materializations fail in ways that appear to depend on our choice of filesystem. If we materialize to a physical dataset volume on an NFS-based k8s PV, mapped to NAS storage, we get:
Recordsets that build successfully in Dremio memory but never appear in storage, resulting in java.io.FileNotFoundException,
Incremental recordsets (e.g. insert queries) that build, but whose _dbt_temp tables are never dropped; the query hangs in the RUNNING state until I restart the master node,
Recordsets that partially build, but the metadata JSON or Avro files used by Iceberg never appear, leaving the target recordset corrupted.
BUT! If we materialize to an S3 (Minio) source, everything works fine. Fresh models and incremental inserts behave as expected – the performance isn’t on par with direct access to the local filesystem, but there is no strange behavior.
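For comparison, this is roughly how a model can be pointed at the S3 (Minio) source – a sketch only, assuming dbt-dremio’s object_storage_source / object_storage_path model configs and made-up names:

```sql
-- models/example_model.sql (sketch; source and path names are placeholders)
-- Assumes dbt-dremio's object_storage_source / object_storage_path configs;
-- swapping "minio" for the NAS-backed source is where the failures start.
{{ config(
    materialized = 'table',
    object_storage_source = 'minio',
    object_storage_path = 'lake.marts'
) }}

select *
from {{ ref('staging_model') }}
```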
Until Dremio 24 and its more serious commitment to Iceberg, I could materialize data on my nice, fast NAS filesystem all day without trouble. Now it’s a roll of the dice. I have one instance processing a colossal number of records on a daily incremental ingest, and it works great. A different instance with an identical configuration but different data dies 80% of the time.
Are Dremio CTAS and UPDATE only meant to materialize to S3 targets now? Is this something to do with dbt-dremio specifically? I can run the same queries manually and get the same random and irritating results, so I don’t think so. In either case, what value is there in local/NAS file storage if I can’t actually query my recordsets there? This is most frustrating.
@mirelagrigoras we were very successful up to Dremio 23 with our existing tooling. However, emboldened by that success, we have let the complexity of our models grow over time, so your mileage may vary.
For instance, the problem I describe above involves a notebook where we query our existing target dataset for date gaps, then use pyarrow to query our ingest cache (a Redshift database we dump new ingests into, defined as a Dremio source) for those dates. We pass the dates to dbt as runtime variables and have it build a query that cherry-picks those dates into _latest tables and does an incremental insert into the existing target. This removes our dependency on Redshift as the authoritative archive and saves time and money on AWS bandwidth charges, since we don’t have to materialize the whole millions-of-records archive every day.
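Roughly, the incremental model looks something like this – a simplified sketch where all source, table, and column names are made up, and the gap dates come in through a dbt var:

```sql
-- Simplified sketch; all names are placeholders.
{{ config(materialized = 'incremental') }}

select *
from {{ source('ingest_cache', 'daily_events') }}   -- the Redshift-backed Dremio source
where event_date in ({{ var('missing_dates') }})     -- gap dates passed via --vars

{% if is_incremental() %}
  -- skip dates already present in the existing target
  and event_date not in (select distinct event_date from {{ this }})
{% endif %}
```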
We weren’t doing anything so fancy in Dremio 23, not that we couldn’t have. In any case, these filesystem errors are not limited to incremental materializations, so if I were able, I’d probably downgrade my version. I fear, however, that the newer Iceberg version that 24.1 uses has locked me into the future.
We were wondering if there is any chance you could outline the steps involved in configuring and setting up the dbt-Dremio integration with Iceberg? Any insights into the architecture and required dependencies would be immensely helpful. As you have successfully implemented this setup, we are eager to learn about any best practices or lessons learned along the way. Are there any specific considerations or pitfalls to avoid during the setup and subsequent usage of dbt-dremio with Iceberg?
We would greatly appreciate any resources, documentation, tutorials, or sample code that you found useful during your implementation. These would assist us (and others, for sure) in gaining a deeper understanding of the integration and effectively troubleshooting any issues that may arise.
Hello again!
As mentioned above, we are in fact using version 24. Any guidance on how we could fix the issues in the original post regarding dbt-dremio and Iceberg?
When using dbt-dremio (GitHub - dremio/dbt-dremio: dbt (data build tool) adapter for Dremio), we encountered an error when generating and executing the dbt queries. It appears that Dremio does not support DML (Data Manipulation Language), and we could not get past this point.
While working with Iceberg tables, we followed this documentation as a starting point: Dremio, focusing specifically on Amazon S3 data sources. We have an S3 bucket that contains the data and metadata folders for an Iceberg table. When we create an Iceberg table in Dremio from this location, however, the table appears to be empty and we cannot see the underlying data. We have observed the following, though: if we create another Iceberg table in Dremio using the syntax CREATE TABLE x AS SELECT * FROM y, where y is a Parquet dataset located in the data folder of the Iceberg table in the S3 bucket, we do see the data.
It sounds like you are creating a table on top of a table instead of promoting the table. Can you promote it? dbt doesn’t have a promote function, but you could probably have an on-run-start macro/hook run ALTER PDS xxx REFRESH METADATA AUTO PROMOTION.
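For example, something like this (the dataset path is a placeholder):

```sql
-- Placeholder path; promotes the folder containing the Iceberg data/metadata
-- folders and refreshes Dremio's metadata for it. This could run from a dbt
-- on-run-start hook before the models execute.
ALTER PDS "s3_lake"."warehouse"."orders" REFRESH METADATA AUTO PROMOTION;
```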