Pulling large datasets from Snowflake into Dremio

I have a 700 million-row table (73GB) in Snowflake that I would like to copy into Dremio so it’s stored in a Nessie catalog and backed by object storage. I’m hoping to do some rough comparisons between Snowflake and Dremio on the same data.

When I run a simple “create table as” statement that selects from the Snowflake data source and inserts into a table in my Nessie catalog, the query on Snowflake completes in 25 minutes, and then I see the Dremio host start downloading from Snowflake. However, the download speed seems to hit an upper limit of 40 Mbps (see the screenshot below). The host is in Oracle Cloud and demonstrates much faster download speeds at other times.
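For reference, the statement is basically this shape (the catalog, schema, and table names below are placeholders, not my actual ones):

```sql
-- CTAS from the Snowflake source into an Iceberg table in the Nessie catalog.
-- "Snowflake" is the source name as configured in Dremio; everything else is a placeholder.
CREATE TABLE nessie.sales.big_table AS
SELECT *
FROM Snowflake."ANALYTICS_DB"."PUBLIC"."BIG_TABLE";
```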

Then, after exactly 6 hours of downloading, it threw this error:
Source 'Snowflake' returned error 'JDBC driver internal error: Max retry reached for the download of #chunk6855 (Total chunks: 10623) retry=7, error=net.snowflake.client.jdbc.SnowflakeSQLException: JDBC driver encountered communication error. Message: Error encountered when downloading a result chunk: HTTP status=400.

Here is the Grafana chart showing CPU and network speed (CPU is indeed small: 2 vCores):

This seems really slow. Is this expected, or are there any well-known configs that would result in this kind of behavior?

Or, is Dremio really not built to ingest data via SQL queries from a remote data source like Snowflake? I guess I haven’t really seen examples of this kind of use case.

I’m running dremio-oss v25.1.1-202409260159070462-716c0676

@capnjosh Let’s start with the profile; I’d like to understand where the time is being spent. Would you be able to send the profile of the failed job?

@capnjosh JDBC is not an ideal data-loading tool for large data sets. A more common approach is to use Snowflake COPY INTO to export the data to an S3 location and then use Dremio COPY INTO to load that data. Both of those statements take advantage of scalable implementations, which will greatly increase your throughput and decrease the ingestion time.
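Roughly, the two steps look like this (the stage, bucket, and table names are placeholders, and the target table is assumed to already exist in your Nessie catalog with a matching schema):

```sql
-- 1) In Snowflake: unload the table to an existing S3 external stage as Parquet.
--    (@my_s3_stage and the table name are placeholders.)
COPY INTO @my_s3_stage/big_table/
  FROM analytics_db.public.big_table
  FILE_FORMAT = (TYPE = PARQUET)
  HEADER = TRUE;  -- keep the original column names in the Parquet files

-- 2) In Dremio: load those Parquet files into the Nessie-backed Iceberg table.
--    (Assumes the bucket is reachable through an S3 source named "s3_source".)
COPY INTO nessie.sales.big_table
  FROM '@s3_source/my-bucket/big_table/'
  FILE_FORMAT 'parquet';
```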

@caseykarst I think that confirms what I suspected.

If I want to migrate data out of Snowflake, I need to take that two-step approach rather than trying to “select” it directly from a Snowflake data source inside Dremio.

So querying data from Snowflake via Dremio should come with a caveat: don’t pull too much data, as that isn’t the intended use. If you need to pull more than 1GB or so of data from Snowflake, you either accept that it will take a long time, or you dump the data from Snowflake into an external stage and then do a COPY INTO from that S3 bucket into Dremio (or just create a view over that S3 bucket).
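For the view option, I’m picturing something like this (the S3 source, bucket, and space names are placeholders, and it assumes the Parquet folder has been promoted/formatted as a dataset in the S3 source):

```sql
-- Query the unloaded Parquet files in place instead of copying them,
-- by defining a view over the folder in the S3 source.
CREATE VIEW my_space.big_table_view AS
SELECT *
FROM s3_source."my-bucket"."big_table";
```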

Am I correct in that conclusion?

Using CTAS to move 73GB of data is not a good approach. You would end up shuffling and sorting this data in memory many times just to reproduce a similar physical layout that you already have logically in Snowflake. Plus, the read from Snowflake is not parallelized.

I would explore a batch operation using COPY INTO, or an incremental streaming approach with Snowflake CDC or the Kafka Iceberg connector.