"limit 100" in query is not always respected

There are times when the “Preview” returns no rows. In those cases I have to use “Run”. That’s fine, but when I submit a query that includes a “limit 100”, sometimes Dremio submits a query to Snowflake that has no limit at all. That is, Dremio requests the entire table from Snowflake. This essentially breaks my ability to craft a query on some tables, because it takes forever for the query to return. Submitting the same query directly to Snowflake is nearly instant because it has the “limit 100”.

Why does Dremio do this? I can imagine it might have to for sources that can’t apply a limit themselves, e.g. tables backed by S3-stored CSV files. But in the case of Snowflake, this should never happen.

Is there some way to prevent Dremio from doing this?

@capnjosh

I see you have two questions:

#1 Why does Preview sometimes not return any results - This can happen because Preview runs against a sample of the data; if your filter matches no rows in that sample, Preview returns nothing. Do you have a profile?

#2 Dremio retrieves all records from Snowflake - This should not happen. What can happen is that Dremio reads a full batch of records: even though there is a LIMIT, the in-flight batch is read to completion, but that is still far fewer than all the records in the Snowflake table. Any chance you have a profile for this too?

I replied to the other thread about Preview sometimes not returning any results.

Regarding getting a profile for this “limit 100” thing, here is a profile:
37f3dcab-46f7-40d4-876d-369b44ca3177.zip (22.1 KB)

This is the query:

SELECT
cast(player_start_date as date) as player_start_date,
cast(event_date as date) as eventDate,
datediff(cast(event_date as date),cast(player_start_date as date)) as days_between
from "Snowflake-PQ3".PQ3.RAW."EVENT_GAMETRANSACTION_RAW"
limit 100

It returns the expected results, but I can see in Snowflake that Dremio submitted a query fetching all the rows from the table rather than just 100.

@capnjosh As you can see from the plan, the LIMIT is applied above the scan, so one batch is returned (see the number of batches in JDBC_SUB_SCAN), which is 3,968 records, and then the LIMIT is applied. Can you please send a profile for the query below?

SELECT player_start_date
from "Snowflake-PQ3".PQ3.RAW."EVENT_GAMETRANSACTION_RAW"
limit 100

Attached is the profile for this query:

53ad29c5-a5e7-46b6-828e-42087962f13f.zip (21.3 KB)

SELECT player_start_date
from "Snowflake-PQ3".PQ3.RAW."EVENT_GAMETRANSACTION_RAW"
limit 100

Of note, it executed and completed very quickly. Snowflake saw a query come through that included the “limit 100”:

-- this comes from the "query history" view in Snowflake
SELECT "EVENT_GAMETRANSACTION_RAW"."PLAYER_START_DATE"
FROM "PQ3"."RAW"."EVENT_GAMETRANSACTION_RAW" LIMIT 100

Whereas with this query:

SELECT
cast(player_start_date as date) as player_start_date,
cast(event_date as date) as eventDate,
datediff(cast(event_date as date),cast(player_start_date as date)) as days_between
from "Snowflake-PQ3".PQ3.RAW."EVENT_GAMETRANSACTION_RAW"
limit 100

Snowflake sees only this come through (no “limit”):

-- this comes from the "query history" view in Snowflake
SELECT "EVENT_GAMETRANSACTION_RAW"."PLAYER_START_DATE", "EVENT_GAMETRANSACTION_RAW"."EVENT_DATE"
FROM "PQ3"."RAW"."EVENT_GAMETRANSACTION_RAW"

It appears that Dremio decided to fetch all the raw data from Snowflake and then apply its own cast() and datediff() functions before applying the “limit”.
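Given that the bare single-column query above did get its “limit 100” pushed down, one workaround I might try (just a sketch; I haven’t verified that Dremio pushes this down) is to move the limit into a subquery beneath the casts, so only the limited scan has to cross the wire:

-- sketch: apply the limit before the casts, hoping the inner scan + limit gets pushed down
SELECT
cast(player_start_date as date) as player_start_date,
cast(event_date as date) as eventDate,
datediff(cast(event_date as date),cast(player_start_date as date)) as days_between
from (
select player_start_date, event_date
from "Snowflake-PQ3".PQ3.RAW."EVENT_GAMETRANSACTION_RAW"
limit 100
) as limited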

I’m seeing the same problem with other queries like this: a SELECT with some transformations on the columns plus a “group by” clause, e.g. removing double quotes, casting to a different data type, and then grouping by the resulting values.
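For illustration, a query of that shape might look like the following (the raw_event_type column is made up; the pattern is what matters):

-- hypothetical example of the pattern: strip double quotes, cast, then group by the result
SELECT
cast(replace(raw_event_type, '"', '') as varchar) as event_type,
count(*) as event_count
from "Snowflake-PQ3".PQ3.RAW."EVENT_GAMETRANSACTION_RAW"
group by cast(replace(raw_event_type, '"', '') as varchar)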

Dremio sends Snowflake a query that returns the relevant columns but all of the rows, and then Dremio applies the transformations and grouping itself. This results in very slow query times when Dremio is geographically far from Snowflake: it downloads all the raw data, when it would be dramatically faster to have Snowflake do the transformations and grouping.

The scary part is that, over time, a query that had been performant could suddenly become intensely slow, or a reflection could suddenly start taking hours to complete when previously it was fast.

@capnjosh It should not do an FTS (full table scan). As I said, at least one batch needs to complete, which in your case was ~4K records. Have you tried a similar query against Postgres, SQL Server, Oracle, or MySQL? What happens? The LIMIT is expected to be pushed down.
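For example, a comparable test against a hypothetical Postgres source might look like this (the source, schema, and table names here are placeholders):

SELECT
cast(player_start_date as date) as player_start_date,
cast(event_date as date) as eventDate,
datediff(cast(event_date as date),cast(player_start_date as date)) as days_between
from "Postgres-Test".public.event_gametransaction_raw
limit 100

You could then check pg_stat_activity or the Postgres logs to see whether the LIMIT made it into the pushed-down query.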

I haven’t tried this with any other database engines - Snowflake is where the vast majority of our data resides. I’ll see if I can get something set up and try it out.