So I have seen this previously too - This query did a pull from snowflake in about 18 mins and the majority of the rest of the time (before it failed with lack of disk space), it chugged away with just one core processing - This has the effect of making it much slower than snowflake.
I don’t really understand why the one core only is used - but it’s causing me real performance issues…
c8b61204-88e1-48ab-aeae-6cc134ab9d4e.zip (32.2 KB)
The reason we are executing on only one core is because you have only one reflection Parquet file and in that only one group which decides the number of splits which in turn decides the number of parallel threads. Number of rowgroups upto 70% of the total number of cores on that node will decide the degree of parallelism. I see you query is accelerated so when you create the reflection you need to decide on the right partition column so you get more splits, below is a best practices document for creating reflections
thanks for the white paper - I will read that.
You said I had one reflection parquet, but the job details show 3 - how can I tell which one causes the single core usage ?
There are 4 scans
- JDBC scan for F_STAY always single threaded
- RAW reflection on snowflake.KLDW.PUBLIC.D_DATE has just 1 split so single threaded
- RAW reflection on KLDW.LOOKUP.BRAND_COMP_SET_LOOKUP has just 1 split so single threaded
- RAW reflection on snowflake.KLDW.PUBLIC.D_PROPERTY has 23 splits so can have parallel threads but unfortunately has only 45K records and we spin up the next thread only at 100K
I just had a closer look on why your query is slow as initially you wanted to know the reason for single threaded. Your query is slow not because of the 3 reflections (#2, #3 and #4) but because of the JDBC scan #1 which is scanning 3,032,397,184 records.
If you run this SQL on the source directly, how much does it take. Preferably run it from a client installed on the Dremio Coordinator
SELECT “OCCUPANCY_DT_ID”, “ROOM_COUNT”, “KL_TA_CODE”, “KL_TA_CATEGORY_ID”, “KL_PROP_ID”, “KL_RATE_ID”, “KL_SOB_ID”, “KL_CHANNEL_COST_AMT”, “KL_TRANSACTION_COST_AMT”, “KL_AMENITY_COST_AMT”, “KL_LOYALTY_COST_AMT”, “KL_GUEST_PAID_REVENUE_AMT”, “KL_HOTEL_COLLECTED_REVENUE_AMT”, “KL_COPE_REVENUE_AMT”, “KL_LOYALTY_FLAG”, “KL_NO_SHOW_FLAG”, “KL_CANCELLATION_FLAG”
FROM “KLDW”.“PUBLIC”.“F_STAY”) AS “F_STAY”
WHERE “KL_SOB_ID” IS NOT NULL
Thanks for the response. The full query direct on snowflake takes around 2.25 hrs - about an hour less then dremio.
The jdbc portion on snowflake of the dremio query takes just under 18 mins to complete.
Here’s the history from snowflake (for the portion that dremio sent it) - Note this is using a SMALL warehouse
How often does the JDBC data change? Can you create a reflection for that dataset and keep at one day refresh?
It changes multiple times a day for three weeks a month.
I tried a reflection but gave in after 12 hrs of execution.
However an “unload” from snowflake only took 2 hrs and the reflection on that was pretty quick so that may be a way forward