Dremio uses one core for processing

mfarmer · January 10, 2020, 11:20pm

So I have seen this previously too - This query did a pull from snowflake in about 18 mins and the majority of the rest of the time (before it failed with lack of disk space), it chugged away with just one core processing - This has the effect of making it much slower than snowflake.

I don’t really understand why the one core only is used - but it’s causing me real performance issues…

c8b61204-88e1-48ab-aeae-6cc134ab9d4e.zip (32.2 KB)

balaji.ramaswamy · January 11, 2020, 2:55pm

@mfarmer

The reason we are executing on only one core is because you have only one reflection Parquet file and in that only one group which decides the number of splits which in turn decides the number of parallel threads. Number of rowgroups upto 70% of the total number of cores on that node will decide the degree of parallelism. I see you query is accelerated so when you create the reflection you need to decide on the right partition column so you get more splits, below is a best practices document for creating reflections

mfarmer · January 11, 2020, 3:04pm

thanks for the white paper - I will read that.

You said I had one reflection parquet, but the job details show 3 - how can I tell which one causes the single core usage ?

balaji.ramaswamy · January 11, 2020, 9:22pm

@mfarmer

There are 4 scans

JDBC scan for F_STAY always single threaded
RAW reflection on snowflake.KLDW.PUBLIC.D_DATE has just 1 split so single threaded
RAW reflection on KLDW.LOOKUP.BRAND_COMP_SET_LOOKUP has just 1 split so single threaded
RAW reflection on snowflake.KLDW.PUBLIC.D_PROPERTY has 23 splits so can have parallel threads but unfortunately has only 45K records and we spin up the next thread only at 100K

I just had a closer look on why your query is slow as initially you wanted to know the reason for single threaded. Your query is slow not because of the 3 reflections (#2, #3 and #4) but because of the JDBC scan #1 which is scanning 3,032,397,184 records.

If you run this SQL on the source directly, how much does it take. Preferably run it from a client installed on the Dremio Coordinator

SELECT “OCCUPANCY_DT_ID”, “ROOM_COUNT”, “KL_TA_CODE”, “KL_TA_CATEGORY_ID”, “KL_PROP_ID”, “KL_RATE_ID”, “KL_SOB_ID”, “KL_CHANNEL_COST_AMT”, “KL_TRANSACTION_COST_AMT”, “KL_AMENITY_COST_AMT”, “KL_LOYALTY_COST_AMT”, “KL_GUEST_PAID_REVENUE_AMT”, “KL_HOTEL_COLLECTED_REVENUE_AMT”, “KL_COPE_REVENUE_AMT”, “KL_LOYALTY_FLAG”, “KL_NO_SHOW_FLAG”, “KL_CANCELLATION_FLAG”
FROM “KLDW”.“PUBLIC”.“F_STAY”) AS “F_STAY”
WHERE “KL_SOB_ID” IS NOT NULL

mfarmer · January 11, 2020, 10:15pm

Thanks for the response. The full query direct on snowflake takes around 2.25 hrs - about an hour less then dremio.

The jdbc portion on snowflake of the dremio query takes just under 18 mins to complete.

Martin

mfarmer · January 11, 2020, 10:23pm

Here’s the history from snowflake (for the portion that dremio sent it) - Note this is using a SMALL warehouse

I

balaji.ramaswamy · January 12, 2020, 2:48am

@mfarmer

How often does the JDBC data change? Can you create a reflection for that dataset and keep at one day refresh?

mfarmer · January 12, 2020, 10:15am

It changes multiple times a day for three weeks a month.

I tried a reflection but gave in after 12 hrs of execution.

However an “unload” from snowflake only took 2 hrs and the reflection on that was pretty quick so that may be a way forward

Topic		Replies	Views
Iceberg query performance with many parquet files Dremio University	12	1366	July 22, 2023
Exceptionally poor response time ( > 700 x slower) compared to underlying database	9	2082	December 16, 2019
Understanding how the parquet reader works and a suggestion	10	3457	May 23, 2018
Different results from using the same join query with reflection vs without reflection	3	918	September 3, 2021
Reflection used only when sql splitted in subqueries	12	2159	November 9, 2018

Dremio uses one core for processing

Related topics