Query Workload Distribution and Cost Calculation in Dremio

JoiceJacob · January 8, 2025, 3:24am

Hi Team,

I have recently added a worker node to our Dremio setup, making the configuration two worker nodes and one coordinator node. Currently, the setup is as follows:

Server 1: Worker and Coordinator
Server 2: Worker

When executing a query, I expected the workload to be distributed across both worker nodes. However, upon reviewing the query profile, it seems the workload is being processed by only one worker. I have attached the query profile for your reference. Could you help me understand why the workload is not being distributed across both workers?
229561ab-ed25-412f-946c-a7d77bb7c9e3.zip (13.5 KB)

Additionally, I understand that the classification of queries as large or small is based on their cost, with queries exceeding the query threshold considered large. Could you provide insights into how the query cost is calculated? Is there a specific benchmark or formula used for this calculation, and how can I determine the appropriate threshold for identifying large queries?

Looking forward to your guidance.

Thanks in Advance

JoiceJacob · January 9, 2025, 4:55am

Hi team,

Can you please update on above points ?

Thanks in Advance

Benny_Chow · January 9, 2025, 9:41pm

Hi Joice, this query uses a JDBC source which only supports single threaded reads. If you had a source with many data “splits” like an Iceberg table, then the reads can be parallelized across your two node executor cluster.

Query cost is related to row count of each of the physical operators.

JoiceJacob · January 17, 2025, 4:50am

Hi @Benny_Chow

Thanks for your response.

For simple queries like SELECT with aggregate functions or SELECT *, we often observe that the row count in the physical plan matches the actual number of rows when the query is executed. However, when using WHERE, GROUP BY, or JOIN clauses, the row count displayed in the physical plan can differ from the actual number of rows produced by the query.

How is the row count calculated at each step of the physical plan, and is there a specific formula or methodology behind it?

Thanks in advance

JoiceJacob · January 20, 2025, 4:24am

Hi @Benny_Chow

Can you please provide me an update on this

Thanks in Advance

Benny_Chow · January 20, 2025, 6:34am

Hi @JoiceJacob

Please see the bottom note on this page:

The basic idea behind this query cost is that it should reflect the cost used for parallelization. If the query scans many tables or a table with many splits, it would be beneficial to parallelize this scan operation. A query with higher cost should get an engine cluster with a higher total CPU count.

If you are looking at the verbose query profiles, you don’t want to get this query cost confused with the logical and physical planning costs where there’s a totally different cost model that considers cpu, io, memory and network for various query trees explored by the CBO.

JoiceJacob · January 21, 2025, 7:43pm

Hi @Benny_Chow

Thanks for your response,

Based on my understanding, the number of output records from a query result corresponds to the rowcount in the physical plan. Please correct me if I’m mistaken.

For instance, when executing a query like SELECT * FROM table1, if the query result contains 20 rows, the rowcount in the physical plan also shows 20. However, when running a query such as SELECT name, SUM(amount) FROM table GROUP BY name, the query result contains 6 rows, but the rowcount in the physical plan is 8. Additionally, I noticed cases where the query result with a WHERE condition contains only 1 row, but the rowcount in the query profile is shown as 1.5.

Can you explain the reasons for these differences?

Thanks in Advance

JoiceJacob · January 24, 2025, 4:44am

Hi @Benny_Chow

Can you please provide me an update on this

Thanks in Advance

Benny_Chow · January 25, 2025, 3:52am

the number of output records from a query result corresponds to the rowcount in the physical plan.

This is true only if you are referring to the Screen operator.

the query result contains 6 rows, but the rowcount in the physical plan is 8.

The row count in the physical plan is just an estimate. It’s not surprising that the actual is 6.

Topic		Replies	Views
Distributing query workload on remote executor nodes	0	794	March 12, 2021
How to scale up dremio cluster?	4	592	September 5, 2023
Dremio UI (Run) not scalling while query execution	8	82	March 11, 2025
Dremio unbalance worker load	5	977	December 16, 2020
Number of Threads	11	2161	January 18, 2022

Query Workload Distribution and Cost Calculation in Dremio

Related topics