I’m running a Dremio cluster with separate master and executor nodes (one master, one executor). The data directory configuration (in dremio.conf) is identical for both, pointing to /opt/dremio-data/metadata/data, and the subfolders present on the executor include pdfs, cm, db, spill, and security.
My observation is that, during execution of large queries, Dremio exhibits the following resource usage pattern:
- RAM fills up first while CPU utilization remains low.
- Only after RAM is saturated and spilling to disk begins (visible in the spill directory) does CPU utilization increase significantly.
- IOPS also spike at the moment the system switches to spill mode.
- The cm and db folders (on the executor) only show KB-size increments while executing queries, consistent with cache metadata changes.
- I have confirmed the spill behavior aligns with Hash Aggregation Spilling | Dremio Documentation.
- In the executor’s dremio.conf, I added separate cache folders for fs and db (see the sketch after this list), but the CPU still isn’t utilized before RAM fills up; without specifying separate cache folders, the same results are observed.
- Another observation: whether Dremio runs as a single node or in a cluster, the CPU is still not utilized before RAM fills up, and the spill size only grows once there is not enough space in RAM.
- I tried different sets of queries and data; the CPU is still not utilized.
- My goal: I want to utilize the CPU as much as possible, irrespective of whether RAM is being filled. I know a simple query won’t utilize the CPU; trust me, my queries are very complex, my data sets are 17 GB and 12 GB, and my queries include multiple aggregate functions.
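For reference, here is a sketch of the cache settings I added to the executor’s dremio.conf (the key names follow Dremio’s C3 cloud-cache configuration; the paths are placeholders for the separate folders I created):

```
# Sketch only: separate C3 cache locations for the db and fs stores.
# Paths below are example placeholders, not my exact mount points.
services.executor.cache.path.db: "/opt/dremio-data/cache/db"
services.executor.cache.path.fs: ["/opt/dremio-data/cache/fs"]
```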
This is my dremio-data path:
```
dremio@Dremio-Executor-01:/opt/dremio-data/metadata/data$ du -h --max-depth=1
1.7G  ./pdfs
219M  ./cm
16K   ./spill
106M  ./db
20K   ./security
2.0G  .
```
Could this be a CPU problem? I use Azure VMs: https://instances.vantage.sh/azure/vm/f4s-v2?currency=USD (dremio-executor) and https://instances.vantage.sh/azure/vm/d2s-v3?currency=USD (dremio-master). Will there be any difference due to throughput?
Configurations
RAM - Executor (2 GB heap, 5 GB direct)
Cores - 4
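(For completeness, and assuming I have the dremio-env variable names right, those memory limits correspond to the following executor settings, in MB:)

```
# dremio-env on the executor: 2 GB heap, 5 GB direct (values from above)
DREMIO_MAX_HEAP_MEMORY_SIZE_MB=2048
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=5120
```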
@Harish-AK CPU and memory consumption are unrelated. It is possible your query has memory/IO-intensive operators like Scan. When memory fills, operators like join, agg, and sort spill to disk. The CPU-intensive operators are joins, aggs, etc.
The C3 cache (the fs and db folders) exists to keep scans from going to the object store, so I/O is quick. It is possible you are getting 100% C3 hits, which frees up the slicing threads (cores) much sooner, and hence your CPU utilization is low.
Would you be able to send me the job profile of the job that was running when you monitored memory, CPU, and disk? That should answer all your questions.
I tested the same queries on an AWS instance and got different results there.
Here the CPU hit 100% even though RAM wasn’t filled.
3 Scripts
RAM - 8 GB (Heap - 2 GB, direct - 5 GB)
data - 40 GB
Cores - 4
CPU - c6a.xlarge (pricing and specs on Vantage)
AWS instance graph (execution started at 15:17)
With the Azure instance (same queries and data),
observations (10th Sep 2025):
Dremio Query Profiles
Query 1
Query 2
Query 3
I cancelled the queries manually.
If you require any additional information, please let me know and I will be happy to provide it.
@Harish-AK RAM - 8 GB (heap - 2 GB, direct - 5 GB) could be undersized for the query plan; again, I need the job profile to investigate how CPU and memory are being used.
Job profiles for the queries:
Query 1
Query 3
Then I ran a different set of queries to stress the CPU; these included hashing, math functions, string manipulation, and case logic. Still no difference: the CPU is not utilized.
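For illustration, one of the stress queries looked roughly like this (a sketch only: my_source.big_table and its columns are placeholders, not my actual schema, and the hashing variants are omitted):

```sql
-- CPU-heavy pattern: math, string manipulation, and case logic over a full scan.
SELECT
  COUNT(*)                                          AS row_cnt,
  SUM(SQRT(POWER(col_a, 2) + POWER(col_b, 2)))      AS math_load,
  MAX(UPPER(REGEXP_REPLACE(col_c, '[0-9]', 'x')))   AS string_load,
  SUM(CASE WHEN MOD(col_a, 7) = 0
           THEN LN(ABS(col_b) + 1) ELSE 0 END)      AS case_load
FROM my_source.big_table
```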
@Harish-AK These are not job profiles
To download a Dremio job profile, follow these steps:
1. Navigate to the Jobs page: in the Dremio UI, click the “Jobs” icon in the side navigation bar.
2. Select the desired job: on the Jobs page, locate and click the specific job whose profile you want to download. This opens the Job Details page.
3. Download the profile: at the bottom-left corner of the Job Details page, click the “Download Profile” button.
Send us the zip file without extracting it.
Here are the job profiles for the queries I ran:
Query 1.zip (29.7 KB)
Query 3.zip (58.1 KB)
Query 2.zip (24.1 KB)
@Harish-AK
I analyzed job ID# 173c543b-9fc8-0c1a-cc92-6a5c97de5f00
Here are the reasons for the slowness and low CPU utilization. The two TEXT_SUB_SCANs (05-xx-02 and 04-xx-02) are I/O bound (not much CPU), since a single thread has to read the 65 million records for the self join.
But even though the HASH_JOIN says INNER (03-07), we can see it is an expanding join; see PROJECT 03-xx-06: 20 million records going into the INNER JOIN explode to 300 million. Can you please check whether the join conditions are correct?
The queries don’t matter; I just want to utilize the maximum CPU. I can see that the TEXT_SUB_SCAN, PROJECT, and HASH_JOIN wait times are long and the max record counts are huge. Is this due to the kind of queries I execute?
- Can you tell me what kind of queries utilize maximum CPU?
- Should I reduce my table size to put more load on the CPU? On average my table size is 12 GB.
- My goal is to stress the CPU as much as possible.
@Harish-AK The wait times are all I/O. As I said, the HASH_JOIN is an expanding one that is spilling to disk and causing heavy I/O wait times.
Consider:
- Moving to the Iceberg format
- Checking why the join is expanding; a quick check is sketched below
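For the second point, this is one way to look for join-key duplication (a sketch; my_source.big_table and join_key are placeholders for your table and join column):

```sql
-- If the join key repeats on both sides, an INNER JOIN multiplies the
-- matching rows, which is how 20 million records can explode to 300 million.
-- List the heaviest keys on one side:
SELECT join_key, COUNT(*) AS dup_count
FROM my_source.big_table
GROUP BY join_key
HAVING COUNT(*) > 1
ORDER BY dup_count DESC
LIMIT 20
```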