Hi,
I have Dremio deployed to an EKS cluster and have a couple of questions:
- Is there a way to configure the number of threads used?
- Any best practices on this configuration?
- What is the default number of threads used?
Thanks
When you say threads, is it for the scans? The number of threads depends on 3 factors
It is best to leave this to Dremio to decide. Are you facing a performance issue? If yes, kindly share the query profile.
My cluster is set up with 2 coordinators and 3 executors. My use case is small, low latency queries at a high frequency. At a moderate load I was seeing good performance and low resource utilization (CPU and Memory). After increasing to a higher load, I'm seeing degraded performance, but poor resource utilization.
Moderate Load Query Profiles:
a5476b14-1dff-46be-ad3a-b1c1f56ba912.zip (14.0 KB)
High Load Query Profiles:
07bd3b77-cb6f-4168-a7f2-0d398426cbf5.zip (28.1 KB)
High Load CPU/Memory:
Additional Query Profiles:
Moderate Load Query Profiles:
544ace0c-e972-47ae-b691-a9e32d11106d.zip (12.1 KB)
High Load Query Profiles:
084ce920-1649-4333-9fd2-0bd943b2c7fe.zip (13.9 KB)
The queries are completing in 135ms. What is your expectation?
2min.zip (2.5 MB)
@balaji.ramaswamy Can you help me optimize this query? I don't understand the SQL profile. Please! Thanks!
@alex.shi Here is where the time is spent
01-xx-01 ARROW_WRITER: 8s wait time on IO. Are you writing results to local disk or to distributed storage like S3/HDFS? When writing to distributed storage, this wait time is sometimes expected.
01-xx-03 PROJECT has a SETUP time of 27s. This is due to the Gandiva expression being evaluated; if you rerun the query, it should go away. If it is still there, we can look at increasing the cache size, but for that we need to make sure enough memory is available from the OS.
02-xx-00 HASH_PARTITION_SENDER and 02-xx-01 PROJECT: ~34s on SETUP. The PROJECT time is again due to the Gandiva expression build time described above.
There are a few FILTER operators (like 06-xx-03 FILTER) that have taken ~5s each on SETUP, due to the same Gandiva setup time.
Phases 1, 2 and 14 are high on parallelism, and this causes some CPU contention, as shown under sleep time. Are there other queries running at the same time?
Thank you for your feedback, and NOW I will answer your questions:
I write results to HDFS.
I have one coordination node and nineteen execution nodes.
The physical memory of the coordination node is 125GB, and I set DREMIO_MAX_PERMGEN_MEMORY_SIZE_MB=110000, DREMIO_MAX_HEAP_MEMORY_SIZE_MB=80000.
The physical memory of each execution node is 256GB, and I set DREMIO_MAX_PERMGEN_MEMORY_SIZE_MB=120000, DREMIO_MAX_HEAP_MEMORY_SIZE_MB=60000.
I disabled swap, but when I ran the top command to check memory, I found that physical memory usage was low and virtual memory usage was high.
There is only one query at a time. I don't know why the parallelism is high.
I look forward to your answers! @balaji.ramaswamy
I have to point out the lack of feedback and very slow response from Dremio.
I found the SQL profile really hard to understand and not well documented.
Apologies for the delay. Was there a reason to have such a high heap memory? Dremio execution happens entirely in direct memory, so on the executors you can set the values below.
Comment out these two:
#DREMIO_MAX_MEMORY_SIZE_MB=
#DREMIO_MAX_PERMGEN_MEMORY_SIZE_MB=
Enable these two:
DREMIO_MAX_HEAP_MEMORY_SIZE_MB=8192
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=112640 # we can increase this later if required
The coordinator is heap intensive, so use the values below.
Comment out these two:
#DREMIO_MAX_MEMORY_SIZE_MB=
#DREMIO_MAX_PERMGEN_MEMORY_SIZE_MB=
Enable these two:
DREMIO_MAX_HEAP_MEMORY_SIZE_MB=16384
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=16384
If the coordinator or the executors run out of heap or have long GC pauses, we can enable histograms to find out why and then take next steps.
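As a rough sanity check on the values suggested above, the sketch below adds up heap and direct memory for each node type and compares the total with the physical memory reported earlier in the thread. The helper names are hypothetical, and treating the 125GB and 256GB figures as multiples of 1024 MB is an assumption. A JVM also uses memory beyond heap and direct buffers (thread stacks, metaspace and so on), so the headroom shown is an upper bound, not an exact figure.

```python
MB_PER_GB = 1024  # assumption: the "GB" figures in the thread are multiples of 1024 MB


def jvm_footprint_mb(heap_mb: int, direct_mb: int) -> int:
    """Return the combined heap + direct memory ceiling in MB."""
    return heap_mb + direct_mb


def headroom_mb(physical_gb: int, heap_mb: int, direct_mb: int) -> int:
    """Return physical memory left over after heap + direct memory, in MB."""
    return physical_gb * MB_PER_GB - jvm_footprint_mb(heap_mb, direct_mb)


# Values taken from the thread: (physical GB, heap MB, direct MB)
nodes = {
    "executor": (256, 8192, 112640),
    "coordinator": (125, 16384, 16384),
}

for name, (physical_gb, heap_mb, direct_mb) in nodes.items():
    total = jvm_footprint_mb(heap_mb, direct_mb)
    spare = headroom_mb(physical_gb, heap_mb, direct_mb)
    print(f"{name}: heap+direct = {total} MB, headroom = {spare} MB")
```

Under these assumptions both node types keep a large margin for the OS, which is consistent with the comment that the direct memory value can be increased later if required.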
Thank you for your feedback!!
I reconfigured according to these parameters, but the performance is no better.
I'd like to ask you one more question: since HDFS input is slow, I have disabled the "enable local caching for hdfs" option to see if there is a way to specify caching to local disk instead of HDFS. Maybe that'll give me a little bit of a performance boost. @balaji.ramaswamy
Caching should help with the wait times incurred