Number of Threads

Hi,

I have Dremio deployed to an EKS cluster and have a couple question:

  • Is there a way to configure the number of threads used?
  • Any best practices on this configuration?
  • What is the default number of threads used?

Thanks

@tylercmp

When you say threads, is it for the scans? # of threads depends on 3 factors

  • Number of input splits
  • Number of cores per executor (all executors should have same number of cores)
  • Estimated number of rows on the scan

We should leave it to Dremio to decide, are you facing a performance issue? If yes, kindly share the query profile

My cluster is set up with 2 coordinators and 3 executors. My use case is small, low latency queries at a high frequency. At a moderate load I was seeing good performance and low resource utilization (CPU and Memory). After increasing to a higher load, I’m seeing degraded performance, but poor resource utilization.

Moderate Load Query Profiles:
a5476b14-1dff-46be-ad3a-b1c1f56ba912.zip (14.0 KB)

High Load Query Profiles:
07bd3b77-cb6f-4168-a7f2-0d398426cbf5.zip (28.1 KB)

High Load CPU/Memory:

Additional Query Profiles:

Moderate Load Query Profiles:
544ace0c-e972-47ae-b691-a9e32d11106d.zip (12.1 KB)

High Load Query Profiles:
084ce920-1649-4333-9fd2-0bd943b2c7fe.zip (13.9 KB)

@tylercmp

The queries are completing in 135ms, what is your expectation?

2min.zip (2.5 MB)
@balaji.ramaswamy Can you help me optimize this query? I don’t understand the sqlprofile, Please! TKS!

@alex.shi Here is where the time is spent

01-xx-01 ARROW_WRITER 8s wait time on IO, are you writing results to local disk or a distributed storage like S3/HDFS? When writing to a distributed storage, this wait time is sometimes expected

01-xx-03 PROJECT has a SETUP time of 27s, this is due to the Gandiva expression getting evaluated, if you re run the query, it should go away. If it is still there then we can look at increasing the cache size but for that we need to make sure we have enough memory from the OS available

02-xx-00 HASH_PARTITION_SENDER and 02-xx-01 PROJECT ~ 34s on SETUP, the PROJECT is again due to the above Gandiva expression build time

There a few FILTER (like 06-xx-03 FILTER) operators that have taken ~ 5s each on SEUP for the same Gandiva setup time,

Phases 1,2 and 14 are high on parallelism and this cause some CPU contention as shown under sleep time, are there other queries running at the same time?

1 Like

Thank you for your feedback, and NOW I will answer your questions:

  1. I write results to HDFS.

  2. I have one coordination node and nineteen execution nodes.
    The Server physical memory of the coordination node is 125GB, and I set DREMIO_MAX_PERMGEN_MEMORY_SIZE_MB=110000, DREMIO_MAX_HEAP_MEMORY_SIZE_MB= 80000.
    The Server physical memory of the execution node is 256GB. and I set DREMIO_MAX_PERMGEN_MEMORY_SIZE_MB=120000, DREMIO_MAX_HEAP_MEMORY_SIZE_MB= 60000.
    I disabled swap, but when I run the top command to check the memory, I found that the physical memory usage was small and the virtual memory usage was large. Physical memory usage is low

  3. There is only one query at a time. I don’t know why is the concurrency high.

I look forward to your answers!@balaji.ramaswamy

I have to say that the lack of feedback/very slow response from Dremio.
I found the sqlprofile really hard to understand and not well documented.

@alex.shi

Apologies for the delay, was there a reason to have such a high heap memory? Dremio execution all happens on direct memory and hence on the executor you can set the below values

comment out below 2,
#DREMIO_MAX_MEMORY_SIZE_MB=
#DREMIO_MAX_PERMGEN_MEMORY_SIZE_MB=

enable below 2
DREMIO_MAX_HEAP_MEMORY_SIZE_MB= 8192
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=112640 # we can increase this later if required

The coordinator is heap intensive, so the below values,
comment out below 2,
#DREMIO_MAX_MEMORY_SIZE_MB=
#DREMIO_MAX_PERMGEN_MEMORY_SIZE_MB=

enable below 2
DREMIO_MAX_HEAP_MEMORY_SIZE_MB= 16384
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=16384

If the coordinator or executors runs out of heap or has long GC pauses, we can enable histograms to find out why and then take next steps

Thank you for your feedback!!
I reconfigured according to this parameter, and the performance is no better.
I’d like to ask you one more question,since HDFS input is slow, I have disabled the “enable local caching for hdfs” option to see if there is a way to specify caching to local disk instead of HDFS.Maybe that’ll give me a little bit of a performance boost.@balaji.ramaswamy

Caching should help with the wait times incurred