Jobs are not shared on executor nodes

Hi there,

We are facing a little problem with cluster deployment and the community edition.

We have set up 3 nodes on AWS: 1 coordinator and 2 executors (m5d.4xlarge)

In the node activity panel they are recognized, as you can see below.

However, when we want to run a job, only one node will take the job (sorry, I can’t upload more than one picture…)

Note: Both nodes are able to take jobs, but they never share one

Is there something we missed in the cluster configuration?

If you can share a query profile, we can see exactly which node is used for what and how much. Additionally, maybe the jobs were so trivial they didn’t require the horsepower of 2 full nodes (like regular queries)? Typically, reflection jobs are resource-intensive; maybe you can test one of those to see if it uses both nodes?

Hi Anthony and thanks for your quick reply

We tried both. It’s mainly for Data Reflections that we expect Dremio to use several nodes, but it doesn’t.
How would you like me to share the query profile?

You can attach the zip onto the community - https://www.dremio.com/tutorials/share-query-profile-dremio/

Here you go da54bc50-2f16-425f-a2f3-2898bc723e2a.zip (27.4 KB)

It does seem like only 1 executor node (172-31-32-56.eu-west-1.compute.internal) and only 1 thread was used.

Can you share the specs of each node along with the dremio.conf? Are these shared or dedicated nodes? Some things seem a bit off: 7 min planning time, and another 22 min to run. Is this a large table?

This query was on a table which contains around 8 million rows, which is obviously not the biggest…

The machines are shared hardware instances; here are the specs & conf:

Coordinator node: t2.2xlarge

paths: {
  # the local path for dremio to store data.
  local: "/var/lib/dremio"

  # the distributed path Dremio data including job results, downloads, uploads, etc
  #dist: "pdfs://"${paths.local}"/pdfs"
  dist: "s3a://xxxxxxx/dremio"
}

services: {
  coordinator.enabled: true,
  coordinator.master.enabled: true,
  coordinator.master.embedded-zookeeper.enabled: true,
  executor.enabled: false
}

zookeeper: "localhost:"${services.coordinator.master.embedded-zookeeper.port}
zk.client.session.timeout: 1800000

Executor nodes: 2 x m5d.4xlarge

Configuration:

paths: {
  # the local path for dremio to store data.
  local: "/var/lib/dremio"

  # the distributed path Dremio data including job results, downloads, uploads, etc
  #dist: "pdfs://"${paths.local}"/pdfs"
  dist: "s3a://xxxxxxx/dremio"
}

services: {
  coordinator.enabled: false,
  coordinator.master.enabled: false,
  executor.enabled: true
}

zookeeper: "172.31.45.54:2181"
zk.client.session.timeout: 1800000

Configuration seems correct. Although the full table is ~8 million rows, what was the result set of the query? I see 407 records.

One thing I just noticed: it seems you are using an RDBMS and Dremio is pushing the query down. The retrieval of that query from the source may be the holdup here, as all Dremio processing after the data retrieval is done within seconds (hence the low executor usage), while waiting for the data from the source takes ~22 min.

So using an RDBMS won’t allow us to use several nodes, even for data reflections?

Using an RDBMS will definitely still allow you to use a distributed network of nodes; the point is that you may not need to. In your particular situation: 1. A pushdown is being used to leverage the underlying system, and 2. The resulting dataset is so small that Dremio can process it so quickly it doesn’t need to use many resources.

Okay, got it.
But the thing is, when we try with bigger tables, we often face a loss of connectivity on the working node.

Here is another example:
8ecefba6-aa93-4165-aa0f-90341b55b36c.zip (13.8 KB)

We thought this query was big enough to use the distributed network of nodes, but still only one node is used.

The behavior here is a bit different from the previous one. I see ip-172-31-43-207.eu-west-1.compute.internal crashed, possibly due to OOM. Can you check the logs on that node, please?

Indeed, I do find this in the logs:

java.lang.OutOfMemoryError: GC overhead limit exceeded
Dumping heap to /var/log/dremio/java_pid8936.hprof ...
Heap dump file created [4037560839 bytes in 14.221 secs]
Exception in thread "thread-stats-collector" java.lang.OutOfMemoryError: GC overhead limit exceeded
at sun.management.ThreadImpl.getThreadUserTime(ThreadImpl.java:293)
at sun.management.ThreadImpl.getThreadUserTime(ThreadImpl.java:285)
at com.dremio.sabot.exec.ThreadsStatsCollector.addUserTime(ThreadsStatsCollector.java:77)
at com.dremio.sabot.exec.ThreadsStatsCollector.run(ThreadsStatsCollector.java:54)
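A "GC overhead limit exceeded" error on an executor usually means the JVM heap is undersized for the workload. A minimal sketch of what raising the limits could look like, assuming the standard package layout where memory is configured in conf/dremio-env (the variable names below are the ones shipped in that file; the values are illustrative assumptions, not a recommendation — an m5d.4xlarge has 64 GB of RAM):

```shell
# conf/dremio-env on each executor (illustrative values, not a recommendation)

# Total memory Dremio may use; Dremio splits this between heap and direct memory.
DREMIO_MAX_MEMORY_SIZE_MB=49152            # ~48 GB of the node's 64 GB

# Or set the split explicitly; GC-overhead OOMs point at the heap side.
DREMIO_MAX_HEAP_MEMORY_SIZE_MB=8192        # JVM heap
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=40960     # off-heap buffers used by query execution
```

Restart the dremio service after changing these; if the variable names differ in your version, check the comments shipped in your own dremio-env.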

I have the same problem when I try to create a big dataset.

@eduardoslopes

What version of Dremio are you on?

Thanks
Bali