Only one executor is working in Cluster

My master is running, but I am able to connect only one executor. The other executors throw the following error:

Catastrophic failure occurred. Exiting. Information follows: Failed to start services, daemon exiting.
com.dremio.common.exceptions.UserException: The source ["__jobResultsStore"] is currently unavailable. Info: [[Message{level=ERROR, msg=Failure to create directory /data/DremioData/pdfs/results.}]].

My path setting in dremio.conf is:
paths: {
  # the local path for dremio to store data.
  local: "/data/DremioData/"

If I stop executor 1 and start executor 2 with the same settings, it works. But if I then start executor 1, it throws the same error.

Did you check that you have the right permissions on the /data/DremioData directory, so that the user that starts Dremio can write to it?
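A quick way to test this is a small Python sketch like the one below, which tries to create and remove a throwaway directory under the data path. The path comes from the error message above; the function name and probe prefix are just illustrative. It only tells you something if you run it as the same OS user that starts Dremio.

```python
import os
import tempfile


def can_create_subdir(parent: str) -> bool:
    """Return True if the current user can create (and remove) a directory under parent."""
    try:
        # mkdtemp raises an OSError subclass if parent is missing or not writable.
        probe = tempfile.mkdtemp(prefix="dremio_probe_", dir=parent)
    except OSError:
        return False
    os.rmdir(probe)
    return True


if __name__ == "__main__":
    # Path taken from the error message in this thread.
    path = "/data/DremioData"
    print(path, "writable:", can_create_subdir(path))
```

If this prints `writable: True` on every node, plain file permissions are probably not the cause.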

Yes, I have full permissions, and it works when I run the first executor. It is always the 2nd executor that fails. If I change the sequence, then again the last executor fails with the same error.

Last or second? Did you try three executors?

3 nodes: the first one is master and coordinator,
the 2nd is the first executor,
the 3rd is the 2nd executor.
So I am able to connect only 1 executor (irrespective of sequence).

Do you think paths.local can be the reason behind it? This paths.local setting doesn't exist in the executor config.
Executor config:

services: {
  coordinator.enabled: false,
  coordinator.master.enabled: false,
  executor.enabled: true
}

zookeeper: "xx.xxx.xx.xx:2181"

Your coordinator/master and executor dremio.conf should be pretty much the same except roles such as:

How many executors have you tested in your environment for a single master?

We support as many as needed.
The reason I was asking about three is to see whether it is the second executor that is failing for you or the last one. When you have two executors, second == last, so it is hard to differentiate.

OK, so actually only one executor works, irrespective of the sequence in which the executors are started.
Do I need to use paths.dist instead of paths.local?

By default paths.dist is a derivative of paths.local.
Like: dist: "pdfs://"${paths.local}"/pdfs"
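To make that default concrete, here is a tiny Python sketch of the string concatenation it expresses. The function name is made up for illustration, and it only shows how the value is composed from the pieces quoted above, not how Dremio itself resolves the setting.

```python
def default_dist(local: str) -> str:
    """Compose the default dist value as "pdfs://" + paths.local + "/pdfs"."""
    return "pdfs://" + local + "/pdfs"


# With the paths.local value from this thread:
print(default_dist("/data/DremioData/"))  # prints pdfs:///data/DremioData//pdfs
```

This is why the failing directory in the error message sits under /data/DremioData even though only paths.local was set explicitly.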

What’s your dremio.conf on both executors?

Same settings on both executors:
paths: {
  # the local path for dremio to store data.
  local: "/data/DremioData/"

  # the distributed path Dremio data including job results, downloads, uploads, etc
  dist: "pdfs://"${paths.local}"/pdfs"
}

services: {
  coordinator.enabled: false,
  coordinator.master.enabled: false,
  executor.enabled: true
}

zookeeper: "xx.xxx.236.41:2181"

It is really very strange. Could you look at:

  1. server.out on failing executor node
  2. server.log on master coordinator

To see if there are more clues there.
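If the logs are long, a rough Python sketch like this can pull out the lines that mention a Java exception or error class. The regular expression and function name are my own, and the log file locations are deliberately left as command-line arguments since they depend on your install.

```python
import re
import sys
from typing import Iterable, List

# Matches fully qualified Java class names ending in Exception or Error,
# e.g. java.net.UnknownHostException or com.dremio.common.exceptions.UserException.
EXCEPTION_RE = re.compile(r"\b(?:[a-z_]\w*\.)+\w*(?:Exception|Error)\b")


def exception_lines(lines: Iterable[str]) -> List[str]:
    """Return the lines (without trailing newline) that mention a Java exception class."""
    return [line.rstrip("\n") for line in lines if EXCEPTION_RE.search(line)]


if __name__ == "__main__":
    # Usage: python scan_logs.py <path to server.out> <path to server.log>
    for path in sys.argv[1:]:
        with open(path, errors="replace") as fh:
            for hit in exception_lines(fh):
                print(f"{path}: {hit}")
```

It is only a filter, so the surrounding stack trace still has to be read in the log itself.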


Great. As you suggested, I checked the executor logs and found that the executors were not able to see each other due to a DNS issue. The error in the executor 1 log was:
java.net.UnknownHostException: executor2

Since there is some issue with DNS, I updated /etc/hosts on all nodes (master and all executors), added all the IPs, and then restarted the master and the executors one by one. Now all the executors are working fine and are visible in the Dremio UI.
Thanks a lot for all the support.
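For anyone hitting the same error, a minimal Python sketch for checking name resolution on each node could look like this. Only the hostname executor2 comes from the log above; the function name and the injectable `lookup` parameter are illustrative choices, and any other hostnames you pass in are your own.

```python
import socket
from typing import Callable, Dict, Iterable, Optional


def resolve_all(
    hostnames: Iterable[str],
    lookup: Callable[[str], str] = socket.gethostbyname,
) -> Dict[str, Optional[str]]:
    """Map each hostname to an IPv4 address, or None if it does not resolve."""
    results: Dict[str, Optional[str]] = {}
    for name in hostnames:
        try:
            results[name] = lookup(name)
        except OSError:
            # socket.gaierror (a subclass of OSError) is raised for unknown hosts.
            results[name] = None
    return results
```

Running something like `resolve_all(["executor2"])` on every node, with the full list of node hostnames, should return an address for each name once /etc/hosts or DNS is set up correctly. Any `None` value points at the node name that still cannot be resolved from that machine.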