[Fixed] Unable to start worker nodes

I set up a 4-node Dremio cluster (2 coordinators, 1 master and 1 non-master, on md5.2xl; and 2 executors on md5.4xl) with embedded ZK, which was up and running fine. I tried stopping and starting the Dremio daemon, and now I'm getting the following error:

Exception in thread "main" java.lang.NullPointerException
at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
at java.util.concurrent.ConcurrentHashMap.containsKey(ConcurrentHashMap.java:964)
at com.dremio.datastore.ReIndexer.isIndexed(ReIndexer.java:85)
at com.dremio.datastore.ReIndexer.put(ReIndexer.java:55)
at com.dremio.datastore.ReplayHandlerAdapter.put(ReplayHandlerAdapter.java:51)
at org.rocksdb.WriteBatch.iterate(Native Method)
at org.rocksdb.WriteBatch.iterate(WriteBatch.java:61)
at com.dremio.datastore.ByteStoreManager.replaySince(ByteStoreManager.java:295)
at com.dremio.datastore.ByteStoreManager.replayDelta(ByteStoreManager.java:280)
at com.dremio.datastore.CoreStoreProviderImpl.reIndexDelta(CoreStoreProviderImpl.java:256)
at com.dremio.datastore.CoreStoreProviderImpl.recoverIfPreviouslyCrashed(CoreStoreProviderImpl.java:206)
at com.dremio.datastore.LocalKVStoreProvider.start(LocalKVStoreProvider.java:131)
at com.dremio.dac.cmd.upgrade.Upgrade.run(Upgrade.java:177)
at com.dremio.dac.daemon.DremioDaemon.main(DremioDaemon.java:104)

The master node comes up despite the error above and eventually logs "Dremio Daemon Started as master". The worker nodes (the non-master coordinator and the executors), however, are stuck, even though the Dremio daemon process has started on those machines as well.

server.out on the worker nodes:

Mon Nov 26 08:22:29 UTC 2018 Starting dremio on ip-x-x-x-x.<aws-az>.compute.internal
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 124675
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) 4096
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

The Dremio UI shows only the master node.
I'm also seeing the following error in the UI: "Your Internet connection may be offline, or WebSockets to Dremio are being blocked." I'm unable to run any query; it fails with "No executors currently available."
Do you know what could be wrong here?

Can you share the dremio.conf for all nodes? The worker nodes need to point to the master's IP (to use its ZK). Also, are the AWS firewall ports open? https://docs.dremio.com/deployment/system-requirements.html#network

confs.zip (6.3 KB)

I also opened the following ports as inbound rules in the security group: 9047, 22, 31010, 45678, 2181.

Never mind, the issue was with the VPN. The security group's internal communication traffic was not whitelisted inside the VPN.
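
For anyone hitting something similar: a quick way to see whether traffic between nodes is being blocked is to try plain TCP connections from a worker node to the master on the ports mentioned above. The following is only a minimal sketch in Python, not anything shipped with Dremio; the function names `check_port` and `check_ports` are made up for illustration, and the port list is simply the one from this thread. A successful connection only shows the port is reachable, not that Dremio itself is healthy.

```python
import socket

# Ports listed earlier in this thread (assumption: your deployment uses the same ones).
THREAD_PORTS = [9047, 22, 31010, 45678, 2181]


def check_port(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def check_ports(host, ports=THREAD_PORTS, timeout=3.0):
    """Map each port to True/False depending on whether it is reachable on host."""
    return {port: check_port(host, port, timeout) for port in ports}
```

Run on a worker node, something like `check_ports("<master-ip>")` returns a dict of port to reachability. If every port comes back False even though the security group rules look right, the block is probably somewhere else on the network path, as it was with the VPN here.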

Thanks for reporting back!