I’ve just installed another node to be an Executor, but it is not being recognized by the Coordinator. Both dremio.conf files point to my MASTER_INTERNAL_IP. I’m getting this message in server.log on the Executor:
2017-09-04 18:21:00,359 [main] INFO c.d.d.s.exec.MasterStatusListener - Waiting for master <MASTER_INTERNAL_IP>:45678
@allan.sene could you give this a shot with fully qualified host names that are resolvable on all nodes? Also just in case, here is the list of ports needed by Dremio.
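For anyone following along, here is a minimal sketch for checking that the host names resolve, to be run on every node. It only uses Python's standard socket module. The host names are the ones that appear in this thread; replace them with whatever is in your dremio.conf.

```python
import socket


def resolve_host(hostname):
    """Return the sorted IP addresses the local resolver gives for hostname.

    Returns an empty list when the name does not resolve on this machine.
    """
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Host names taken from this thread; adjust to your own cluster.
    for name in ["dremio-master-1", "dremio-executor-1"]:
        addresses = resolve_host(name)
        print(name, "->", addresses if addresses else "NOT RESOLVABLE")
```

If a name resolves to different addresses on different nodes (for example a loopback address on one of them), that is worth ruling out before looking further.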
I can telnet to all those ports from my executor node to the coordinator. Actually, as netstat -tunaup shows, both are connected (slave dremio <-> master zookeeper), but no data is being sent or received. =/
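An open TCP connection does not prove that ZooKeeper is actually answering. A rough way to check is ZooKeeper's "ruok" four-letter command, which a healthy server answers with "imok". The sketch below assumes the ZooKeeper on the master accepts four-letter commands (newer ZooKeeper releases require them to be whitelisted), so an empty reply does not necessarily mean the server is down. The function name zk_ruok is just an illustration.

```python
import socket


def zk_ruok(host, port=2181, timeout=3.0):
    """Send ZooKeeper's 'ruok' command and return the raw reply as text.

    Raises OSError (including timeouts) if the connection or read fails.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"ruok")
        chunks = []
        while True:
            # The server closes the connection after replying, so read until EOF.
            data = sock.recv(64)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")


if __name__ == "__main__":
    try:
        # Host name taken from the logs in this thread.
        print("reply:", repr(zk_ruok("dremio-master-1")))
    except OSError as exc:
        print("no reply from ZooKeeper:", exc)
```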
All nodes need to be able to allow communication on 45678 and 2181 to all other nodes. That could be the issue here. Also, have you tried using FQDNs instead of IPs?
Also, it looks like the network requirements in the docs are not clear regarding inbound vs. outbound needs; we’ll take a look.
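The port requirement above can be checked with something like the sketch below, run from the executor against the master. It is only a plain TCP connect test (the same thing telnet does), so it says nothing about whether Dremio itself is healthy. The host name and the two ports are the ones shown in the logs in this thread; adjust them if your setup differs.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


if __name__ == "__main__":
    master = "dremio-master-1"  # host name from the logs in this thread
    for port in (2181, 45678):  # ZooKeeper and the master port from the logs
        state = "open" if port_open(master, port) else "CLOSED/UNREACHABLE"
        print(f"{master}:{port} {state}")
```

Running the same check in the other direction (from the master towards the executor) helps tell apart a security group that only allows traffic one way from a service that is not listening at all.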
Yes, I’m using an FQDN to point to the Master… Both machines are reachable from each other (via ping <hostname>). I’m using the same SG for all Dremio nodes, but the Dremio Executor’s service doesn’t even open connections on those ports, unlike the master:
ubuntu@dremio-executor-1:~$ sudo netstat -lptu
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 *:ssh *:* LISTEN 1243/sshd
tcp6 0 0 [::]:ssh [::]:* LISTEN 1243/sshd
udp 0 0 *:bootpc *:* 1109/dhclient
If I restart the Dremio service on the Master, the executor logs the disconnection and connects again! Very strange:
2017-09-06 17:38:44,260 [Curator-Framework-0] ERROR org.apache.curator.ConnectionState - Connection timed out for connection string (dremio-master-1:2181) and timeout (5000) / elapsed (54009)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225) [curator-client-2.12.0.jar:na]
at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94) [curator-client-2.12.0.jar:na]
at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117) [curator-client-2.12.0.jar:na]
at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:835) [curator-framework-2.12.0.jar:na]
at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.12.0.jar:na]
at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.12.0.jar:na]
at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.12.0.jar:na]
at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_131]
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_131]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_131]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]
2017-09-06 17:39:09,118 [Curator-ConnectionStateManager-0] INFO c.d.s.coordinator.zk.ZKClusterClient - ZK connection state changed to RECONNECTED
2017-09-06 17:39:35,095 [main] INFO c.d.d.s.exec.MasterStatusListener - Waiting for master dremio-master-1:45678
2017-09-06 17:40:35,097 [main] INFO c.d.d.s.exec.MasterStatusListener - Waiting for master dremio-master-1:45678
2017-09-06 17:41:35,100 [main] INFO c.d.d.s.exec.MasterStatusListener - Waiting for master dremio-master-1:45678
I will try to redo the installation on another EC2 instance… maybe I screwed something…
Actually, I’ve tried some of the things pointed out in this thread, @yufeldman, like using FQDNs and configuring the same Inbound/Outbound Rules, as I showed above… I don’t know what I am missing.
One thing that differs in my case is the EC2 instance image: mine is Ubuntu. I’m gonna try using Amazon Linux and install with RPM files. Maybe the tar build is broken or something…