Executor does not connect to Coordinator

allan.sene · September 4, 2017, 8:05pm

Hi, guys!

I’ve just installed another node to be an Executor, but it is not being recognized by the Coordinator. Both dremio.conf files point to my MASTER_INTERNAL_IP. I’m getting this message in server.log on the Executor:

2017-09-04 18:21:00,359 [main] INFO c.d.d.s.exec.MasterStatusListener - Waiting for master <MASTER_INTERNAL_IP>:45678

My Master is logging something like this:

2017-09-04 19:37:31,092 [FABRIC-6] INFO c.d.services.fabric.FabricServer - [FABRIC]: Channel closed /<MASTER_INTERNAL_IP>:45678 <--> /<SLAVE_INTERNAL_IP>:43408 (control server)

but the Slave port sometimes changes randomly (to 4340*).

netstat -tunaup is giving me

tcp6 0 0 SLAVE_INTERNAL_IP:41548 MASTER_INTERNAL_IP:2181 ESTABLISHED 1536/java

I’m using EC2 on Amazon. I’ve already opened these ports for connections on both nodes, but the Executor is still not visible.

What am I missing?

can · September 5, 2017, 7:02pm

@allan.sene could you give this a shot with fully qualified host names that are resolvable on all nodes? Also just in case, here is the list of ports needed by Dremio.
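A quick way to test the "resolvable on all nodes" part is to run a small resolver check on each machine. The following is only a sketch: the function name `resolve` is made up for illustration, and the host names are whatever you pass on the command line (for example the coordinator's and executor's FQDNs).

```python
import socket
import sys


def resolve(hostname):
    """Return the sorted IP addresses a host name resolves to, or [] if it does not resolve."""
    try:
        infos = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # Host names are supplied by the caller; nothing is hard-coded here.
    for host in sys.argv[1:]:
        addrs = resolve(host)
        print(host, "->", ", ".join(addrs) if addrs else "NOT RESOLVABLE")
```

Running it with the same host names on every node should print the same internal addresses everywhere; a name that resolves on one node but not on another points at /etc/hosts or DNS rather than at Dremio.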

allan.sene · September 5, 2017, 9:06pm

Thanks for replying!

I can telnet all those ports from my executor node to the coordinator. Actually, as netstat -tunaup shows, both are connected (slave dremio <-> master zookeeper), but with no send/received signals. =/

Here goes my Security Group configuration:

can · September 5, 2017, 9:48pm

All nodes need to be able to allow communication on 45678 and 2181 to all other nodes. That could be the issue here. Also, have you tried using FQDNs instead of IPs?

Also, looks like the network requirements on the docs are not clear regarding inbound vs. outbound needs, we’ll take a look.
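As a rough illustration of the check described above, a plain TCP connect test can be run from each node against every other node. This is a sketch, not a Dremio tool: `port_open` is a hypothetical helper, and the port list only contains the two ports that appear in the logs in this thread (2181 and 45678).

```python
import socket
import sys

# Ports seen in the logs earlier in this thread; adjust if your dremio.conf differs.
PORTS_TO_CHECK = (2181, 45678)


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False


if __name__ == "__main__":
    # Usage: python check_ports.py <host> [<host> ...]
    for host in sys.argv[1:]:
        for port in PORTS_TO_CHECK:
            state = "open" if port_open(host, port) else "UNREACHABLE"
            print(f"{host}:{port} {state}")
```

Note that a successful connect only shows that something is listening and that the security group lets the packet through in that direction; it has to be repeated from every node to cover both inbound and outbound rules.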

allan.sene · September 6, 2017, 5:59pm

Yes, I’m using an FQDN to point to the Master… Both machines are reachable from each other (via ping <hostname>). I’m using the same SG for all Dremio nodes, but the Dremio Executor’s service does not even open listening ports, unlike the master:

 ubuntu@dremio-executor-1:~$ sudo netstat -lptu
 Active Internet connections (only servers)
 Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
 tcp        0      0 *:ssh                   *:*                     LISTEN      1243/sshd
 tcp6       0      0 [::]:ssh                [::]:*                  LISTEN      1243/sshd
 udp        0      0 *:bootpc                *:*                                 1109/dhclient

On Master:

ubuntu@dremio-master-1:~$ sudo netstat -lptu
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 *:ssh                   *:*                     LISTEN      1287/sshd
tcp6       0      0 [::]:31010              [::]:*                  LISTEN      16942/java
tcp6       0      0 [::]:2181               [::]:*                  LISTEN      16942/java
tcp6       0      0 [::]:45678              [::]:*                  LISTEN      16942/java
tcp6       0      0 [::]:ssh                [::]:*                  LISTEN      1287/sshd
tcp6       0      0 [::]:9047               [::]:*                  LISTEN      16942/java
udp        0      0 *:bootpc                *:*                                 1108/dhclient
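The two listings can also be compared mechanically. Below is a minimal sketch that extracts the TCP ports a given program is listening on from `netstat -lptu` text; the function name `listening_ports` is hypothetical, and it assumes the column layout shown in the output above.

```python
def listening_ports(netstat_output, program="java"):
    """Return the sorted TCP ports on which `program` is in LISTEN state."""
    ports = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expect: Proto Recv-Q Send-Q Local Foreign State PID/Program
        if len(fields) < 7 or not fields[0].startswith("tcp"):
            continue
        if fields[5] != "LISTEN" or not fields[-1].endswith("/" + program):
            continue
        # The local address looks like "[::]:2181" or "*:ssh"; keep numeric ports only.
        port = fields[3].rsplit(":", 1)[-1]
        if port.isdigit():
            ports.add(int(port))
    return sorted(ports)


if __name__ == "__main__":
    import sys
    # Usage: sudo netstat -lptu | python listening_ports.py
    print(listening_ports(sys.stdin.read()))
```

Fed the master listing above, this yields 2181, 9047, 31010 and 45678; fed the executor listing, it yields an empty list, which matches the observation that the executor's java process is not listening on anything.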

If I restart the Dremio service on the Master, the executor logs the disconnection and connects again! Very strange:

2017-09-06 17:38:44,260 [Curator-Framework-0] ERROR org.apache.curator.ConnectionState - Connection timed out for connection string (dremio-master-1:2181) and timeout (5000) / elapsed (54009)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
	at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225) [curator-client-2.12.0.jar:na]
	at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94) [curator-client-2.12.0.jar:na]
	at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117) [curator-client-2.12.0.jar:na]
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:835) [curator-framework-2.12.0.jar:na]
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809) [curator-framework-2.12.0.jar:na]
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64) [curator-framework-2.12.0.jar:na]
	at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267) [curator-framework-2.12.0.jar:na]
	at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_131]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180) [na:1.8.0_131]
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293) [na:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [na:1.8.0_131]
2017-09-06 17:39:09,118 [Curator-ConnectionStateManager-0] INFO  c.d.s.coordinator.zk.ZKClusterClient - ZK connection state changed to RECONNECTED
2017-09-06 17:39:35,095 [main] INFO  c.d.d.s.exec.MasterStatusListener - Waiting for master dremio-master-1:45678
2017-09-06 17:40:35,097 [main] INFO  c.d.d.s.exec.MasterStatusListener - Waiting for master dremio-master-1:45678
2017-09-06 17:41:35,100 [main] INFO  c.d.d.s.exec.MasterStatusListener - Waiting for master dremio-master-1:45678

I will try to redo the installation on another EC2 instance… maybe I screwed something up…

Thanks!

allan.sene · September 11, 2017, 6:58pm

Any ideas, guys? I’ve already tried swapping the master and slave roles and installing on another clean EC2 instance, and nothing works.

I’m always installing from the TAR ball. Do you think it would change anything if I used the RPM package or provisioned with YARN?

yufeldman · September 11, 2017, 9:32pm

Could you look at the "EC2 installation" thread and see if that helps you solve your issue?

allan.sene · September 11, 2017, 10:25pm

Actually, I’ve tried some of the things pointed out in that thread, @yufeldman, like using FQDNs and configuring the same Inbound/Outbound Rules, as I showed above… I don’t know what I am missing.

A thing that differs from my case is the EC2 instance image: mine is Ubuntu. I’m gonna try using Amazon Linux and install with RPM files. Maybe the tar build is broken or something…

yufeldman · September 11, 2017, 11:20pm

Can you paste the dremio.conf from both the executor and the coordinator here?

allan.sene · September 12, 2017, 12:05am

FINALLY! Guys… what an epic! lol

It seems that either the TAR build is broken or Dremio clusters are only functional with Amazon Linux on EC2.

Anyway, I think it would be worthwhile for someone to try to reproduce this problem: install a cluster from the TAR on Ubuntu servers.

Thank you so much @yufeldman and @can for caring
