@balaji.ramaswamy thanks so much for this!
I tried configuring Zookeeper, but I keep getting Fabric connection errors when I try to bring up the executor (see sample error messages below).
Here’s dremio.conf for the master-coordinator:
paths: {
  # the local path for dremio to store data.
  local: ${DREMIO_HOME}"/data",
  # the distributed path Dremio data including job results, downloads, uploads, etc
  #dist: "pdfs://"${paths.local}"/pdfs"
  dist: "dremioS3:///dremio-internal"
}
services: {
  coordinator:{
    enabled: true,
    # Auto-upgrade Dremio at startup if needed
    auto-upgrade: false,
    master: {
      enabled: true,
      # configure an embedded ZooKeeper server on the same node as master
      embedded-zookeeper: {
        enabled: true,
        port: 2181,
        path: ${paths.local}/zk
      }
    },
  }
  fabric: {
    port: 45678,
    memory: {
      reservation: 100M
    }
  },
  # coordinator.master.enabled: true,
  #executor.enabled: true,
  flight.enabled: true,
  flight.ssl.enabled: false
  # flight.use_session_service: true
}
zookeeper: "192.10.0.76:2181"
zk.client.session.timeout: 90000
The master-coordinator comes up fine and I can connect to it as a single node.
Here is dremio.conf for the executor:
paths: {
  # the local path for dremio to store data.
  local: ${DREMIO_HOME}"/data",
  # the distributed path Dremio data including job results, downloads, uploads, etc
  #dist: "pdfs://"${paths.local}"/pdfs"
  dist: "dremioS3:///dremio-internal"
}
services: {
  coordinator.enabled: false,
  coordinator.master.enabled: false,
  executor.enabled: true,
  flight.enabled: false,
  flight.ssl.enabled: false
}
zookeeper: "192.10.0.76:2181"
zk.client.session.timeout: 90000
Here are the error messages that occur when I try to bring up the executor:
dremio-executor-1 | 2023-12-26 22:30:41,241 [FABRIC-4] INFO c.d.services.fabric.FabricClient - [FABRIC]: Channel closed null <--> null (fabric client)
dremio-executor-1 | 2023-12-26 22:30:41,243 [FABRIC-4] ERROR com.dremio.exec.rpc.BasicClient - Failed to establish connection
dremio-executor-1 | java.util.concurrent.ExecutionException: java.net.UnknownHostException: bcd2ae315e7e: Temporary failure in name resolution
Dremio retries several times and then exits with error code 4.
Surprisingly, it does look like the executor gets a Zookeeper connection to the master-coordinator:
dremio-executor-1 | 2023-12-26 22:40:53,721 [main] INFO c.d.s.coordinator.zk.ZKClusterClient - Connect: 192.10.0.76:2181, zkRoot: , clusterId: dremio
dremio-executor-1 | 2023-12-26 22:40:53,840 [main] INFO c.d.s.coordinator.zk.ZKClusterClient - Creating new Zookeeper client with arguments: 192.10.0.76:2181, 90000, false.
dremio-executor-1 | 2023-12-26 22:40:53,877 [main] INFO c.d.s.coordinator.zk.ZKClusterClient - Starting ZKClusterClient, ZK_TIMEOUT:5000, ZK_SESSION_TIMEOUT:90000, ZK_RETRY_MAX_DELAY:300000, ZK_RETRY_UNLIMITED:true, ZK_RETRY_LIMIT:-1, CONNECTION_HANDLE_ENABLED:false, SUPERVISOR_INTERVAL:30000, SUPERVISOR_READ_TIMEOUT:10000, SUPERVISOR_MAX_FAILURES:5
dremio-executor-1 | 2023-12-26 22:40:53,906 [Curator-ConnectionStateManager-0] INFO c.d.s.coordinator.zk.ZKClusterClient - ZKClusterClient: new state received[CONNECTED] - isConnected: true
dremio-executor-1 | 2023-12-26 22:40:54,104 [main] INFO c.d.s.c.z.ZKClusterServiceSetManager - Started zkServiceSet for service coordinator and role COORDINATOR
dremio-executor-1 | 2023-12-26 22:40:54,114 [main] INFO c.d.s.c.z.ZKClusterServiceSetManager - Started zkServiceSet for service executor and role EXECUTOR
dremio-executor-1 | 2023-12-26 22:40:54,125 [main] INFO c.d.s.c.z.ZKClusterServiceSetManager - Started zkServiceSet for service master and role MASTER
dremio-executor-1 | 2023-12-26 22:40:54,125 [main] INFO c.d.s.c.zk.ZKClusterCoordinator - ZKClusterCoordination is up
dremio-executor-1 | 2023-12-26 22:40:54,125 [main] INFO c.d.s.c.TaskLeaderStatusListener - Starting TaskLeaderStatusListener for: MASTER
dremio-executor-1 | 2023-12-26 22:40:54,126 [main] INFO c.d.s.c.TaskLeaderStatusListener - New Leader node for task MASTER cec009d49b33:45678 registered itself.
dremio-executor-1 | 2023-12-26 22:40:54,126 [main] INFO c.d.s.c.TaskLeaderStatusListener - TaskLeaderStatusListener for: MASTER is up
From the executor I’m able to successfully connect to any of the open ports on the master.
I can’t figure out what the RPC errors are about - I don’t see anything about explicitly enabling ports or services for RPC.
The only other thing I can think of is that on the Node Activity page the master node is reporting an “internal” address assigned by Docker rather than its external address of 192.10.0.76.
Is the master trying to connect back out to the executor?
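In case it helps narrow things down, the UnknownHostException suggests checking whether the Docker-assigned hostnames resolve at all from inside the executor container. Something like the sketch below could do that. It assumes Python 3 is available in that container (it may not be in the stock image), and the helper names `can_resolve` and `check_all` are made up for illustration; the hostnames in the comment are copied from the log lines above.

```python
import socket


def can_resolve(hostname: str) -> bool:
    """Return True if this machine can resolve `hostname` to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        # Raised on lookup failures such as "Temporary failure in name resolution".
        return False


def check_all(hostnames):
    """Map each hostname to whether it resolves from the machine running this code."""
    return {name: can_resolve(name) for name in hostnames}


# Example (hostnames taken from the executor log, IP from dremio.conf):
#   print(check_all(["bcd2ae315e7e", "cec009d49b33", "192.10.0.76"]))
```

If the container IDs come back as unresolvable while the IP address works, that would be consistent with the nodes registering Docker-internal hostnames that the other host cannot look up.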
Any advice would be much appreciated.
Thanks!
Eric