Docker deployment (no K8s) coordinator +3 executors

eellsworth · December 21, 2023, 1:35am

Hi,
I currently have Dremio deployed on prem with a single coordinator/executor node, using a local Minio instance both for internal storage (1 separate bucket) and for data sources (other buckets).

I want to add 3 more executors running in standalone Docker containers, each on their own machine, and I’m looking for some guidance on how to do this.

Due to the limitations of my network setup (which I don’t control), all the nodes must be referenced by static IP addresses, no DNS.

Here are my questions:

1. How do I tell the executors which IP to look for the coordinator on?
2. Should the executors be configured to use S3 for internal storage? If so, do they share the same bucket as the coordinator or do they need their own isolated buckets?
3. Does the coordinator need to be reconfigured to look for executors?

I’ve found some documentation, but it is old and I’m unclear whether it applies to current versions, and I can’t find anything about multiple Docker/non-K8s executors with S3 internal storage.

Any advice would be appreciated.

Thanks!

Eric

balaji.ramaswamy · December 25, 2023, 5:48am

Hi @eellsworth

Questions #1 and #3 are handled via ZooKeeper. If you are using an embedded ZK, then give that setting on both the coordinator and the executors.

For question #2, yes, the executors should be configured with the dist parameter.

eellsworth · December 26, 2023, 10:51pm

@balaji.ramaswamy thanks so much for this!

I tried configuring Zookeeper, but I keep getting Fabric connection errors when I try to bring up the executor (see sample error messages below).

Here’s dremio.conf for the master-coordinator:

paths: {
  # the local path for dremio to store data.
  local: ${DREMIO_HOME}"/data",

  # the distributed path Dremio data including job results, downloads, uploads, etc
  #dist: "pdfs://"${paths.local}"/pdfs"
  dist: "dremioS3:///dremio-internal"
}

services: {
  coordinator:{
    enabled: true,
    # Auto-upgrade Dremio at startup if needed
    auto-upgrade: false,

    master: {
      enabled: true,
      # configure an embedded ZooKeeper server on the same node as master
      embedded-zookeeper: {
        enabled: true,
        port: 2181,
        path: ${paths.local}/zk
      }
    },
  }



  fabric: {
    port: 45678,
    memory: {
      reservation: 100M
    }
  },

#  coordinator.master.enabled: true,
  #executor.enabled: true,
  flight.enabled: true,
  flight.ssl.enabled: false
#  flight.use_session_service: true
}

zookeeper: "192.10.0.76:2181"
zk.client.session.timeout: 90000

The master-coordinator comes up fine and I can connect to it as a single node.

Here is dremio.conf for the executor:

paths: {
  # the local path for dremio to store data.
  local: ${DREMIO_HOME}"/data",

  # the distributed path Dremio data including job results, downloads, uploads, etc
  #dist: "pdfs://"${paths.local}"/pdfs"
  dist: "dremioS3:///dremio-internal"
}

services: {
  coordinator.enabled: false,
  coordinator.master.enabled: false,
  executor.enabled: true,
  flight.enabled: false,
  flight.ssl.enabled: false
}

zookeeper: "192.10.0.76:2181"
zk.client.session.timeout: 90000

Here are the error messages that occur when I try to bring up the executor:

dremio-executor-1  | 2023-12-26 22:30:41,241 [FABRIC-4] INFO  c.d.services.fabric.FabricClient - [FABRIC]: Channel closed null <--> null (fabric client)
dremio-executor-1  | 2023-12-26 22:30:41,243 [FABRIC-4] ERROR com.dremio.exec.rpc.BasicClient - Failed to establish connection
dremio-executor-1  | java.util.concurrent.ExecutionException: java.net.UnknownHostException: bcd2ae315e7e: Temporary failure in name resolution

Dremio tries several times and then exits with error code 4.

Surprisingly, it does look like Zookeeper gets a connection to the master-coordinator:

dremio-executor-1  | 2023-12-26 22:40:53,721 [main] INFO  c.d.s.coordinator.zk.ZKClusterClient - Connect: 192.10.0.76:2181, zkRoot: , clusterId: dremio
dremio-executor-1  | 2023-12-26 22:40:53,840 [main] INFO  c.d.s.coordinator.zk.ZKClusterClient - Creating new Zookeeper client with arguments: 192.10.0.76:2181, 90000, false.
dremio-executor-1  | 2023-12-26 22:40:53,877 [main] INFO  c.d.s.coordinator.zk.ZKClusterClient - Starting ZKClusterClient, ZK_TIMEOUT:5000, ZK_SESSION_TIMEOUT:90000, ZK_RETRY_MAX_DELAY:300000, ZK_RETRY_UNLIMITED:true, ZK_RETRY_LIMIT:-1, CONNECTION_HANDLE_ENABLED:false, SUPERVISOR_INTERVAL:30000, SUPERVISOR_READ_TIMEOUT:10000, SUPERVISOR_MAX_FAILURES:5
dremio-executor-1  | 2023-12-26 22:40:53,906 [Curator-ConnectionStateManager-0] INFO  c.d.s.coordinator.zk.ZKClusterClient - ZKClusterClient: new state received[CONNECTED] - isConnected: true
dremio-executor-1  | 2023-12-26 22:40:54,104 [main] INFO  c.d.s.c.z.ZKClusterServiceSetManager - Started zkServiceSet for service coordinator and role COORDINATOR
dremio-executor-1  | 2023-12-26 22:40:54,114 [main] INFO  c.d.s.c.z.ZKClusterServiceSetManager - Started zkServiceSet for service executor and role EXECUTOR
dremio-executor-1  | 2023-12-26 22:40:54,125 [main] INFO  c.d.s.c.z.ZKClusterServiceSetManager - Started zkServiceSet for service master and role MASTER
dremio-executor-1  | 2023-12-26 22:40:54,125 [main] INFO  c.d.s.c.zk.ZKClusterCoordinator - ZKClusterCoordination is up
dremio-executor-1  | 2023-12-26 22:40:54,125 [main] INFO  c.d.s.c.TaskLeaderStatusListener - Starting TaskLeaderStatusListener for: MASTER
dremio-executor-1  | 2023-12-26 22:40:54,126 [main] INFO  c.d.s.c.TaskLeaderStatusListener - New Leader node for task MASTER cec009d49b33:45678 registered itself.
dremio-executor-1  | 2023-12-26 22:40:54,126 [main] INFO  c.d.s.c.TaskLeaderStatusListener - TaskLeaderStatusListener for: MASTER is up

From the executor I’m able to successfully connect to any of the open ports on the master.

I can’t figure out what the RPC errors are about - I don’t see anything about explicitly enabling ports or services for RPC.

The only other thing I can think of is that, on the Node Activity page, the master node is reporting an “internal” address assigned by Docker rather than its external address of 192.10.0.76.

Is the master trying to connect back out to the executor?

Any advice would be much appreciated.

Thanks!

Eric

balaji.ramaswamy · December 27, 2023, 5:15am

Hi @eellsworth

Did you see this error? It looks like a hostname resolution issue:

dremio-executor-1  | java.util.concurrent.ExecutionException: java.net.UnknownHostException: bcd2ae315e7e: Temporary failure in name resolution

Thanks
Bali

eellsworth · December 28, 2023, 2:37am

Bali,
Thanks!
I have a feeling that the problem originates from the fact that my containers are running in Docker bridge networks, and they are assigned internal IPs, e.g. 172.21.0.3 that are different from the external ones.

I can add static hostnames to /etc/hosts via the extra_hosts argument in docker-compose:

    extra_hosts:
      - "dremio_coordinator=192.10.0.76"
      - "dremio_executor_1=192.10.0.77"
    hostname: dremio_coordinator

But this results in many more errors on the coordinator. I think there must be conflicts between the IP the Docker container reports to Dremio and the IP associated with the hostname in /etc/hosts.

I am going to try deploying coordinator and executor in a Docker overlay network - I believe this way both will be referencing internal Docker IP addresses and hostnames.

Thanks so much for the help.

eellsworth · January 7, 2024, 3:47pm

An update: it turns out that once I connected both machines to the same Docker overlay network, the hostname issues were resolved.
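For anyone reproducing this, here is a minimal sketch of what an executor's compose file might look like on a shared overlay network. The thread does not include the final file, so everything named here is an assumption: the network name dremio-net, the service name, the hostname, and the image. It also assumes the machines have already been joined into a swarm and that the network was created beforehand as an attachable overlay, which is what allows standalone containers to use it.

```yaml
# docker-compose.yml on an executor machine (illustrative sketch only)
# Assumed to exist already, created once on a swarm manager with:
#   docker network create --driver overlay --attachable dremio-net
services:
  dremio-executor:
    image: dremio/dremio-oss          # assumption: the image is not stated in the thread
    hostname: dremio-executor-1       # hypothetical name; hyphens instead of underscores
    networks:
      dremio-net:
        aliases:
          - dremio-executor-1         # makes the hostname resolvable by other containers on the network

networks:
  dremio-net:
    external: true                    # reuse the pre-created overlay network; do not create a new one
```

The alias matters because Docker's embedded DNS resolves container names and network aliases, and the name a node registers in ZooKeeper has to be resolvable by the other nodes. The executor's dremio.conf would still need to be supplied to the container, and its zookeeper setting would need to point at an address reachable from inside the overlay network. On worker machines the overlay network may not be visible to Docker until a container attaches to it, so the external reference might need some care there.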

Thanks again for the advice!

Eric

balaji.ramaswamy · January 8, 2024, 5:34am

Thanks for the update @eellsworth

Glad to know the issue is resolved
