Dremio HA on K8s - web port in chart v1 vs v2

jmuszynski · April 24, 2024, 7:12pm

Hi

I wonder why the coordinators in Helm chart v2 do not expose web port 9047.
This slightly breaks the HA of Dremio in k8s, and I’m not sure whether there is a reason not to load-balance traffic across all coordinator nodes (master and coordinators).

Chart v1 exposed port 9047 on a coordinator node:

https://github.com/dremio/dremio-cloud-tools/blob/master/charts/dremio/templates/dremio-coordinator.yaml#L61

                -Dzookeeper=zk-hs:2181
                -Dservices.coordinator.master.enabled=false
                -Dservices.executor.enabled=false
                {{- if .Values.extraStartParams }}
                {{ .Values.extraStartParams }}
                {{- end }}
            command: ["/opt/dremio/bin/dremio"]
            args:
            - "start-fg"
            ports:
            - containerPort: 9047
              name: web
            - containerPort: 31010
              name: client
            - containerPort: 45678
              name: server
          initContainers:
          - name: wait-for-zk
            image: busybox
            command:  ["sh", "-c", "until nc -z dremio-client {{ .Values.coordinator.web.port | default 9047 }} > /dev/null; do echo waiting for dremio master; sleep 2; done;"]
          {{- if .Values.tls.ui.enabled }}

while chart v2 does not:

https://github.com/dremio/dremio-cloud-tools/blob/master/charts/dremio_v2/templates/dremio-coordinator.yaml#L79

              -Dservices.conduit.port=45679
          - name: AWS_CREDENTIAL_PROFILES_FILE
            value: "/opt/dremio/aws/credentials"
          - name: AWS_SHARED_CREDENTIALS_FILE
            value: "/opt/dremio/aws/credentials"
          - name: DREMIO_LOG_TO_CONSOLE
            value: "true"
          {{- include "dremio.coordinator.log.path" $ | nindent 8 }}
          command: ["/opt/dremio/bin/dremio"]
          args: ["start-fg"]
          ports:
          - containerPort: 31010
            name: client
          - containerPort: 32010
            name: flight
          - containerPort: 45678
            name: server-fabric
          - containerPort: 45679
            name: server-conduit
          startupProbe:
            httpGet:

With chart v2, if the dremio-master is unavailable, the Dremio API is unavailable (even if a coordinator is working correctly in the cluster).

The docs state:

Dremio’s web application can be made highly available by leveraging more than one coordinator node and a reverse proxy/load balancer.
All web clients connect to the load balancer rather than directly connecting to an individual coordinator node. The load balancer distributes these connections to the current active master node.
https://docs.dremio.com/current/get-started/cluster-deployments/customizing-configuration/dremio-conf/high-availability-config/

It is not clear to me how the LB (nginx in the example) could distinguish the current master node from the other coordinator nodes, as they all expose port 9047…

I extended chart v2's dremio-coordinator.yaml with

    ports:
    - containerPort: 9047
      name: web

and it seems to work well: the web UI and API remain available even if the dremio-master pod is turned off (provided a coordinator is present in the cluster).
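
For the added container port to reach clients, a Service also has to route port 9047 to the coordinator pods. A minimal sketch of such a Service follows. The name `dremio-web-all` is hypothetical, and the selector label `app: dremio-coordinator` is an assumption about how the chart labels the master and coordinator pods, so it should be checked against `kubectl get pods --show-labels` first.

```yaml
# Hypothetical Service that load-balances web/REST traffic across every pod
# matching the selector. The name and the label below are assumptions, not
# values confirmed by the chart.
apiVersion: v1
kind: Service
metadata:
  name: dremio-web-all        # hypothetical name
spec:
  type: ClusterIP
  selector:
    app: dremio-coordinator   # assumed label shared by master and coordinator pods
  ports:
  - name: web
    port: 9047
    targetPort: web           # resolves to the named containerPort "web" on each pod
```

Using the port name as `targetPort` means the Service only forwards to pods that actually declare a `web` container port, which is why the `ports` entry in dremio-coordinator.yaml matters here.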

Could this change cause any issues for the Dremio cluster?

jmuszynski · April 24, 2024, 7:41pm

To give more insight: my workload mainly uses REST API calls, no ODBC/JDBC… I’m not sure how this relates to the following quote:

Configure multiple secondary coordinator nodes to improve concurrency and distribute query planning for ODBC and JDBC client requests to your deployment. To distribute queries across your deployment, Dremio recommends that you also configure ZooKeeper.
Secondary coordinator nodes run ODBC and JDBC queries only from clients that connect to the secondary coordinator either directly or with a load balancer.

balaji.ramaswamy · May 3, 2024, 7:34pm

@jmuszynski By default the UI port is 9047; if the master goes down, it should come up on another pod automatically. HA is different from scale-out, and this differs considerably between K8s and the Cloud version. On K8s the master is a SPOF: when it goes down, the application is not accessible. If a scale-out coordinator goes down in the software version, only the queries running on that coordinator fail. In Cloud, the coordinator is masterless and hence is not a SPOF.
