dremio-master-coordinator cannot access volume, causing CrashLoopBackOff

Hello everyone,
I am trying to deploy Dremio on Azure Kubernetes Service. I have already modified values.yaml to fit my cluster capacity, and ZooKeeper and the executor are running without problems.
I don’t really know what is going on, but I have collected a few clues.
As you may have guessed from the title, my pod dremio-master-0 crashes when dremio-master-coordinator tries to access /opt/dremio/data.

java.io.IOException: path /opt/dremio/data is not writable.
        at com.dremio.dac.daemon.PathUtils.checkWritePath(PathUtils.java:58)
        at com.dremio.dac.daemon.DACDaemon.<init>(DACDaemon.java:159)
        at com.dremio.dac.daemon.DACDaemon.newDremioDaemon(DACDaemon.java:316)
        at com.dremio.dac.daemon.DACDaemon.newDremioDaemon(DACDaemon.java:324)
        at com.dremio.dac.daemon.DremioDaemon.main(DremioDaemon.java:103)

I also get an error when the coordinator tries to write to the /opt/dremio/log/ folder; this is the fatal error for the pod:

at java.io.FileNotFoundException: /opt/dremio/log/hive.deprecated.function.warning.log (No such file or directory)

So I inspected the pod and found that the initContainer “upgrade-task” returns the log below.
I don’t think this is normal behaviour …

Database not found. Skipping upgrade.

I tried to work around the problem by running as root (user 0)

- name: dremio-master-coordinator
  image: {{ $.Values.image }}:{{ $.Values.imageTag }}
  imagePullPolicy: IfNotPresent
  securityContext:
    runAsUser: 0

That gives me multiple errors.

This is where I am now. I don’t understand how to fix this, but maybe you have encountered one of them.

@Nicolas-Malgat Why is /opt/dremio/data not writable, or is it a false message?

Okay, I had edited a few files in the Dremio chart to match documentation from a previous installation, which is not relevant for me. I’m now back to the original GitHub chart with my custom values.yaml.

The file-not-found exception occurred because I had removed the following initContainers: wait-for-zookeeper and chown-data-directory.

The “Illegal base64 character 5f” error is caused by an underscore character. It was because I copy-pasted the core-site.xml here.
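For anyone hitting the same message: 5f is the hexadecimal ASCII code of the underscore, which is not part of the standard Base64 alphabet (it only exists in the URL-safe variant). Below is a minimal Python sketch of that check. The function names and both key strings are made up for illustration; they are not taken from the chart or from a real account.

```python
import base64
import binascii

# Characters allowed in standard (non URL-safe) Base64, including padding.
STANDARD_ALPHABET = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/="
)


def illegal_base64_chars(value):
    """Return (character, hex code) pairs that are not standard Base64."""
    return [(ch, format(ord(ch), "x")) for ch in value if ch not in STANDARD_ALPHABET]


def is_standard_base64(value):
    """Return True if the value decodes as strict standard Base64."""
    try:
        base64.b64decode(value, validate=True)
        return True
    except binascii.Error:
        return False


# Hypothetical values, not real keys.
good_key = base64.b64encode(b"not-a-real-key").decode("ascii")
bad_key = "YOUR_KEY_HERE"

print(illegal_base64_chars(bad_key))
print(is_standard_base64(good_key), is_standard_base64(bad_key))
```

Running a check like this on the key value before pasting it into core-site.xml shows which character the decoder will reject.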

I encountered an additional problem: authentication failed on my Azure Data Lake Storage Gen2 (ADLS Gen2) account because I used the wrong access key.

Caused by: java.lang.RuntimeException: {"error":{"code":"AuthenticationFailed","message":"Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.\nRequestId:bdef46e3-a01f-0050-76e5-15dc12000000\nTime:2022-12-22T09:12:59.0961111Z","detail":
{"AuthenticationErrorDetail":
"The MAC signature found in the HTTP request 'AmSVQNkzlS90LXdYqISo8kQDAofFp5LleihGmwHHuq4=' is not the same as any computed signature. Server used following string to sign: 'GET\n\n\n\n\n\n\n\n\n\n\n\nx-ms-client-request-id:86ff6f9b-9ada-4d36-8c78-0c55a4d7a3ca\nx-ms-date:Thu, 22 Dec 2022 09:12:59 GMT\nx-ms-version:2019-07-07\n\/dlsdremiodev\/\ncontinuation:\nmaxresults:100\nresource:account'."}}}
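The error text explains why a wrong key fails in this way. With Shared Key authorization, the client signs a canonical string with HMAC-SHA256 using the Base64-decoded account key, and the service recomputes the signature with the key it holds. The following is a rough Python sketch of that comparison. Both keys and the shortened string to sign are assumptions for illustration only; a real request signs the full string shown in the error.

```python
import base64
import hashlib
import hmac


def shared_key_signature(account_key_b64, string_to_sign):
    """Base64(HMAC-SHA256(string_to_sign, Base64-decoded account key))."""
    key = base64.b64decode(account_key_b64)
    digest = hmac.new(key, string_to_sign.encode("utf-8"), hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


# Hypothetical keys: one the service knows, one copied from somewhere else.
correct_key = base64.b64encode(b"key-stored-by-the-service").decode("ascii")
wrong_key = base64.b64encode(b"key-from-another-account").decode("ascii")

# Shortened, illustrative string to sign (not the complete canonical form).
string_to_sign = "GET\n\nx-ms-date:Thu, 22 Dec 2022 09:12:59 GMT\n/dlsdremiodev/"

server_side = shared_key_signature(correct_key, string_to_sign)
client_side = shared_key_signature(wrong_key, string_to_sign)

# The service rejects the request when the two signatures differ.
print("signatures match:", hmac.compare_digest(server_side, client_side))
```

Because the signature depends on every byte of the key, a key from another storage account or a truncated paste produces a signature the service cannot match.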

@Nicolas-Malgat Are we all clear now, or do we still have any open issues? Sorry, just asking so we can help

I have new problems; maybe they need a separate post.

My main problem right now is that I can’t connect to the service: there is no response from the pods when I try to reach the public IP on port 9047.

In the logs I see a few messages that you have already answered on community.dremio.com:

  • “Could not find session with sessionId”
    For this one I connected to my master pod’s shell, but netstat is not installed and I couldn’t find out how to become superuser to install it
  • “Master coordinator is down”
    I tried to access the logs and added the configuration below under “env:” after line 81 of dremio-master.yaml
        - name: DREMIO_LOG_TO_CONSOLE
          value: "0"
        - name: DREMIO_LOG_DIR
          value: "/opt/dremio/data/log"

but the coordinator still tries to write its logs to /opt/dremio/log/

java.io.FileNotFoundException: /opt/dremio/log/server.log (No such file or directory)

nicolas_logs.zip (32.7 KB)

I am attaching a short and a full version of my kubectl logs to help. The short version is a grep for the WARN and Exception keywords. I also collected logs on a more powerful cluster to test whether capacity was the problem; those are labeled “POWER”.
Some liveness probe and readiness probe pod events occur, but I think there is nothing to worry about since they don’t show up in my “powerful cluster” test.

@Nicolas-Malgat Did you see the below? This is from the executor

java.util.concurrent.ExecutionException: java.net.UnknownHostException: dremio-master-0.dremio-cluster-pod.default.svc.cluster.local: Name or service not known
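That hostname follows the Kubernetes naming pattern for a StatefulSet pod behind a headless service: pod name, service name, namespace, then the cluster domain. Here is a small Python sketch that splits the name into those parts and offers a resolution check. It assumes the default cluster domain svc.cluster.local, and the function names are made up for illustration.

```python
import socket

CLUSTER_SUFFIX = ".svc.cluster.local"  # assumed default cluster domain


def split_pod_fqdn(fqdn):
    """Split <pod>.<service>.<namespace>.svc.cluster.local into its parts."""
    if not fqdn.endswith(CLUSTER_SUFFIX):
        raise ValueError("not a cluster-local name: " + fqdn)
    parts = fqdn[: -len(CLUSTER_SUFFIX)].split(".")
    if len(parts) != 3:
        raise ValueError("expected <pod>.<service>.<namespace>, got: " + fqdn)
    pod, service, namespace = parts
    return {"pod": pod, "service": service, "namespace": namespace}


def resolves(hostname):
    """Return True if the local resolver can resolve the hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


name = "dremio-master-0.dremio-cluster-pod.default.svc.cluster.local"
print(split_pod_fqdn(name))
```

Calling resolves(name) is only meaningful from a pod inside the cluster, since cluster-local names are not served by outside DNS. If it returns False there, the parts printed above indicate which service and namespace to check.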

Hey, I’m back at work.

So I checked and saw that error, but I don’t really have a clue what to do about it.
Am I supposed to set a “publicly accessible DNS name”, i.e. a static URL, for my AKS cluster?
I don’t think I can set this address in my values.yaml.
I’m using Microsoft Azure for my cluster.

EDIT: using a different internet connection than my company’s, I get the Dremio login page.
:face_with_head_bandage:
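A difference like this between two networks can be confirmed with a plain TCP connection test against the UI port, run once from each network. Here is a minimal Python sketch. The function name is made up, and the demo connects to a throwaway local listener so that it runs anywhere; for the real check, pass the service’s public IP and port 9047 instead.

```python
import socket


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Self-contained demo: listen on an ephemeral local port and connect to it.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
demo_port = listener.getsockname()[1]

print("local listener reachable:", port_is_reachable("127.0.0.1", demo_port))
listener.close()
```

If the check succeeds from one network and times out from another while the pods are healthy, the block sits between the client and the cluster rather than in Dremio.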

@Nicolas-Malgat That’s great, is there a firewall that is blocking the port when you are trying within your office network?

Yes, exactly.
I think the issue can be closed. Thank you for the help :slight_smile:

Thanks for the update @Nicolas-Malgat, and you are most welcome