Dremio-master-0 restart loop after k8s node problem

Hi Everyone,

I have an issue with my Dremio deployment. It was deployed on top of a K3S cluster consisting of 3 nodes and has been running for about 2 months.

Yesterday, while doing maintenance on the servers, one of the K3S nodes became unresponsive and had to be stopped. Unfortunately, that is the node where the dremio-master pod was running.
Now the dremio-master StatefulSet status is In Progress, while the dremio-master-0 status is Running with the note "Containers with unready status: [dremio-master-coordinator]".

Looking at the dremio-master-0 logs, I see that it tried to recover:

INFO c.d.datastore.CoreStoreProviderImpl - Dremio was not stopped properly, so the indexes need to be synced with the stores. This may take a while..

It then removes some old meters until this message appears:

WARN com.dremio.telemetry.api.Telemetry - Failure reading telemetry configuration. Leaving telemetry as is. java.lang.IllegalArgumentException: resource dremio-telemetry.yaml not found.

It continues with the master election process, and then another message appears:

WARN c.d.e.catalog.MetadataSynchronizer - Source ‘sys’ sync failed unexpectedly. Will try again later
java.lang.NullPointerException: Master coordinator is down

dremio-master-0_dremio-master-coordinator.zip (23.6 KB)
Attached is the full log of the dremio-master-0 pod during one of its crash loops.

I would appreciate it if someone could help me resolve this issue. Thank you.

@airawan

Sorry for the late reply. I'm not sure if this is still an issue or if you have moved ahead. Looking at the log file you sent, I see an unknown host exception. Can you check DNS resolution?

Unable to resolve host dremio-master-0.dremio-cluster-pod.default.svc.cluster.local
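
One way to run that check, as a minimal sketch: the snippet below asks the resolver for a hostname and returns whatever addresses come back. It assumes Python 3 is available in some pod in the same cluster (the Dremio image itself may not ship it), and the helper name resolve_host is made up for this example. Only the hostname comes from the log above.

```python
import socket


def resolve_host(hostname):
    """Return the sorted, de-duplicated IP addresses that `hostname` resolves to.

    Returns an empty list if the resolver cannot find the name, which is
    the situation the "Unable to resolve host" log line describes.
    """
    try:
        # getaddrinfo goes through the system resolver, so inside a pod it
        # uses the cluster DNS configured in /etc/resolv.conf.
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the IP address as a string.
    return sorted({info[4][0] for info in infos})


# Example call, using the hostname reported in the log:
# resolve_host("dremio-master-0.dremio-cluster-pod.default.svc.cluster.local")
```

An empty list from inside the cluster would suggest the DNS record for the pod is missing or not yet published. A non-empty list can be compared with the pod IP currently reported for dremio-master-0 to see whether the record is stale.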

Hi @balaji.ramaswamy ,

The issue resolved itself. I’m not sure what happened over the weekend, but when I checked again on Monday morning the Dremio master was up. It took a while (6 days), but eventually dremio-master-0 came up.
I’m not really sure how it works, but do you think the CoreDNS entry was perhaps pointing to the old dremio-master (the one that was stopped) until it expired? I didn’t think to check it back then.
Thanks for your response. I’ll look into it further if I run into the same situation in the future.