Dremio-master-0 restart loop after k8s node problem

Hi Everyone,

I have an issue with my Dremio deployment. It was deployed on top of a K3s cluster consisting of 3 nodes and had been running for about 2 months.

Yesterday, while doing maintenance on the servers, one of the K3s nodes became unresponsive and had to be stopped. Unfortunately, it was the node where the dremio-master pod was running.
Now the dremio-master StatefulSet status is In Progress, while the dremio-master-0 pod status is Running with the note Containers with unready status: [dremio-master-coordinator].

Looking at the dremio-master-0 logs, I can see that it tried to recover:

INFO c.d.datastore.CoreStoreProviderImpl - Dremio was not stopped properly, so the indexes need to be synced with the stores. This may take a while..

It then removes some old meters until this message appears:

WARN com.dremio.telemetry.api.Telemetry - Failure reading telemetry configuration. Leaving telemetry as is. java.lang.IllegalArgumentException: resource dremio-telemetry.yaml not found.

It continues with the MASTER election process, and then another message appears:

WARN c.d.e.catalog.MetadataSynchronizer - Source ‘sys’ sync failed unexpectedly. Will try again later
java.lang.NullPointerException: Master coordinator is down

dremio-master-0_dremio-master-coordinator.zip (23.6 KB)
Attached is the full log of the dremio-master-0 pod during one of its crash loops.
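For reference, this is roughly how I collected the status and log output above. The namespace is an assumption here (I used the default one); adjust `-n` to wherever your chart is installed:

```shell
# Hypothetical diagnostics helper: builds and prints the kubectl commands
# used to gather the information attached above, so the exact invocations
# are recorded with the report. NAMESPACE is an assumption; adjust it to
# match your deployment.
NAMESPACE=default
POD=dremio-master-0

STATUS_CMD="kubectl -n $NAMESPACE rollout status statefulset/dremio-master"
DESCRIBE_CMD="kubectl -n $NAMESPACE describe pod $POD"
# --previous grabs the log of the crashed container instance, which is
# what matters in a restart loop.
LOGS_CMD="kubectl -n $NAMESPACE logs $POD -c dremio-master-coordinator --previous"

printf '%s\n%s\n%s\n' "$STATUS_CMD" "$DESCRIBE_CMD" "$LOGS_CMD"
```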

I would appreciate it if someone could help me resolve this issue. Thank you.