Hi Everyone,
I have an issue with my Dremio deployment. It was deployed on top of a K3S cluster consisting of 3 nodes and has been running for about 2 months.
Yesterday, while doing maintenance on the servers, one of the K3S nodes became unresponsive and had to be stopped. Unfortunately, that is the node where the dremio-master pod was running.
Now the dremio-master StatefulSet status is In Progress, while the dremio-master-0 pod status is Running with the note: Containers with unready status: [dremio-master-coordinator].
Looking at the dremio-master-0 logs, I see that it tried to recover:
INFO c.d.datastore.CoreStoreProviderImpl - Dremio was not stopped properly, so the indexes need to be synced with the stores. This may take a while..
It then removed some old meters until this message appeared:
WARN com.dremio.telemetry.api.Telemetry - Failure reading telemetry configuration. Leaving telemetry as is. java.lang.IllegalArgumentException: resource dremio-telemetry.yaml not found.
It continued with the MASTER election process and then logged another message:
WARN c.d.e.catalog.MetadataSynchronizer - Source 'sys' sync failed unexpectedly. Will try again later
java.lang.NullPointerException: Master coordinator is down
dremio-master-0_dremio-master-coordinator.zip (23.6 KB)
Attached is the full log of the dremio-master-0 pod during one of its crash loops.
I would appreciate it if someone could help me resolve this issue. Thank you.