Unexpected Coordinator service shutting down

Hi everyone,

I woke up one morning to the Coordinator service no longer available. All the Executor services are running fine in the cluster. I take a look at the server.log. And here’s the log before it went down.

2021-12-27 15:14:20,557 [Curator-ConnectionStateManager-0] INFO c.d.s.coordinator.zk.ZKClusterClient - ZK connection state changed to SUSPENDED
2021-12-27 15:14:20,557 [Curator-ConnectionStateManager-0] INFO c.d.s.coordinator.zk.ZKClusterClient - Lost latch 73bd37be-ca13-40bb-88d6-dd76260d59b5 for election master.
2021-12-27 15:14:22,101 [Curator-ConnectionStateManager-0] INFO c.d.s.coordinator.zk.ZKClusterClient - ZK connection state changed to RECONNECTED
2021-12-27 15:14:24,324 [zk-curator-12] INFO c.d.s.coordinator.zk.ZKClusterClient - New leader elected. Invoke cancel on listener
2021-12-27 15:14:24,337 [zk-curator-12] ERROR ROOT - Dremio is exiting. Node lost its master status.
2021-12-27 15:14:24,466 [main-EventThread] INFO c.d.s.coordinator.zk.ZKClusterClient - Acquired latch 73bd37be-ca13-40bb-88d6-dd76260d59b5 for election master.
2021-12-27 15:14:24,498 [Curator-ConnectionStateManager-0] INFO c.d.s.coordinator.zk.ZKClusterClient - ZK connection state changed to SUSPENDED

Here’s the server.out
CLOSING DOWN CONNECTION AS IT COULD NOT BE RETURNED TO THE POOL
EXCEPTION WHILE DESTROYING OBJECT org.apache.commons.dbcp2.datasources.PooledConnectionAndInfo@f596b02
java.lang.NullPointerException
at org.apache.commons.pool2.impl.GenericKeyedObjectPool.invalidateObject(GenericKeyedObjectPool.java:578)
at org.apache.commons.dbcp2.datasources.KeyedCPDSConnectionFactory.connectionClosed(KeyedCPDSConnectionFactory.java:264)
at oracle.jdbc.pool.OraclePooledConnection.callListener(OraclePooledConnection.java:518)
at oracle.jdbc.pool.OraclePooledConnection.logicalClose(OraclePooledConnection.java:489)
at oracle.jdbc.driver.LogicalConnection.closeInternal(LogicalConnection.java:307)
at oracle.jdbc.driver.LogicalConnection.close(LogicalConnection.java:268)
at com.dremio.exec.store.jdbc.JdbcSchemaFetcherImpl.$closeResource(JdbcSchemaFetcherImpl.java:339)
at com.dremio.exec.store.jdbc.JdbcSchemaFetcherImpl.getState(JdbcSchemaFetcherImpl.java:611)
at com.dremio.exec.store.jdbc.JdbcStoragePlugin.getState(JdbcStoragePlugin.java:327)
at com.dremio.exec.catalog.ManagedStoragePlugin.lambda$refreshState$7(ManagedStoragePlugin.java:946)
at java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

@borasy

Add the below to your dremio-env and restart the coordinator, then next time this happens, attach sever.gc.[1-5] and server.gc from the Dremio master coordinator log folder

DREMIO_JAVA_SERVER_EXTRA_OPTS="-XX:+PrintClassHistogramBeforeFullGC -XX:+PrintClassHistogramAfterFullGC -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:MaxGCPauseMillis=500 -XX:InitiatingHeapOccupancyPercent=25"

Okay, i will add that and report back if it crushes again.