Dremio On-Premise Server Down - ERROR ROOT Dremio is exiting. Node lost its master status

My Dremio server went down suddenly. I'm running single-node mode on CentOS 7. The error says the node lost its master status. Any ideas on how to solve this?

Here’s the error log from server.log:
2021-12-28 05:39:05,469 [e2 - 1e355d62-b9df-b226-f963-1346446fbb00:frag:0:0] INFO c.d.s.e.f.FragmentStatusReporter - 1e355d62-b9df-b226-f963-1346446fbb00:0:0: State to report: FINISHED
2021-12-28 05:39:05,475 [out-of-band-observer] INFO query.logger - {"queryId":"1e355d62-b9df-b226-f963-1346446fbb00","schema":"[]","queryText":"REFRESH REFLECTION 'd2fee23e-e7de-40b2-a9bc-a320c78a7423' AS '028ed699-0e24-4399-a32f-094b1f621464'","start":1640669852111,"finish":1640669945471,"outcome":"COMPLETED","username":"$user1$"}
2021-12-28 05:39:21,415 [LuceneSearchIndex:committer materialization_store] WARN com.dremio.datastore.WarningTimer - LuceneSearchIndex commit took 219592 ms
2021-12-28 05:39:28,665 [Curator-ConnectionStateManager-0] INFO c.d.s.coordinator.zk.ZKClusterClient - ZK connection state changed to SUSPENDED
2021-12-28 05:39:28,666 [Curator-ConnectionStateManager-0] INFO c.d.s.coordinator.zk.ZKClusterClient - Lost latch 89a5ed6a-8657-4c40-9657-58573781f37a for election master.
2021-12-28 05:39:36,745 [dremio-general-407] WARN com.dremio.datastore.WarningTimer - Reflection Manager took 59120 ms
2021-12-28 05:39:36,797 [LuceneSearchIndex:committer metadata-dataset-splits] WARN com.dremio.datastore.WarningTimer - LuceneSearchIndex commit took 238469 ms
2021-12-28 05:39:36,803 [out-of-band-observer] INFO query.logger - {"queryId":"1e355ce7-7d34-101b-6d0b-3c4f16223000","schema":"[]","queryText":"LOAD MATERIALIZATION METADATA \"d2fee23e-e7de-40b2-a9bc-a320c78a7423\".\"028ed699-0e24-4399-a32f-094b1f621464\"","start":1640669976674,"finish":1640669976801,"outcome":"COMPLETED","username":"$user1$"}
2021-12-28 05:39:59,592 [Curator-ConnectionStateManager-0] INFO c.d.s.coordinator.zk.ZKClusterClient - ZK connection state changed to RECONNECTED
2021-12-28 05:40:28,668 [zk-curator-8] INFO c.d.s.coordinator.zk.ZKClusterClient - Not able to get election status in 60000ms. Cancelling election...
2021-12-28 05:40:28,669 [zk-curator-8] INFO c.d.s.coordinator.zk.ZKClusterClient - New leader elected. Invoke cancel on listener
2021-12-28 05:40:29,100 [LuceneSearchIndex:committer dac-namespace] WARN com.dremio.datastore.WarningTimer - LuceneSearchIndex commit took 116993 ms
2021-12-28 05:40:50,563 [LuceneSearchIndex:committer refresh_store] WARN com.dremio.datastore.WarningTimer - LuceneSearchIndex commit took 147345 ms
2021-12-28 05:40:50,570 [zk-curator-8] ERROR ROOT - Dremio is exiting. Node lost its master status.
2021-12-28 05:40:50,573 [Curator-ServiceCache-0] WARN c.d.exec.maestro.FragmentTracker - Nodes [LOCALHOST:0] no longer registered in cluster. Canceling query 1e355ec4-08c7-2363-eca2-17d19a33b400
2021-12-28 05:40:50,574 [Curator-ServiceCache-0] INFO c.d.exec.work.foreman.AttemptManager - 1e355ec4-08c7-2363-eca2-17d19a33b400: State change requested RUNNING → FAILED, Exception com.dremio.common.exceptions.ExecutionSetupException: One or more nodes lost connectivity during query. Identified nodes were [LOCALHOST:0].
2021-12-28 05:40:50,575 [FABRIC-rpc-event-queue] INFO c.dremio.sabot.exec.FragmentHandler - set cancel for fragment 0:0
2021-12-28 05:40:50,575 [FABRIC-rpc-event-queue] INFO c.d.s.exec.fragment.FragmentExecutor - Cancellation requested for fragment 1e355ec4-08c7-2363-eca2-17d19a33b400:0:0.
2021-12-28 05:40:50,575 [FABRIC-rpc-event-queue] INFO c.dremio.sabot.exec.FragmentHandler - set cancel for fragment 1:1
2021-12-28 05:40:50,575 [FABRIC-rpc-event-queue] INFO c.d.s.exec.fragment.FragmentExecutor - Cancellation requested for fragment 1e355ec4-08c7-2363-eca2-17d19a33b400:1:1.
2021-12-28 05:40:50,575 [FABRIC-rpc-event-queue] INFO c.dremio.sabot.exec.FragmentHandler - set cancel for fragment 1:2
2021-12-28 05:40:50,575 [FABRIC-rpc-event-queue] INFO c.d.s.exec.fragment.FragmentExecutor - Cancellation requested for fragment 1e355ec4-08c7-2363-eca2-17d19a33b400:1:2.
2021-12-28 05:40:50,576 [FABRIC-rpc-event-queue] INFO c.dremio.sabot.exec.FragmentHandler - set cancel for fragment 1:0
2021-12-28 05:40:50,576 [FABRIC-rpc-event-queue] INFO c.d.s.exec.fragment.FragmentExecutor - Cancellation requested for fragment 1e355ec4-08c7-2363-eca2-17d19a33b400:1:0.
2021-12-28 05:40:50,597 [dremio-general-407] WARN com.dremio.datastore.WarningTimer - Reflection Manager took 32255 ms
2021-12-28 05:40:50,713 [e2 - 1e355ec4-08c7-2363-eca2-17d19a33b400:frag:0:0] INFO c.d.s.e.f.FragmentStatusReporter - 1e355ec4-08c7-2363-eca2-17d19a33b400:0:0: State to report: CANCELLED
2021-12-28 05:40:59,020 [dremio-general-407] WARN com.dremio.datastore.WarningTimer - Reflection Manager took 8422 ms
2021-12-28 05:40:59,035 [out-of-band-observer] INFO query.logger - {"queryId":"1e355c93-ac19-fa1a-6743-e038889f0800","schema":"[]","queryText":"DROP TABLE IF EXISTS \"__accelerator\".\"1e044ab2-9b6d-40a4-b3d0-0e3a887e444b\".\"60b66a34-f29c-4d5d-9af9-eac8a241e7ed\"","start":1640670059016,"finish":1640670059033,"outcome":"COMPLETED","username":"$user1$"}
2021-12-28 05:41:01,393 [e4 - 1e355e92-1e7f-b4ea-c189-ae364212a500:frag:1:2] INFO c.d.s.e.f.FragmentStatusReporter - 1e355e92-1e7f-b4ea-c189-ae364212a500:1:2: State to report: FINISHED
2021-12-28 05:41:01,456 [e4 - 1e355e92-1e7f-b4ea-c189-ae364212a500:frag:1:5] INFO c.d.s.e.f.FragmentStatusReporter - 1e355e92-1e7f-b4ea-c189-ae364212a500:1:5: State to report: FINISHED
2021-12-28 05:41:04,274 [e4 - 1e355e92-1e7f-b4ea-c189-ae364212a500:frag:1:4] INFO c.d.s.e.f.FragmentStatusReporter - 1e355e92-1e7f-b4ea-c189-ae364212a500:1:4: State to report: FINISHED
2021-12-28 05:41:05,127 [e4 - 1e355e92-1e7f-b4ea-c189-ae364212a500:frag:1:3] INFO c.d.s.e.f.FragmentStatusReporter - 1e355e92-1e7f-b4ea-c189-ae364212a500:1:3: State to report: FINISHED
2021-12-28 05:41:05,920 [e4 - 1e355e92-1e7f-b4ea-c189-ae364212a500:frag:1:1] INFO c.d.s.e.f.FragmentStatusReporter - 1e355e92-1e7f-b4ea-c189-ae364212a500:1:1: State to report: FINISHED
2021-12-28 05:41:05,925 [e13 - 1e355ec4-08c7-2363-eca2-17d19a33b400:frag:1:2] INFO c.d.s.e.f.FragmentStatusReporter - 1e355ec4-08c7-2363-eca2-17d19a33b400:1:2: State to report: CANCELLED
2021-12-28 05:41:06,260 [e2 - 1e355e92-1e7f-b4ea-c189-ae364212a500:frag:0:0] INFO c.d.s.e.f.FragmentStatusReporter - 1e355e92-1e7f-b4ea-c189-ae364212a500:0:0: State to report: FINISHED
2021-12-28 05:41:06,261 [FABRIC-rpc-event-queue] INFO c.dremio.sabot.exec.FragmentHandler - set cancel for fragment 1:0
2021-12-28 05:41:06,261 [FABRIC-rpc-event-queue] INFO c.d.s.exec.fragment.FragmentExecutor - Cancellation requested for fragment 1e355e92-1e7f-b4ea-c189-ae364212a500:1:0.
2021-12-28 05:41:06,279 [e4 - 1e355e92-1e7f-b4ea-c189-ae364212a500:frag:1:0] INFO c.d.s.e.f.FragmentStatusReporter - 1e355e92-1e7f-b4ea-c189-ae364212a500:1:0: State to report: FINISHED
2021-12-28 05:41:06,291 [out-of-band-observer] INFO query.logger - {"queryId":"1e355e92-1e7f-b4ea-c189-ae364212a500","schema":"[@admin]","queryText":"select * from hdfs.\"WOM_PROD\".DWH.\"WOM_DWH_COMMON.db\".\"FACT_AGREEMENT\"","start":1640669553229,"finish":1640670066281,"outcome":"COMPLETED","username":"admin"}
2021-12-28 05:41:13,048 [dremio-general-407] WARN com.dremio.datastore.WarningTimer - Reflection Manager took 12475 ms
2021-12-28 05:41:13,067 [out-of-band-observer] INFO query.logger - {"queryId":"1e355c86-2111-2deb-0748-ff54fbc5bd00","schema":"[]","queryText":"DROP TABLE IF EXISTS \"__accelerator\".\"0a63f9ae-287c-47eb-aa22-86e59956fd2f\".\"9221b472-35da-4dd3-8f9a-43187e95c175\"","start":1640670073044,"finish":1640670073065,"outcome":"COMPLETED","username":"$user1$"}
2021-12-28 05:41:19,072 [LuceneSearchIndex:committer jobs] WARN com.dremio.datastore.WarningTimer - LuceneSearchIndex commit took 309366 ms
2021-12-28 05:41:27,015 [pool-4065-thread-2] INFO c.d.s.c.CacheEvictionThread - Eviction Request 266 info: remaining bytes to evict 0, current mpMap size 0 and spMap size 0 total iterations 0
2021-12-28 05:41:27,020 [1e355ec4-08c7-2363-eca2-17d19a33b400:foreman] ERROR c.d.exec.work.foreman.AttemptManager - ExecutionSetupException: One or more nodes lost connectivity during query. Identified nodes were [LOCALHOST:0].
com.dremio.common.exceptions.UserException: ExecutionSetupException: One or more nodes lost connectivity during query. Identified nodes were [LOCALHOST:0].
at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:885)
at com.dremio.exec.work.foreman.AttemptManager$AttemptResult.close(AttemptManager.java:661)
at com.dremio.exec.work.foreman.AttemptManager.moveToState(AttemptManager.java:855)
at com.dremio.exec.work.foreman.AttemptManager.access$2200(AttemptManager.java:94)
at com.dremio.exec.work.foreman.AttemptManager$StateSwitch.processEvent(AttemptManager.java:906)
at com.dremio.exec.work.foreman.AttemptManager$StateSwitch.processEvent(AttemptManager.java:898)
at com.dremio.common.EventProcessor.processEvents(EventProcessor.java:105)
at com.dremio.common.EventProcessor.sendEvent(EventProcessor.java:63)
at com.dremio.exec.work.foreman.AttemptManager$StateSwitch.addEvent(AttemptManager.java:901)
at com.dremio.exec.work.foreman.AttemptManager.addToEventQueue(AttemptManager.java:911)
at com.dremio.exec.work.foreman.AttemptManager$CompletionListenerImpl.failed(AttemptManager.java:222)
at com.dremio.exec.maestro.FragmentTracker.checkAndUpdateFirstError(FragmentTracker.java:307)
at com.dremio.exec.maestro.FragmentTracker.handleFailedNodes(FragmentTracker.java:203)
at com.dremio.exec.maestro.FragmentTracker$ExecutorNodeStatusListener.nodesUnregistered(FragmentTracker.java:348)
at com.dremio.service.coordinator.LocalExecutorSetService$1$1.nodesUnregistered(LocalExecutorSetService.java:98)
at com.dremio.service.coordinator.AbstractServiceSet.nodesUnregistered(AbstractServiceSet.java:39)
at com.dremio.service.coordinator.zk.ZKServiceSet.updateEndpoints(ZKServiceSet.java:127)
at com.dremio.service.coordinator.zk.ZKServiceSet.access$000(ZKServiceSet.java:39)
at com.dremio.service.coordinator.zk.ZKServiceSet$EndpointListener.cacheChanged(ZKServiceSet.java:53)
at org.apache.curator.x.discovery.details.ServiceCacheImpl$2.apply(ServiceCacheImpl.java:197)
at org.apache.curator.x.discovery.details.ServiceCacheImpl$2.apply(ServiceCacheImpl.java:193)
at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
at org.apache.curator.x.discovery.details.ServiceCacheImpl.childEvent(ServiceCacheImpl.java:191)
at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:538)
at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:532)
at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:530)
at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35)
at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:808)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.dremio.common.exceptions.ExecutionSetupException: One or more nodes lost connectivity during query. Identified nodes were [LOCALHOST:0].
at com.dremio.exec.maestro.FragmentTracker.handleFailedNodes(FragmentTracker.java:204)
… 27 common frames omitted
2021-12-28 05:41:27,023 [Curator-ServiceCache-0] WARN c.d.exec.work.foreman.AttemptManager - Dropping request to move to COMPLETED state as query 1e355ec4-08c7-2363-eca2-17d19a33b400 is already at FAILED state (which is terminal).
2021-12-28 05:41:27,030 [out-of-band-observer] INFO query.logger - {"queryId":"1e355ec4-08c7-2363-eca2-17d19a33b400","schema":"[]","queryText":"REFRESH REFLECTION '0a3abb1e-1d7a-46ab-a309-da38e1944b27' AS 'dafd27be-dc8b-4ee9-a732-894979081880'","start":1640669499494,"finish":1640670087019,"outcome":"FAILED","outcomeReason":"ExecutionSetupException: One or more nodes lost connectivity during query. Identified nodes were [LOCALHOST:0].","username":"$user1$"}
2021-12-28 05:41:30,389 [e13 - 1e355ec4-08c7-2363-eca2-17d19a33b400:frag:1:0] INFO c.d.s.e.f.FragmentStatusReporter - 1e355ec4-08c7-2363-eca2-17d19a33b400:1:0: State to report: CANCELLED
2021-12-28 05:41:30,448 [e0 - 1e355ec4-08c7-2363-eca2-17d19a33b400:frag:1:1] WARN c.d.e.s.parquet.ParquetRecordWriter - DHL: ParquetFileWriter took too long (writeBlockDeltaTime 341591 and footerWriteAndFlushDeltaTime 17264 milli secs) for writing 5195871 records (268622755 bytes) to file /opt/dremio-community-19.1.0-202111160130570172-0ee00450/data/pdfs/accelerator/0a3abb1e-1d7a-46ab-a309-da38e1944b27/dafd27be-dc8b-4ee9-a732-894979081880_0/LOCALHOST@1_1_0.parquet at 0.714 Mbps
2021-12-28 05:41:30,491 [e0] WARN c.d.sabot.task.slicing.SlicingThread - DHL: Execution of task 1e355ec4-08c7-2363-eca2-17d19a33b400:1:1 took too long. startTime 5684632268308400 ns; stopTime 5684991309968300 ns; delta 359041 ms
2021-12-28 05:41:30,491 [e0] INFO c.d.sabot.task.slicing.SlicingThread - CpuUsageStats: p_avg 0.125000 (tota-cpus 8), p_min 0.061232 (cpu-id 0), p_max 0.153212 (cpu-id 2), cpu-load-score 0.986754 (actual entropy 2.960261, ideal entropy 3.000000)
2021-12-28 05:41:35,536 [SlicingThreadMonitor] INFO c.d.s.t.slicing.SlicingThreadMonitor - Marking thread 0 for Migration candidate as its runningTime is 5044 ms
2021-12-28 05:41:38,037 [SlicingThreadMonitor] INFO c.d.s.t.slicing.SlicingThreadMonitor - Migrating thread 0 as its runningTime is 7545 ms
2021-12-28 05:41:38,037 [SlicingThreadMonitor] INFO c.d.scheduler.cfs.MonitoredThreadMQ - Migrated 0 runQtasks 0 stagingTasks from thread 0 to thread 1
2021-12-28 05:41:39,886 [dremio-general-407] WARN com.dremio.datastore.WarningTimer - Reflection Manager took 26838 ms
2021-12-28 05:41:39,917 [out-of-band-observer] INFO query.logger - {"queryId":"1e355c79-1d34-6ed8-ecd4-1548489c0400","schema":"[]","queryText":"DROP TABLE IF EXISTS \"__accelerator\".\"29926c76-f181-47dc-9b95-31ec89f68f0d\".\"8f01eb52-19ac-4c0f-86ab-2cdcc38836f7\"","start":1640670086312,"finish":1640670099915,"outcome":"COMPLETED","username":"$user1$"}
2021-12-28 05:41:40,537 [SlicingThreadMonitor] INFO c.d.s.t.slicing.SlicingThreadMonitor - Deprecating Slicing Thread0 as its runningTime is 10045 ms
2021-12-28 05:41:51,525 [e0] INFO c.d.sabot.task.slicing.SlicingThread - CpuUsageStats: p_avg 0.125000 (tota-cpus 8), p_min 0.061232 (cpu-id 0), p_max 0.153211 (cpu-id 2), cpu-load-score 0.986753 (actual entropy 2.960259, ideal entropy 3.000000)

@andrean_s - From the logs, I can see the error that caused this issue:

Caused by: com.dremio.common.exceptions.ExecutionSetupException: One or more nodes lost connectivity during query. Identified nodes were [LOCALHOST:0].

Can you check the GC logs to see if a Full GC was happening? A long GC pause on the coordinator can suspend its ZooKeeper session long enough for the node to lose its master election latch, which matches the SUSPENDED / "Lost latch" messages in your log. Also, what are your heap and direct memory settings?
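For example, something like this should show whether any Full GC events appear in the current or rolled-over GC logs (a rough sketch assuming the tarball install directory visible in your log output and the standard HotSpot GC log format; adjust paths to your environment):

cd /opt/dremio-community-19.1.0-202111160130570172-0ee00450/log
# Any hit here means a stop-the-world full collection happened on the coordinator
grep -i "full gc" server.gc server.gc.*
# Optionally, list the longest pauses (assumes the usual "real=... secs" timing lines)
grep -oE "real=[0-9]+\.[0-9]+ secs" server.gc* | sort -t= -k2 -rn | head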

@Nitin_Garg where can I find the GC log, is it server.gc? I didn't find any Full GC happening in server.gc. I'm using DREMIO_MAX_HEAP_MEMORY_SIZE_MB=8192 and DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=16384. I've attached my server.log, server.gc, and server.out, hope it helps, thank you
server.zip (378.1 KB)

@andrean_s

I do not find a Full GC in the GC log file either. Check the rolled-over GC logs; they should be under the same log folder, named server.gc.[1-5]. The master (coordinator) is more heap-intensive than direct-memory-intensive, so I recommend switching the heap and direct memory values and restarting the coordinator. If it still happens, upload all 6 server.gc* files.
DREMIO_MAX_HEAP_MEMORY_SIZE_MB=16384
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=8192
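As a rough sketch of applying that change (assuming the tarball layout shown in your log output and the bundled dremio control script; use systemctl instead if you run Dremio as a systemd service):

# Edit conf/dremio-env on the coordinator and swap the two values
vi /opt/dremio-community-19.1.0-202111160130570172-0ee00450/conf/dremio-env
#   DREMIO_MAX_HEAP_MEMORY_SIZE_MB=16384
#   DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=8192
# Restart so the new JVM settings take effect
/opt/dremio-community-19.1.0-202111160130570172-0ee00450/bin/dremio restart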