Query failing because of ZK connection SUSPENDED, RECONNECT results in master killing the job

Nagaraj_Malaiappan · August 23, 2018, 4:03pm

Our Setup: All below in kubernetes cluster

Dremio build: 2.0.5-201806021755260067-767cfb5-mapr, Community Edition.

1 Master pod: 64 GB RAM, 8 Core, functions as coordinator and executor, Heap Memory- 4GB, Direct Memory- 59 GB, zk.client.session.timeout: 90000 (in dremio.conf), Configured to connect to external zk quorum with 3 nodes

3 executor pods: 64 GB RAM, 8 Core, functions executor, Heap Memory- 4GB, Direct Memory- 59 GB, zk.client.session.timeout: 90000 (in dremio.conf), Configured to connect to external zk quorum with 3 nodes

Issue:

Log from Master pod:
2018-08-23 15:08:46,002 [Curator-ServiceCache-0] WARN c.d.exec.work.foreman.QueryManager - Nodes [10.244.8.37:-1] no longer registered in cluster. Canceling query 248132e8-9e82-0d9a-8afa-4561c42f3300 2018-08-23 15:08:46,040 [Curator-ServiceCache-0] INFO query.logger - {“queryId”:“248132e8-9e82-0d9a-8afa-4561c42f3300”,“schema”: ,“start”:1535036694087,“finish”:1535036926028,“outcome”:“FAILED”,“commandDescription”:“execute; query”} 2018-08-23 15:08:46,101 [Curator-ServiceCache-0] ERROR c.d.exec.work.foreman.AttemptManager - ForemanException: One more more nodes lost connectivity during query. Identified nodes were [10.244.8.37:-1]. com.dremio.common.exceptions.UserException: ForemanException: One more more nodes lost connectivity during query. Identified nodes were [10.244.8.37:-1].

Log from slave pod which got zk connection suspended [10.244.8.37:-1].

2018-08-23 15:06:11,534 [FABRIC-rpc-event-queue] WARN c.d.s.fabric.FabricMessageHandler - Message of mode REQUEST for protocol 3 of rpc type 3 took longer than 500ms. Actual duration was 3273ms.
2018-08-23 15:09:36,594 [Curator-ConnectionStateManager-0] INFO c.d.s.coordinator.zk.ZKClusterClient - ZK connection state changed to SUSPENDED
2018-08-23 15:09:37,351 [FABRIC-1] WARN c.d.services.fabric.FabricServer - Message of mode REQUEST of rpc type 1 took longer than 500ms. Actual duration was 946ms.

Help Needed:

Any configuration to avoid the zk connection getting suspended?
Any Configuration in master to increase the slave node’s zk connection loss check [Since slave nodes zk connections are getting reconnected, would it be feasible to wait instead of master cancelling the running job? ]

yufeldman · August 23, 2018, 5:32pm

Any configuration to avoid the zk connection getting suspended?

Usually problems with ZK timeouts (suspending connection) are related to memory pressure in the process itself or on the machine. This can be checked by enabling gc logging.
Regarding timeout value - unless ZK servers set with at least 90 sec. timeout on their side ZK client timeout is min(ZKClientSetTimeout, ZKServerSetTimeout). I believe by default timeout set on ZKServer side is 40 sec.

Any Configuration in master to increase the slave node’s zk connection loss check

It is related to the above and we don’t have much wiggle room there

unni · September 23, 2020, 7:20am

Facing a similar issue Dremio 4.5 deployed on AKS.
The dremio-master pod gets restarted because of health check failures.
We are running 1 master, 2 executor and 3 zookeeper pods. Zookeeper resources are the default configurations provided in the helm chart.

2020-09-23 07:06:07,639 [Curator-ConnectionStateManager-0] INFO  c.d.s.coordinator.zk.ZKClusterClient - ZK connection state changed to SUSPENDED
    2020-09-23 07:06:09,537 [Curator-ConnectionStateManager-0] INFO  c.d.s.coordinator.zk.ZKClusterClient - ZK connection state changed to RECONNECTED
    2020-09-23 07:05:39,615 [Curator-ServiceCache-0] INFO  c.d.s.c.TaskLeaderStatusListener - New Leader node for task MASTER

balaji.ramaswamy · September 24, 2020, 10:47pm

@unni

This can happen due to 2 reasons

#1 Zookeeper pod had an issue, for this we need to review the ZK pod logs
#2 Dremio coordinator (master) had a Full GC. To get to the bottom of this, we need to write GC logs and Histograms to a PVC so even though we lost the pod we have the information

Kubernetes: Open dremio-master.yaml and dremio-executor.yaml under templates
Add below under the DREMIO_JAVA_EXTRA_OPTS section, redeploy pods.

-Xloggc:/opt/dremio/data
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=5
-XX:GCLogFileSize=4000k
-XX:+PrintClassHistogramBeforeFullGC
-XX:+PrintClassHistogramAfterFullGC
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/opt/dremio/data
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:MaxGCPauseMillis=500
-XX:InitiatingHeapOccupancyPercent=25
-XX:ErrorFile=/opt/dremio/data/hs_err_pid%p.log

Once above is done, send us the GC logs

Thanks
Bali

unni · October 14, 2020, 5:46am

Got the same error on dremio master

2020-10-14 04:43:51,992 [Curator-Framework-0] ERROR org.apache.curator.ConnectionState - Connection timed out for connection string (zk-hs:2181) and timeout (5000) / elapsed (33774)

org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss


Zookeeper logs

2020-10-14 04:31:50,884 [myid:2] - INFO  [ProcessThread(sid:2 cport:-1)::PrepRequestProcessor@648] - Got user-level KeeperException when processing sessionid:0x17495920dce005a type:delete cxid:0x3 zxid:0x3026a1e9b txntype:-1 reqpath:n/a Error Path:/dremio/leader-latch/master/_c_180217a9-9fd6-4b72-975b-40c9640f96dc-latch-0000000403 Error:KeeperErrorCode = NoNode for /dremio/leader-latch/master/_c_180217a9-9fd6-4b72-975b-40c9640f96dc-latch-0000000403

balaji.ramaswamy · October 14, 2020, 7:07am

@unni

Check if the GC logs show full GC and have printed histograms, if yes, send it to us
If not, then check for errors in ZK logs, the ZK log snippet you have sent above seems like an INFO level message

unni · October 21, 2020, 2:38am

@balaji.ramaswamy We have not yet enabled the DREMIO_JAVA_EXTRA_OPTS.
But have the some error logs with final GC printed in the logs

2020-10-20 23:33:07,973 [Curator-Framework-0] ERROR org.apache.curator.ConnectionState - Connection timed out for connection string (zk-hs:2181) and timeout (5000) / elapsed (89762)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
	at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
	at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
	at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:835)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
172.38.2.99 - - [20/Oct/2020:23:33:07 +0000] "GET / HTTP/1.1" 200 2522 "-" "kube-probe/1.16"
2020-10-20 23:33:07,974 [zk-curator-8] ERROR ROOT - Dremio is exiting. Node lost its master status.
Dremio is exiting. Node lost its master status.
2020-10-20T23:33:08.828+0000: [Full GC (Ergonomics) [PSYoungGen: 11010048K->9441402K(12317184K)] [ParOldGen: 27306949K->27306919K(27307008K)] 38316997K->36748322K(39624192K), [Metaspace: 158540K->158540K(352256K)], 34.6978057 secs] [Times: user=409.22 sys=0.87, real=34.70 secs]
Heap
 PSYoungGen      total 12317184K, used 9475770K [0x00007f4123b00000, 0x00007f4464500000, 0x00007f4465000000)
  eden space 11010048K, 86% used [0x00007f4123b00000,0x00007f43660cede8,0x00007f43c3b00000)
  from space 1307136K, 0% used [0x00007f43c3b00000,0x00007f43c3b00000,0x00007f4413780000)
  to   space 1295872K, 0% used [0x00007f4415380000,0x00007f4415380000,0x00007f4464500000)
 ParOldGen       total 27307008K, used 27306919K [0x00007f3aa1000000, 0x00007f4123b00000, 0x00007f4123b00000)
  object space 27307008K, 99% used [0x00007f3aa1000000,0x00007f4123ae9f90,0x00007f4123b00000)
 Metaspace       used 158540K, capacity 164083K, committed 350720K, reserved 352256K

balaji.ramaswamy · October 21, 2020, 3:00am

@unni

This hints that we are having a heap/GC issue, are you able to add the parameters, restart, reproduce the issue and get us the GC logs?

unni · October 23, 2020, 4:14pm

gclogs.zip (1.7 MB)
@balaji.ramaswamy Please find the attached GC logs.

Latest error before dremio-master pod restart

2020-10-23 15:56:29,652 [206d0308-bac2-dc4a-dcfd-0fa386ab3700:foreman] ERROR org.apache.curator.ConnectionState - Connection timed out for connection string (zk-hs:2181) and timeout (5000) / elapsed (64872)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
	at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
	at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
	at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
	at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:489)
	at org.apache.curator.framework.imps.CreateBuilderImpl$12.call(CreateBuilderImpl.java:786)
	at org.apache.curator.framework.imps.CreateBuilderImpl$12.call(CreateBuilderImpl.java:778)
	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
	at org.apache.curator.framework.imps.CreateBuilderImpl.findProtectedNodeInForeground(CreateBuilderImpl.java:775)
	at org.apache.curator.framework.imps.CreateBuilderImpl.access$1200(CreateBuilderImpl.java:44)
	at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:733)
	at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:723)
	at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
	at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:720)
	at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:484)
	at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:474)
	at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:454)
	at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:44)
	at org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54)
	at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:217)
	at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:232)
	at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:108)
	at org.apache.curator.framework.recipes.locks.InterProcessSemaphoreV2.internalAcquire1Lease(InterProcessSemaphoreV2.java:344)
	at org.apache.curator.framework.recipes.locks.InterProcessSemaphoreV2.acquire(InterProcessSemaphoreV2.java:284)
	at com.dremio.service.coordinator.zk.ZkDistributedSemaphore.acquire(ZkDistributedSemaphore.java:98)
	at com.dremio.service.coordinator.DistributedSemaphore.acquire(DistributedSemaphore.java:32)
	at com.dremio.resource.basic.BasicResourceAllocator.acquireQuerySemaphoreIfNecessary(BasicResourceAllocator.java:218)
	at com.dremio.resource.basic.BasicResourceAllocator.lambda$allocate$0(BasicResourceAllocator.java:84)
	at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
	at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
	at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
	at com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:322)
	at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
	at com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:66)
	at com.dremio.resource.basic.BasicResourceAllocator.allocate(BasicResourceAllocator.java:83)
	at com.dremio.exec.maestro.ResourceTracker.allocate(ResourceTracker.java:108)
	at com.dremio.exec.maestro.ResourceTracker.<init>(ResourceTracker.java:72)
	at com.dremio.exec.maestro.QueryTrackerImpl.allocateResources(QueryTrackerImpl.java:107)
	at com.dremio.exec.maestro.MaestroServiceImpl.executeQuery(MaestroServiceImpl.java:141)
	at com.dremio.exec.work.foreman.AttemptManager.run(AttemptManager.java:340)
	at com.dremio.common.concurrent.ContextMigratingExecutorService.lambda$decorate$3(ContextMigratingExecutorService.java:192)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Dremio version : 4.5
We had modified the heap & direct memory of the master because of OutOfMemory issue when using the the values calculated based on _helpers.tpl file.

Current master memory details

 env:
- name: DREMIO_MAX_HEAP_MEMORY_SIZE_MB
  value: "40000"
- name: DREMIO_MAX_DIRECT_MEMORY_SIZE_MB
  value: "12000"

balaji.ramaswamy · October 24, 2020, 7:04pm

@unni

I can see the problem, looks like some of your SQL is having very large plans, would you have some profiles of queries that were running during this time?

Total 528042170 34168556512
, 13.6242108 secs]
39968M->4485M(14976M), 68.4342734 secs]
[Eden: 0.0B(1984.0M)->0.0B(8544.0M) Survivors: 0.0B->0.0B Heap: 39968.0M(40000.0M)->4485.9M(14976.0M)], [Metaspace: 154450K->152236K(237568K)]
2020-10-23T15:55:23.017+0000: 32930.222: [Class Histogram (after full gc):
num #instances #bytes class name

1: 20588666 1317674624 com.dremio.exec.planner.sql.handlers.CachingRelMetadataProvider$CacheEntry
2: 20588666 1317674624 com.dremio.exec.planner.sql.handlers.CachingRelMetadataProvider$CacheKey
3: 20660050 991682400 java.util.concurrent.ConcurrentHashMap$Node
4: 2427301 355100656 [C
5: 530 269748392 [Ljava.util.concurrent.ConcurrentHashMap$Node;
6: 1928006 97681680 [Ljava.lang.Object;
7: 2426431 77645792 java.lang.String
8: 1216512 58392576 org.apache.calcite.rex.RexCall
9: 1707496 40979904 com.google.common.collect.RegularImmutableList
10: 402250 35398000 org.apache.calcite.rel.logical.LogicalFilter
11: 17021 11662784 [B
12: 18645 9006488 [Ljava.util.HashMap$Node;
13: 197643 7905720 java.util.ArrayList
14: 134655 6463440 java.util.HashMap$Node
15: 58590 6093360 com.dremio.exec.planner.sql.SqlOperatorImpl
16: 26624 4521232 java.lang.Class

unni · October 28, 2020, 1:01pm

I have shared the profile via mail to dremio@discoursemail.com.

unni · November 2, 2020, 7:53am

e0ef68b5-905f-4f81-9139-017d86ceec6b 2.zip (75.9 KB)

Please find the attached profile. Let me know if anything else can help you debug this.

balaji.ramaswamy · November 2, 2020, 7:58pm

@unni The one you uploaded seems ok, I guess the one that cause the Full GC never completes as the master crashes? Enable “planner.verbose_profile” and see if any of them have a very big logical planning section. I also see you have quite a few reflections that have matched and can cause the plan to grow big. With 4.8 we have default reflections, wherein the planner is very optimized if a certain VDS (not PDS) has all of its columns selected as display columns for a raw reflection.

Can you please try that?

unni · November 3, 2020, 1:28am

We will try out your suggestions. Thank you
But we have set the heap memory to 90GB can a single query be the cause of this crash?

balaji.ramaswamy · November 3, 2020, 7:21am

@unni

I see your heap at 40 GB and you are doing a full GC? Are you able to provide the GC logs when the issue happened at 90 GB, curious if it is the same objects in heap?

unni · November 10, 2020, 10:58am

gclogs_3Nov.zip (1.2 MB) gc-logs-10Nov.zip (3.9 MB)

Please find the GC logs of Nov3 & Nov10.

We are running more thant 150000 queries in a span of 8 hours.

balaji.ramaswamy · November 11, 2020, 8:38am

Thanks @unni

Will review and get back

Topic		Replies	Views
One or more nodes lost connectivity during query	13	3120	November 4, 2019
ExecutionSetupException: One or more nodes lost connectivity during query	4	1892	December 8, 2020
New Instance Failure	28	2173	February 25, 2021
Executor and Master Nodes crash on query	9	1532	January 27, 2022
ExecutionSetupException: One or more nodes lost connectivity during query during reflection creation	1	154	May 13, 2024

Query failing because of ZK connection SUSPENDED, RECONNECT results in master killing the job

Related topics