Dremio build: 2.0.5-201806021755260067-767cfb5-mapr, Community Edition.
1 master pod: 64 GB RAM, 8 cores; functions as coordinator and executor; heap memory: 4 GB, direct memory: 59 GB; zk.client.session.timeout: 90000 (in dremio.conf); configured to connect to an external ZK quorum with 3 nodes.
3 executor pods: 64 GB RAM, 8 cores; function as executors; heap memory: 4 GB, direct memory: 59 GB; zk.client.session.timeout: 90000 (in dremio.conf); configured to connect to the same external ZK quorum.
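For reference, a minimal sketch of the relevant dremio.conf entries on each pod, assuming standard HOCON syntax; the zk-1/zk-2/zk-3 quorum hostnames are placeholders, not values from this cluster:

# dremio.conf (sketch)
zookeeper: "zk-1:2181,zk-2:2181,zk-3:2181"
zk.client.session.timeout: 90000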
Issue:
Log from the master pod:
2018-08-23 15:08:46,002 [Curator-ServiceCache-0] WARN c.d.exec.work.foreman.QueryManager - Nodes [10.244.8.37:-1] no longer registered in cluster. Canceling query 248132e8-9e82-0d9a-8afa-4561c42f3300
2018-08-23 15:08:46,040 [Curator-ServiceCache-0] INFO query.logger - {"queryId":"248132e8-9e82-0d9a-8afa-4561c42f3300","schema": ,"start":1535036694087,"finish":1535036926028,"outcome":"FAILED","commandDescription":"execute; query"}
2018-08-23 15:08:46,101 [Curator-ServiceCache-0] ERROR c.d.exec.work.foreman.AttemptManager - ForemanException: One more more nodes lost connectivity during query. Identified nodes were [10.244.8.37:-1]. com.dremio.common.exceptions.UserException: ForemanException: One more more nodes lost connectivity during query. Identified nodes were [10.244.8.37:-1].
Log from the executor (slave) pod whose ZK connection got suspended (10.244.8.37:-1):
2018-08-23 15:06:11,534 [FABRIC-rpc-event-queue] WARN c.d.s.fabric.FabricMessageHandler - Message of mode REQUEST for protocol 3 of rpc type 3 took longer than 500ms. Actual duration was 3273ms.
2018-08-23 15:09:36,594 [Curator-ConnectionStateManager-0] INFO c.d.s.coordinator.zk.ZKClusterClient - ZK connection state changed to SUSPENDED
2018-08-23 15:09:37,351 [FABRIC-1] WARN c.d.services.fabric.FabricServer - Message of mode REQUEST of rpc type 1 took longer than 500ms. Actual duration was 946ms.
Help Needed:
Is there any configuration to avoid the ZK connection getting suspended?
Is there any configuration on the master to relax its check for executor (slave) ZK connection loss? Since the executors' ZK connections do get reconnected, would it be feasible for the master to wait instead of canceling the running job?
Any configuration to avoid the ZK connection getting suspended?
Problems with ZK timeouts (the connection being suspended) are usually caused by memory pressure in the Dremio process itself or on the machine. This can be checked by enabling GC logging.
Regarding the timeout value: the effective session timeout is min(ZKClientSetTimeout, ZKServerSetTimeout), so unless the ZK servers are configured with at least a 90-second timeout on their side, the client's 90-second setting will not take effect. I believe the default maximum session timeout on the ZK server side is 40 seconds.
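As an illustration, the corresponding server-side zoo.cfg settings might look like the following sketch (values are examples, not recommendations):

# zoo.cfg (sketch)
tickTime=2000
# The server clamps client session timeouts to [minSessionTimeout, maxSessionTimeout].
# maxSessionTimeout defaults to 20 * tickTime = 40000 ms, so raise it
# if a 90-second client timeout is supposed to take effect.
maxSessionTimeout=90000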
Any configuration on the master to increase the slave node's ZK connection-loss check?
It is related to the above, and we don't have much wiggle room there.
Facing a similar issue with Dremio 4.5 deployed on AKS.
The dremio-master pod gets restarted because of health-check failures.
We are running 1 master, 2 executor, and 3 ZooKeeper pods. The ZooKeeper resources are the default configuration provided in the Helm chart.
2020-09-23 07:06:07,639 [Curator-ConnectionStateManager-0] INFO c.d.s.coordinator.zk.ZKClusterClient - ZK connection state changed to SUSPENDED
2020-09-23 07:06:09,537 [Curator-ConnectionStateManager-0] INFO c.d.s.coordinator.zk.ZKClusterClient - ZK connection state changed to RECONNECTED
2020-09-23 07:05:39,615 [Curator-ServiceCache-0] INFO c.d.s.c.TaskLeaderStatusListener - New Leader node for task MASTER
#1 The ZooKeeper pod had an issue; for this we need to review the ZK pod logs.
#2 The Dremio coordinator (master) had a Full GC. To get to the bottom of this, we need to write GC logs and histograms to a PVC, so that even if we lose the pod we still have the information.
Kubernetes: open dremio-master.yaml and dremio-executor.yaml under templates.
Add the flags below under the DREMIO_JAVA_EXTRA_OPTS section, then redeploy the pods.
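The exact flags from the original reply are not preserved in this thread. As an illustration only, a typical Java 8 GC-logging set writing to a PVC-backed path might look like the sketch below; the /opt/dremio/data path and the log file name are assumptions, so adjust them to wherever your PVC is mounted:

# Sketch of an env entry in dremio-master.yaml / dremio-executor.yaml
- name: DREMIO_JAVA_EXTRA_OPTS
  value: >-
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+PrintClassHistogramBeforeFullGC
    -XX:+PrintClassHistogramAfterFullGC
    -Xloggc:/opt/dremio/data/gc-%t.log
    -XX:+UseGCLogFileRotation
    -XX:NumberOfGCLogFiles=5
    -XX:GCLogFileSize=100M

The two histogram flags are what produce the "Class Histogram (after full gc)" sections quoted later in this thread.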
@balaji.ramaswamy We have not yet enabled DREMIO_JAVA_EXTRA_OPTS.
But we do have some error logs, with the final GC printed in them:
2020-10-20 23:33:07,973 [Curator-Framework-0] ERROR org.apache.curator.ConnectionState - Connection timed out for connection string (zk-hs:2181) and timeout (5000) / elapsed (89762)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:835)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:809)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$300(CuratorFrameworkImpl.java:64)
at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:267)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
172.38.2.99 - - [20/Oct/2020:23:33:07 +0000] "GET / HTTP/1.1" 200 2522 "-" "kube-probe/1.16"
2020-10-20 23:33:07,974 [zk-curator-8] ERROR ROOT - Dremio is exiting. Node lost its master status.
Dremio is exiting. Node lost its master status.
2020-10-20T23:33:08.828+0000: [Full GC (Ergonomics) [PSYoungGen: 11010048K->9441402K(12317184K)] [ParOldGen: 27306949K->27306919K(27307008K)] 38316997K->36748322K(39624192K), [Metaspace: 158540K->158540K(352256K)], 34.6978057 secs] [Times: user=409.22 sys=0.87, real=34.70 secs]
Heap
PSYoungGen total 12317184K, used 9475770K [0x00007f4123b00000, 0x00007f4464500000, 0x00007f4465000000)
eden space 11010048K, 86% used [0x00007f4123b00000,0x00007f43660cede8,0x00007f43c3b00000)
from space 1307136K, 0% used [0x00007f43c3b00000,0x00007f43c3b00000,0x00007f4413780000)
to space 1295872K, 0% used [0x00007f4415380000,0x00007f4415380000,0x00007f4464500000)
ParOldGen total 27307008K, used 27306919K [0x00007f3aa1000000, 0x00007f4123b00000, 0x00007f4123b00000)
object space 27307008K, 99% used [0x00007f3aa1000000,0x00007f4123ae9f90,0x00007f4123b00000)
Metaspace used 158540K, capacity 164083K, committed 350720K, reserved 352256K
2020-10-23 15:56:29,652 [206d0308-bac2-dc4a-dcfd-0fa386ab3700:foreman] ERROR org.apache.curator.ConnectionState - Connection timed out for connection string (zk-hs:2181) and timeout (5000) / elapsed (64872)
org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
at org.apache.curator.ConnectionState.checkTimeouts(ConnectionState.java:225)
at org.apache.curator.ConnectionState.getZooKeeper(ConnectionState.java:94)
at org.apache.curator.CuratorZookeeperClient.getZooKeeper(CuratorZookeeperClient.java:117)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.getZooKeeper(CuratorFrameworkImpl.java:489)
at org.apache.curator.framework.imps.CreateBuilderImpl$12.call(CreateBuilderImpl.java:786)
at org.apache.curator.framework.imps.CreateBuilderImpl$12.call(CreateBuilderImpl.java:778)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
at org.apache.curator.framework.imps.CreateBuilderImpl.findProtectedNodeInForeground(CreateBuilderImpl.java:775)
at org.apache.curator.framework.imps.CreateBuilderImpl.access$1200(CreateBuilderImpl.java:44)
at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:733)
at org.apache.curator.framework.imps.CreateBuilderImpl$11.call(CreateBuilderImpl.java:723)
at org.apache.curator.RetryLoop.callWithRetry(RetryLoop.java:109)
at org.apache.curator.framework.imps.CreateBuilderImpl.pathInForeground(CreateBuilderImpl.java:720)
at org.apache.curator.framework.imps.CreateBuilderImpl.protectedPathInForeground(CreateBuilderImpl.java:484)
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:474)
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:454)
at org.apache.curator.framework.imps.CreateBuilderImpl.forPath(CreateBuilderImpl.java:44)
at org.apache.curator.framework.recipes.locks.StandardLockInternalsDriver.createsTheLock(StandardLockInternalsDriver.java:54)
at org.apache.curator.framework.recipes.locks.LockInternals.attemptLock(LockInternals.java:217)
at org.apache.curator.framework.recipes.locks.InterProcessMutex.internalLock(InterProcessMutex.java:232)
at org.apache.curator.framework.recipes.locks.InterProcessMutex.acquire(InterProcessMutex.java:108)
at org.apache.curator.framework.recipes.locks.InterProcessSemaphoreV2.internalAcquire1Lease(InterProcessSemaphoreV2.java:344)
at org.apache.curator.framework.recipes.locks.InterProcessSemaphoreV2.acquire(InterProcessSemaphoreV2.java:284)
at com.dremio.service.coordinator.zk.ZkDistributedSemaphore.acquire(ZkDistributedSemaphore.java:98)
at com.dremio.service.coordinator.DistributedSemaphore.acquire(DistributedSemaphore.java:32)
at com.dremio.resource.basic.BasicResourceAllocator.acquireQuerySemaphoreIfNecessary(BasicResourceAllocator.java:218)
at com.dremio.resource.basic.BasicResourceAllocator.lambda$allocate$0(BasicResourceAllocator.java:84)
at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:125)
at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:69)
at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:78)
at com.google.common.util.concurrent.MoreExecutors$DirectExecutorService.execute(MoreExecutors.java:322)
at java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:134)
at com.google.common.util.concurrent.AbstractListeningExecutorService.submit(AbstractListeningExecutorService.java:66)
at com.dremio.resource.basic.BasicResourceAllocator.allocate(BasicResourceAllocator.java:83)
at com.dremio.exec.maestro.ResourceTracker.allocate(ResourceTracker.java:108)
at com.dremio.exec.maestro.ResourceTracker.<init>(ResourceTracker.java:72)
at com.dremio.exec.maestro.QueryTrackerImpl.allocateResources(QueryTrackerImpl.java:107)
at com.dremio.exec.maestro.MaestroServiceImpl.executeQuery(MaestroServiceImpl.java:141)
at com.dremio.exec.work.foreman.AttemptManager.run(AttemptManager.java:340)
at com.dremio.common.concurrent.ContextMigratingExecutorService.lambda$decorate$3(ContextMigratingExecutorService.java:192)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Dremio version: 4.5
We had modified the heap and direct memory of the master because of an OutOfMemory issue when using the values calculated from the _helpers.tpl file.
I can see the problem; it looks like some of your SQL has very large plans. Would you have profiles of the queries that were running during this time?
GC log excerpt from the Full GC on the 40 GB heap (class histogram entries elided):
2020-10-23T15:55:23.017+0000: 32930.222: [Class Histogram (after full gc):
num     #instances         #bytes  class name
...
Total    528042170    34168556512
, 13.6242108 secs]
39968M->4485M(14976M), 68.4342734 secs]
[Eden: 0.0B(1984.0M)->0.0B(8544.0M) Survivors: 0.0B->0.0B Heap: 39968.0M(40000.0M)->4485.9M(14976.0M)], [Metaspace: 154450K->152236K(237568K)]
@unni The one you uploaded seems OK; I guess the one that caused the Full GC never completed because the master crashed? Enable "planner.verbose_profile" and see whether any of the profiles have a very large logical-planning section. I also see you have quite a few reflections that matched, which can cause the plan to grow big. With 4.8 we have default reflections: the planner is much more efficient when a VDS (not a PDS) has all of its columns selected as display columns in a raw reflection.
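If it helps, one way to flip that support key from SQL, rather than through the Support settings page in the UI, should be along these lines:

-- Enable verbose planning detail in query profiles (support key)
ALTER SYSTEM SET "planner.verbose_profile" = true;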
I see your heap at 40 GB and you are already hitting Full GCs? Are you able to provide the GC logs from when the issue happened with the 90 GB heap? I am curious whether the same objects fill the heap.