One or more nodes lost connectivity during query

I have a custom plugin. If the query time is much longer (around 5 mins), it will report the error. Seems ZK cannot get the heartbeat. Do you have any idea?

SYSTEM ERROR: ForemanException: One or more nodes lost connectivity during query.  Identified nodes were [xxx:-1].



  (com.dremio.exec.work.foreman.ForemanException) One or more nodes lost connectivity during query.  Identified nodes were [xxx:-1].
    com.dremio.exec.work.foreman.QueryManager$3.nodesUnregistered():660
    com.dremio.exec.work.foreman.AttemptManager.nodesUnregistered():187
    com.dremio.exec.work.protector.Foreman.nodesUnregistered():154
    com.dremio.exec.work.protector.ForemenWorkManager$NodeStatusListener.nodesUnregistered():294
    com.dremio.service.coordinator.AbstractServiceSet.nodesUnregistered():39
    com.dremio.service.coordinator.zk.ZKServiceSet.updateEndpoints():127
    com.dremio.service.coordinator.zk.ZKServiceSet.access$000():39
    com.dremio.service.coordinator.zk.ZKServiceSet$EndpointListener.cacheChanged():53
    org.apache.curator.x.discovery.details.ServiceCacheImpl$2.apply():177
    org.apache.curator.x.discovery.details.ServiceCacheImpl$2.apply():173
    org.apache.curator.framework.listen.ListenerContainer$1.run():93
    org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute():297
    org.apache.curator.framework.listen.ListenerContainer.forEach():85
    org.apache.curator.x.discovery.details.ServiceCacheImpl.childEvent():171
    org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply():522
    org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply():516
    org.apache.curator.framework.listen.ListenerContainer$1.run():93
    org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute():297
    org.apache.curator.framework.listen.ListenerContainer.forEach():85
    org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners():514
    org.apache.curator.framework.recipes.cache.EventOperation.invoke():35
    org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run():773
    java.util.concurrent.Executors$RunnableAdapter.call():511
    java.util.concurrent.FutureTask.run():266
    java.util.concurrent.Executors$RunnableAdapter.call():511
    java.util.concurrent.FutureTask.run():266
    java.util.concurrent.ThreadPoolExecutor.runWorker():1149
    java.util.concurrent.ThreadPoolExecutor$Worker.run():624
    java.lang.Thread.run():748

More logs:

2019-10-30 06:52:12,014 [Curator-ServiceCache-0] WARN  c.d.e.w.protector.ForemenWorkManager - Foreman 2246cd46-49c6-320b-f6bd-26f6e2ecbb00 failed to handle unregistered nodes [address: "xxx.dev.net"
user_port: -1
fabric_port: 46011
roles {
  sql_query: false
  java_executor: true
  master: false
}
start_time: 1572417900659
provision_id: "container_e86_1572151378458_0126_01_000002"
max_direct_memory: 60129542144
available_cores: 8
]
java.lang.RuntimeException: Exceptions caught during event processing
        at com.dremio.common.EventProcessor.processEvents(EventProcessor.java:116) ~[dremio-common-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.common.EventProcessor.sendEvent(EventProcessor.java:63) ~[dremio-common-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.AttemptManager$StateSwitch.addEvent(AttemptManager.java:714) ~[dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.AttemptManager.addToEventQueue(AttemptManager.java:724) ~[dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.AttemptManager$CompletionListenerImpl.succeeded(AttemptManager.java:169) ~[dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.QueryManager.nodeComplete(QueryManager.java:561) ~[dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.QueryManager.access$900(QueryManager.java:81) ~[dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.QueryManager$NodeTracker.fragmentComplete(QueryManager.java:511) ~[dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.QueryManager$NodeTracker.nodeDead(QueryManager.java:529) ~[dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.QueryManager$3.nodesUnregistered(QueryManager.java:666) ~[dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.AttemptManager.nodesUnregistered(AttemptManager.java:187) ~[dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.protector.Foreman.nodesUnregistered(Foreman.java:154) ~[dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.protector.ForemenWorkManager$NodeStatusListener.nodesUnregistered(ForemenWorkManager.java:294) ~[dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.service.coordinator.AbstractServiceSet.nodesUnregistered(AbstractServiceSet.java:39) [dremio-services-coordinator-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.service.coordinator.zk.ZKServiceSet.updateEndpoints(ZKServiceSet.java:127) [dremio-services-coordinator-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.service.coordinator.zk.ZKServiceSet.access$000(ZKServiceSet.java:39) [dremio-services-coordinator-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.service.coordinator.zk.ZKServiceSet$EndpointListener.cacheChanged(ZKServiceSet.java:53) [dremio-services-coordinator-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at org.apache.curator.x.discovery.details.ServiceCacheImpl$2.apply(ServiceCacheImpl.java:177) [curator-x-discovery-2.12.0.jar:na]
        at org.apache.curator.x.discovery.details.ServiceCacheImpl$2.apply(ServiceCacheImpl.java:173) [curator-x-discovery-2.12.0.jar:na]
        at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) [curator-framework-2.12.0.jar:na]
        at org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) [curator-client-2.12.0.jar:na]
        at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) [curator-framework-2.12.0.jar:na]
        at org.apache.curator.x.discovery.details.ServiceCacheImpl.childEvent(ServiceCacheImpl.java:171) [curator-x-discovery-2.12.0.jar:na]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:522) [curator-recipes-2.12.0.jar:na]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:516) [curator-recipes-2.12.0.jar:na]
        at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) [curator-framework-2.12.0.jar:na]
        at org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) [curator-client-2.12.0.jar:na]
        at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) [curator-framework-2.12.0.jar:na]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:514) [curator-recipes-2.12.0.jar:na]
        at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35) [curator-recipes-2.12.0.jar:na]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:773) [curator-recipes-2.12.0.jar:na]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_212]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_212]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_212]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_212]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_212]
Caused by: java.lang.IllegalStateException: null
        at com.google.common.base.Preconditions.checkState(Preconditions.java:429) ~[guava-20.0.jar:na]
        at com.dremio.exec.work.foreman.AttemptManager$AttemptResult.close(AttemptManager.java:513) ~[dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.AttemptManager.moveToState(AttemptManager.java:693) ~[dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.AttemptManager.access$1900(AttemptManager.java:86) ~[dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.AttemptManager$StateSwitch.processEvent(AttemptManager.java:719) ~[dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.AttemptManager$StateSwitch.processEvent(AttemptManager.java:711) ~[dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.common.EventProcessor.processEvents(EventProcessor.java:105) ~[dremio-common-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        ... 37 common frames omitted
2019-10-30 06:52:12,015 [Curator-ServiceCache-0] WARN  c.d.exec.work.foreman.QueryManager - Nodes [xxx.dev.net:-1] no longer registered in cluster.  Canceling query 2246ce7d-34d1-74bf-4644-37a6bd7a4d00
2019-10-30 06:52:12,017 [Curator-ServiceCache-0] ERROR c.d.exec.work.foreman.AttemptManager - ForemanException: One or more nodes lost connectivity during query.  Identified nodes were [xxx.dev.net:-1].
com.dremio.common.exceptions.UserException: ForemanException: One or more nodes lost connectivity during query.  Identified nodes were [xxx.dev.net:-1].
        at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:773) ~[dremio-common-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.AttemptManager$AttemptResult.close(AttemptManager.java:540) [dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.AttemptManager.moveToState(AttemptManager.java:669) [dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.AttemptManager.access$1900(AttemptManager.java:86) [dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.AttemptManager$StateSwitch.processEvent(AttemptManager.java:719) [dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.AttemptManager$StateSwitch.processEvent(AttemptManager.java:711) [dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.common.EventProcessor.processEvents(EventProcessor.java:105) [dremio-common-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.common.EventProcessor.sendEvent(EventProcessor.java:63) [dremio-common-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.AttemptManager$StateSwitch.addEvent(AttemptManager.java:714) [dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.AttemptManager.addToEventQueue(AttemptManager.java:724) [dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.AttemptManager$CompletionListenerImpl.failed(AttemptManager.java:174) [dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.QueryManager$3.nodesUnregistered(QueryManager.java:659) [dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.foreman.AttemptManager.nodesUnregistered(AttemptManager.java:187) [dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.protector.Foreman.nodesUnregistered(Foreman.java:154) [dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.exec.work.protector.ForemenWorkManager$NodeStatusListener.nodesUnregistered(ForemenWorkManager.java:294) [dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.service.coordinator.AbstractServiceSet.nodesUnregistered(AbstractServiceSet.java:39) [dremio-services-coordinator-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.service.coordinator.zk.ZKServiceSet.updateEndpoints(ZKServiceSet.java:127) [dremio-services-coordinator-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.service.coordinator.zk.ZKServiceSet.access$000(ZKServiceSet.java:39) [dremio-services-coordinator-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at com.dremio.service.coordinator.zk.ZKServiceSet$EndpointListener.cacheChanged(ZKServiceSet.java:53) [dremio-services-coordinator-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        at org.apache.curator.x.discovery.details.ServiceCacheImpl$2.apply(ServiceCacheImpl.java:177) [curator-x-discovery-2.12.0.jar:na]
        at org.apache.curator.x.discovery.details.ServiceCacheImpl$2.apply(ServiceCacheImpl.java:173) [curator-x-discovery-2.12.0.jar:na]
        at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) [curator-framework-2.12.0.jar:na]
        at org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) [curator-client-2.12.0.jar:na]
        at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) [curator-framework-2.12.0.jar:na]
        at org.apache.curator.x.discovery.details.ServiceCacheImpl.childEvent(ServiceCacheImpl.java:171) [curator-x-discovery-2.12.0.jar:na]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:522) [curator-recipes-2.12.0.jar:na]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:516) [curator-recipes-2.12.0.jar:na]
        at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:93) [curator-framework-2.12.0.jar:na]
        at org.apache.curator.shaded.com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297) [curator-client-2.12.0.jar:na]
        at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:85) [curator-framework-2.12.0.jar:na]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:514) [curator-recipes-2.12.0.jar:na]
        at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35) [curator-recipes-2.12.0.jar:na]
        at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:773) [curator-recipes-2.12.0.jar:na]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_212]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_212]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_212]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_212]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_212]
        at java.lang.Thread.run(Thread.java:748) [na:1.8.0_212]
Caused by: com.dremio.exec.work.foreman.ForemanException: One or more nodes lost connectivity during query.  Identified nodes were [xxx.dev.net:-1].
        at com.dremio.exec.work.foreman.QueryManager$3.nodesUnregistered(QueryManager.java:660) [dremio-sabot-kernel-3.2.4-201906051751050278-1bcce62.jar:3.2.4-201906051751050278-1bcce62]
        ... 28 common frames omitted


@popejune

Typically this error means that the executor node mentioned in the error message (“xxx”) did not respond within a certain time. This could be because it ran out of heap or was doing a Full GC. If this is not Yarn then check the log folder to check if there is a heap dump. Also check the server.gc file to see if there were any Full GC’s at that time. If this is a Kuberbetes deployment then use kubectl to get logs and should have GC information. If Yarn then GC/Heap is not enabled by default and we need to enable them

@balaji.ramaswamy, thanks for your response.

It’s using Yarn. How to verify it’s due to Full GC? Could you let me know hwo to enable it in Yarn?

BTW, per my estimate, the query should use 2G mem. But we set the worker mem to 32G. Why it trigger full GC? Is any query job running on a specific JVM and there is heap setting for every query job? If yes, could you let me know how to set the heep size for every job?

@popejune

For Yarn even though worker memory is 32 GB, only 4 GB would go to heap. If you need to increase heap on the container, you have to pass below via admin-provisioning-edit-additional properties-add option

Type: java
Name: provisioning.yarn.heapsize
Value:

To enable GC collection and heap dumps set below via same screen

Type: System
Name: -XX:+PrintGCDetails
Value:

Type: System
Name: -XX:+PrintGCTimeStamps
Value:

Type: System
Name: -XX:+PrintGCDateStamps
Value:

Type: System
Name: -XX:+HeapDumpOnOutOfMemoryError
Value:

Type: System
Name: -XX:HeapDumpPath
Value:

@balaji.ramaswamy
Thanks. Let me try.

Another question, I set the Worker mem to 64G on one node. Why the totoal Mem is only 34G?

@popejune

Where did you set 64 GB? How did you find out it is only 34 GB

The way you do it is to set memory per container via admin-provisioing-edit-memory per container

Then once the containers are up, log on to the data node (Dremio executors) and do a “ps -ef | grep dremio” and see how much direct memory is allocated, it would be (memory per container)-(heap, usually 4)

Thanks
@balaji.ramaswamy

@balaji.ramaswamy

I set this from Dremio UI -> Admin -> Cluster -> Provisioning -> YARN

From the screenshot below, I set 8 cores and 64GB, but the worker only has 1 cores and 34GB. Actually, the node have sufficient mem.
image

BTW, where is the admin-provisioing-edit-memory file? I did not find it from the Dremio doc.

@popejune

Is there a Q limit on RM?

Thanks
@balaji.ramaswamy

@balaji.ramaswamy
We limit the queue to 20% (total 500G), even we increate the limit to 35%, it still only use 34G without any change.

@popejune

Are all queries single threaded? Also what happens if you ask for say 2 executors with each 4 cores and 16 GB of RAM

Can you send me a screen shot of entire Yarn settings after clicking the pencil icon next to trash can?

@balaji.ramaswamy

What you mean “single threaded”? how to config it?

FYI
image

@popejune

Run a query, click on the jobs page, and the job for that query, open the profile and see if the scan is parallelized

I also see you are only requesting 1 worker

Thanks
@balaji.ramaswamy

@balaji.ramaswamy
Thanks for you response. I listed all my questions above which I did not get answer.

  1. How to enable the scan to parallel?
  2. Yes, we are using 1 worker for testing, but the config means 1 worker should have 64G memory . Is there any memory limitation (34G) for one worker?
  3. where is the admin-provisioing-edit-memory file, currently I have to config the properties on Dremio UI -> Admin -> Cluster -> Provisioning -> YARN
  1. How to enable the scan to parallel? This happens automatically based on number of rows, number of files to be scanned and number of cores available per host
  2. Yes, we are using 1 worker for testing, but the config means 1 worker should have 64G memory . Is there any memory limitation (34G) for one worker? - There is no memory limitation. Did you by any chance also install Dremio locally on the worker node?
  3. where is the admin-provisioing-edit-memory file, currently I have to config the properties on Dremio UI -> Admin -> Cluster -> Provisioning -> YARN - That is the only way to do it now