Dremio scheduler picking up terminated node for query execution causing resource deadlock

Greetings,

Dremio Version: 11
Dremio Type: AWS Edition Free

Problem Statement:
Dremio is trying to contact with dead (terminated) node which was once alive in an engine. The dead node entry stuck somehow (bad scheduler case) in the list of nodes for the same engine which should be empty. Moreover when a query is ran, a new node is spawned (auto off is set) but still the dead one is being contacted with, making a resource dead lock.

Logs:

tracker.json
[scheduler-5] INFO tracker.logger - setBlacklist
[scheduler-5] INFO tracker.logger - blacklist{} engines engine{101.182.1.26:45678}

server.log
[FABRIC] INFO c.d.s.fabric.FabricConnectionManager - [FABRIC]: No connection active, opening new connection to 101.182.1.26:45678.
WARN c.d.telemetry.api.metrics.Metrics - Removing old metric since name matched newly registered metric. Metric name: FABRIC-send-durations-ms
[start-executor41] INFO c.d.p.s.ProvisioningServiceImpl - Action RUNNING on cluster ‘engine’ completed.
[FABRIC-1] ERROR com.dremio.exec.rpc.BasicClient - Failed to establish connection
java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /101.182.1.26:45678
at io.netty.util.concurrent.DefaultPromise.get(DefaultPromise.java:372)
at com.dremio.exec.rpc.BasicClient$ConnectionMultiListener$ConnectionEstablishmentListener.operationComplete(BasicClient.java:284)
at com.dremio.exec.rpc.BasicClient$ConnectionMultiListener$ConnectionEstablishmentListener.operationComplete(BasicClient.java:272)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /101.182.1.26:45678
Caused by: java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
[FABRIC-1] INFO c.d.services.fabric.FabricClient - [FABRIC]: Channel closed null ↔ null (fabric client)
[FABRIC] WARN c.d.telemetry.api.metrics.Metrics - Removing old metric since name matched newly registered metric. Metric name: FABRIC-send-durations-ms
[1fdb8c75-6b5b-b296-917e-04f6db4bed00:foreman] INFO c.d.exec.maestro.FragmentStarter - User Error Occurred [ErrorId: 07ba4f9b-0068-4917-ad2e-70f258f94912]
com.dremio.common.exceptions.UserException: Error setting up remote fragment execution
at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:804)
at com.dremio.exec.maestro.FragmentStarter.startFragments(FragmentStarter.java:215)
at com.dremio.exec.maestro.FragmentStarter.start(FragmentStarter.java:109)
at com.dremio.exec.maestro.QueryTrackerImpl.startFragments(QueryTrackerImpl.java:149)
at com.dremio.exec.maestro.MaestroServiceImpl.executeQuery(MaestroServiceImpl.java:171)
at com.dremio.exec.work.foreman.AttemptManager.run(AttemptManager.java:365)
at com.dremio.context.RequestContext.run(RequestContext.java:95)
at com.dremio.common.concurrent.ContextMigratingExecutorService.lambda$decorate$3(ContextMigratingExecutorService.java:199)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.dremio.exec.rpc.ConnectionFailedException: Command failed while establishing connection. Failure type CONNECTION.
at com.dremio.exec.rpc.ConnectionFailedException.mapException(ConnectionFailedException.java:31)
at com.dremio.exec.rpc.ListeningCommand.connectionFailed(ListeningCommand.java:65)
at com.dremio.services.fabric.FabricServiceImpl$ProxyCommand.connectionFailed(FabricServiceImpl.java:206)
at com.dremio.exec.rpc.ReconnectingConnection$ConnectionRunner.executeCommand(ReconnectingConnection.java:324)
at com.dremio.exec.rpc.ReconnectingConnection.runCommand(ReconnectingConnection.java:118)
at com.dremio.services.fabric.FabricServiceImpl$CommandRunner.runCommand(FabricServiceImpl.java:183)
at com.dremio.exec.work.rpc.CoordToExecTunnel.startFragments(CoordToExecTunnel.java:47)
at com.dremio.exec.service.executor.ExecutorServiceProductClient.startFragments(ExecutorServiceProductClient.java:51)
at com.dremio.exec.maestro.FragmentStarter.sendStartFragments(FragmentStarter.java:298)
at com.dremio.exec.maestro.FragmentStarter.startFragments(FragmentStarter.java:177)
… 9 common frames omitted
Caused by: com.dremio.exec.rpc.RpcException: Command failed while establishing connection. Failure type CONNECTION.
at com.dremio.exec.rpc.RpcException.mapException(RpcException.java:89)
… 18 common frames omitted
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /101.182.1.26:45678
Caused by: java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
[1fdb8c75-6b5b-b296-917e-04f6db4bed00:foreman] ERROR c.d.exec.maestro.FragmentTracker - Failure while attempting to CANCEL fragments of query 1fdb8c75-6b5b-b296-917e-04f6db4bed00 on endpoint 101.182.1.26:0 with {}.
com.dremio.exec.rpc.ConnectionFailedException: Command failed while establishing connection. Failure type CONNECTION.
at com.dremio.exec.rpc.ConnectionFailedException.mapException(ConnectionFailedException.java:31)
at com.dremio.exec.rpc.ListeningCommand.connectionFailed(ListeningCommand.java:65)
at com.dremio.services.fabric.FabricServiceImpl$ProxyCommand.connectionFailed(FabricServiceImpl.java:206)
at com.dremio.exec.rpc.ReconnectingConnection$ConnectionRunner.executeCommand(ReconnectingConnection.java:324)
at com.dremio.exec.rpc.ReconnectingConnection.runCommand(ReconnectingConnection.java:118)
at com.dremio.services.fabric.FabricServiceImpl$CommandRunner.runCommand(FabricServiceImpl.java:183)
at com.dremio.exec.work.rpc.CoordToExecTunnel.cancelFragments(CoordToExecTunnel.java:57)
at com.dremio.exec.service.executor.ExecutorServiceProductClient.cancelFragments(ExecutorServiceProductClient.java:65)
at com.dremio.exec.maestro.FragmentTracker.cancelExecutingFragmentsInternal(FragmentTracker.java:247)
at com.dremio.exec.maestro.FragmentTracker.cancelExecutingFragments(FragmentTracker.java:231)
at com.dremio.exec.maestro.QueryTrackerImpl.cancel(QueryTrackerImpl.java:179)
at com.dremio.exec.maestro.MaestroServiceImpl.cancelQuery(MaestroServiceImpl.java:183)
at com.dremio.exec.work.foreman.AttemptManager.moveToState(AttemptManager.java:796)
at com.dremio.exec.work.foreman.AttemptManager.run(AttemptManager.java:380)
at com.dremio.context.RequestContext.run(RequestContext.java:95)
at com.dremio.common.concurrent.ContextMigratingExecutorService.lambda$decorate$3(ContextMigratingExecutorService.java:199)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.dremio.exec.rpc.RpcException: Command failed while establishing connection. Failure type CONNECTION.
at com.dremio.exec.rpc.RpcException.mapException(RpcException.java:89)
… 18 common frames omitted
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /101.182.1.26:45678
Caused by: java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
[1fdb8c75-6b5b-b296-917e-04f6db4bed00:foreman] WARN c.d.exec.work.foreman.AttemptManager - Dropping request to move to FAILED state as query is already at FAILED state (which is terminal).

@shdkpr2008

There should be a job profile associated with this failure, can you please send that to us?

@balaji.ramaswamy

Sharing a query profile might be exposing some confidential information, can you please be specific in what information will be useful from the profile? I will try to provided it separately.

Below is some of the information from profile:

Job profile infomration
Execution Plan: Executor Selection (0 ms)
idealNumFragments: 2
idealNumNodes : 1
numExecutors : 2
details : preview:1,engine:2,blacklist(0): selectedEndpoints: 101.182.1.26:45678, 101.122.1.29:45678 hardAffinity: false
Execution Plan: Fragment Assignment (4 ms)

Error ID: 07ba4f9b-0068-4917-ad2e-70f258f94912

I think that means your engine node is blacklisted. Can you post a screenshot or list of engine nodes from earlier in the Engines admin section. Might that IP address the only running master coordinator instance of Dremio? I have a similar issue mine says “preview” instead of “engine” and is a total resource deadlock outage this morning Dremio 13.latest AWS Edition.

blacklist{} engines preview{10.12.66.94:45679}
tracker.logger - setBlacklist

I’m going to try https://docs.dremio.com/rest-api/nodeCollections/view-denied-nodes.html?h=blacklist also

No, the blacklist was empty and after engines the “engine” is the name of the engine.

I don’t have the exact scenario currently, what I did was to rename the engine name from “engine” to “engine_n”, that kicked new instances provisioning and all and that worked. Although the old “engine” was up and got cleared over course of time.

The IP address was that of a terminated node, and each time the request for forcing a new instance provisioned, but due to the dead node reference in the engine list, newer once were ignored.