Greetings,
Dremio Version: 11
Dremio Type: AWS Edition Free
Problem Statement:
Dremio is trying to contact with dead (terminated) node which was once alive in an engine. The dead node entry stuck somehow (bad scheduler case) in the list of nodes for the same engine which should be empty. Moreover when a query is ran, a new node is spawned (auto off is set) but still the dead one is being contacted with, making a resource dead lock.
Logs:
tracker.json
[scheduler-5] INFO tracker.logger - setBlacklist
[scheduler-5] INFO tracker.logger - blacklist{} engines engine{101.182.1.26:45678}
server.log
[FABRIC] INFO c.d.s.fabric.FabricConnectionManager - [FABRIC]: No connection active, opening new connection to 101.182.1.26:45678.
WARN c.d.telemetry.api.metrics.Metrics - Removing old metric since name matched newly registered metric. Metric name: FABRIC-send-durations-ms
[start-executor41] INFO c.d.p.s.ProvisioningServiceImpl - Action RUNNING on cluster ‘engine’ completed.
[FABRIC-1] ERROR com.dremio.exec.rpc.BasicClient - Failed to establish connection
java.util.concurrent.ExecutionException: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /101.182.1.26:45678
at io.netty.util.concurrent.DefaultPromise.get(DefaultPromise.java:372)
at com.dremio.exec.rpc.BasicClient$ConnectionMultiListener$ConnectionEstablishmentListener.operationComplete(BasicClient.java:284)
at com.dremio.exec.rpc.BasicClient$ConnectionMultiListener$ConnectionEstablishmentListener.operationComplete(BasicClient.java:272)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:577)
at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:570)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:549)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:490)
at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:615)
at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:608)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.fulfillConnectPromise(AbstractNioChannel.java:321)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:337)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /101.182.1.26:45678
Caused by: java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
[FABRIC-1] INFO c.d.services.fabric.FabricClient - [FABRIC]: Channel closed null ↔ null (fabric client)
[FABRIC] WARN c.d.telemetry.api.metrics.Metrics - Removing old metric since name matched newly registered metric. Metric name: FABRIC-send-durations-ms
[1fdb8c75-6b5b-b296-917e-04f6db4bed00:foreman] INFO c.d.exec.maestro.FragmentStarter - User Error Occurred [ErrorId: 07ba4f9b-0068-4917-ad2e-70f258f94912]
com.dremio.common.exceptions.UserException: Error setting up remote fragment execution
at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:804)
at com.dremio.exec.maestro.FragmentStarter.startFragments(FragmentStarter.java:215)
at com.dremio.exec.maestro.FragmentStarter.start(FragmentStarter.java:109)
at com.dremio.exec.maestro.QueryTrackerImpl.startFragments(QueryTrackerImpl.java:149)
at com.dremio.exec.maestro.MaestroServiceImpl.executeQuery(MaestroServiceImpl.java:171)
at com.dremio.exec.work.foreman.AttemptManager.run(AttemptManager.java:365)
at com.dremio.context.RequestContext.run(RequestContext.java:95)
at com.dremio.common.concurrent.ContextMigratingExecutorService.lambda$decorate$3(ContextMigratingExecutorService.java:199)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.dremio.exec.rpc.ConnectionFailedException: Command failed while establishing connection. Failure type CONNECTION.
at com.dremio.exec.rpc.ConnectionFailedException.mapException(ConnectionFailedException.java:31)
at com.dremio.exec.rpc.ListeningCommand.connectionFailed(ListeningCommand.java:65)
at com.dremio.services.fabric.FabricServiceImpl$ProxyCommand.connectionFailed(FabricServiceImpl.java:206)
at com.dremio.exec.rpc.ReconnectingConnection$ConnectionRunner.executeCommand(ReconnectingConnection.java:324)
at com.dremio.exec.rpc.ReconnectingConnection.runCommand(ReconnectingConnection.java:118)
at com.dremio.services.fabric.FabricServiceImpl$CommandRunner.runCommand(FabricServiceImpl.java:183)
at com.dremio.exec.work.rpc.CoordToExecTunnel.startFragments(CoordToExecTunnel.java:47)
at com.dremio.exec.service.executor.ExecutorServiceProductClient.startFragments(ExecutorServiceProductClient.java:51)
at com.dremio.exec.maestro.FragmentStarter.sendStartFragments(FragmentStarter.java:298)
at com.dremio.exec.maestro.FragmentStarter.startFragments(FragmentStarter.java:177)
… 9 common frames omitted
Caused by: com.dremio.exec.rpc.RpcException: Command failed while establishing connection. Failure type CONNECTION.
at com.dremio.exec.rpc.RpcException.mapException(RpcException.java:89)
… 18 common frames omitted
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /101.182.1.26:45678
Caused by: java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
[1fdb8c75-6b5b-b296-917e-04f6db4bed00:foreman] ERROR c.d.exec.maestro.FragmentTracker - Failure while attempting to CANCEL fragments of query 1fdb8c75-6b5b-b296-917e-04f6db4bed00 on endpoint 101.182.1.26:0 with {}.
com.dremio.exec.rpc.ConnectionFailedException: Command failed while establishing connection. Failure type CONNECTION.
at com.dremio.exec.rpc.ConnectionFailedException.mapException(ConnectionFailedException.java:31)
at com.dremio.exec.rpc.ListeningCommand.connectionFailed(ListeningCommand.java:65)
at com.dremio.services.fabric.FabricServiceImpl$ProxyCommand.connectionFailed(FabricServiceImpl.java:206)
at com.dremio.exec.rpc.ReconnectingConnection$ConnectionRunner.executeCommand(ReconnectingConnection.java:324)
at com.dremio.exec.rpc.ReconnectingConnection.runCommand(ReconnectingConnection.java:118)
at com.dremio.services.fabric.FabricServiceImpl$CommandRunner.runCommand(FabricServiceImpl.java:183)
at com.dremio.exec.work.rpc.CoordToExecTunnel.cancelFragments(CoordToExecTunnel.java:57)
at com.dremio.exec.service.executor.ExecutorServiceProductClient.cancelFragments(ExecutorServiceProductClient.java:65)
at com.dremio.exec.maestro.FragmentTracker.cancelExecutingFragmentsInternal(FragmentTracker.java:247)
at com.dremio.exec.maestro.FragmentTracker.cancelExecutingFragments(FragmentTracker.java:231)
at com.dremio.exec.maestro.QueryTrackerImpl.cancel(QueryTrackerImpl.java:179)
at com.dremio.exec.maestro.MaestroServiceImpl.cancelQuery(MaestroServiceImpl.java:183)
at com.dremio.exec.work.foreman.AttemptManager.moveToState(AttemptManager.java:796)
at com.dremio.exec.work.foreman.AttemptManager.run(AttemptManager.java:380)
at com.dremio.context.RequestContext.run(RequestContext.java:95)
at com.dremio.common.concurrent.ContextMigratingExecutorService.lambda$decorate$3(ContextMigratingExecutorService.java:199)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: com.dremio.exec.rpc.RpcException: Command failed while establishing connection. Failure type CONNECTION.
at com.dremio.exec.rpc.RpcException.mapException(RpcException.java:89)
… 18 common frames omitted
Caused by: io.netty.channel.AbstractChannel$AnnotatedNoRouteToHostException: No route to host: /101.182.1.26:45678
Caused by: java.net.NoRouteToHostException: No route to host
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:714)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at java.lang.Thread.run(Thread.java:748)
[1fdb8c75-6b5b-b296-917e-04f6db4bed00:foreman] WARN c.d.exec.work.foreman.AttemptManager - Dropping request to move to FAILED state as query is already at FAILED state (which is terminal).