A dataset query/reflection that we have used for over a year, has started failing with this error. It is a relatively complex query but I see no reason why it has started to fail. I’ve added debug logging but do not see anything jumping out. I’ve also added further memory to the host and that has not made any difference - can you suggest any logging/investigation that I can do please? We only have one host, there are no separate worker nodes.
2023-12-29 14:20:55,804 [e2 - 1a712ab5-6287-fb08-5cbc-852f6947a000:frag:1:0] DEBUG com.dremio.sabot.driver.SmartOp - Closing operator: 1:0:5
2023-12-29 14:20:55,804 [Curator-ServiceCache-0] INFO c.d.s.c.TaskLeaderStatusListener - New Leader node for task MASTER dremioHost:45679 registered itself.
2023-12-29 14:20:55,804 [e2 - 1a712ab5-6287-fb08-5cbc-852f6947a000:frag:1:0] DEBUG com.dremio.sabot.driver.SmartOp - Closing operator: 1:0:6
2023-12-29 14:20:55,804 [e2 - 1a712ab5-6287-fb08-5cbc-852f6947a000:frag:1:0] DEBUG com.dremio.sabot.driver.SmartOp - Closing operator: 1:0:7
2023-12-29 14:20:55,807 [1a712ab5-6287-fb08-5cbc-852f6947a000:foreman] DEBUG c.d.exec.work.foreman.AttemptManager - 1a712ab5-6287-fb08-5cbc-852f6947a000: cleaning up.
2023-12-29 14:20:55,807 [1a712ab5-6287-fb08-5cbc-852f6947a000:foreman] ERROR c.d.exec.work.foreman.AttemptManager - ExecutionSetupException: One or more nodes lost connectivity during query. Identified nodes were [dremioHost:0].
com.dremio.common.exceptions.UserException: ExecutionSetupException: One or more nodes lost connectivity during query. Identified nodes were [dremioHost:0].
at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:885)
at com.dremio.exec.work.foreman.AttemptManager$AttemptResult.close(AttemptManager.java:680)
at com.dremio.exec.work.foreman.AttemptManager.moveToState(AttemptManager.java:874)
at com.dremio.exec.work.foreman.AttemptManager.access$2400(AttemptManager.java:96)
at com.dremio.exec.work.foreman.AttemptManager$StateSwitch.processEvent(AttemptManager.java:925)
at com.dremio.exec.work.foreman.AttemptManager$StateSwitch.processEvent(AttemptManager.java:917)
at com.dremio.common.EventProcessor.processEvents(EventProcessor.java:105)
at com.dremio.common.EventProcessor.sendEvent(EventProcessor.java:63)
at com.dremio.exec.work.foreman.AttemptManager$StateSwitch.addEvent(AttemptManager.java:920)
at com.dremio.exec.work.foreman.AttemptManager.addToEventQueue(AttemptManager.java:930)
at com.dremio.exec.work.foreman.AttemptManager$CompletionListenerImpl.failed(AttemptManager.java:231)
at com.dremio.exec.maestro.FragmentTracker.checkAndUpdateFirstError(FragmentTracker.java:307)
at com.dremio.exec.maestro.FragmentTracker.handleFailedNodes(FragmentTracker.java:203)
at com.dremio.exec.maestro.FragmentTracker$ExecutorNodeStatusListener.nodesUnregistered(FragmentTracker.java:348)
at com.dremio.service.coordinator.LocalExecutorSetService$1$1.nodesUnregistered(LocalExecutorSetService.java:98)
at com.dremio.service.coordinator.AbstractServiceSet.nodesUnregistered(AbstractServiceSet.java:39)
at com.dremio.service.coordinator.zk.ZKServiceSet.updateEndpoints(ZKServiceSet.java:127)
at com.dremio.service.coordinator.zk.ZKServiceSet.access$000(ZKServiceSet.java:39)
at com.dremio.service.coordinator.zk.ZKServiceSet$EndpointListener.cacheChanged(ZKServiceSet.java:53)
at org.apache.curator.x.discovery.details.ServiceCacheImpl$2.apply(ServiceCacheImpl.java:197)
at org.apache.curator.x.discovery.details.ServiceCacheImpl$2.apply(ServiceCacheImpl.java:193)
at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
at org.apache.curator.x.discovery.details.ServiceCacheImpl.childEvent(ServiceCacheImpl.java:191)
at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:538)
at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:532)
at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:530)
at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35)
at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:808)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: com.dremio.common.exceptions.ExecutionSetupException: One or more nodes lost connectivity during query. Identified nodes were [dremioHost:0].
at com.dremio.exec.maestro.FragmentTracker.handleFailedNodes(FragmentTracker.java:204)
… 27 common frames omitted
2023-12-29 14:20:55,811 [e3 - 1a712ab5-6287-fb08-5cbc-852f6947a000:frag:3:1] DEBUG c.d.exec.compile.ClassTransformer - Done compiling (bytecode size=14.9 KiB, time:16 millis).
2023-12-29 14:20:55,820 [Curator-ServiceCache-0] DEBUG c.d.exec.maestro.FragmentTracker - node completed for query 1a712ab5-6287-fb08-5cbc-852f6947a000 from node dremioHost:45679, pending nodes
2023-12-29 14:20:55,829 [Curator-ServiceCache-0] DEBUG c.d.exec.work.foreman.AttemptManager - 1a712ab5-6287-fb08-5cbc-852f6947a000: State change requested FAILED → COMPLETED
2023-12-29 14:20:55,829 [Curator-ServiceCache-0] WARN c.d.exec.work.foreman.AttemptManager - Dropping request to move to COMPLETED state as query 1a712ab5-6287-fb08-5cbc-852f6947a000 is already at FAILED state (which is terminal).
The dataset queries a number of folders of JSON data on an S3 bucket. Other queries on those S3 folders work without issue.
WITH ordered_basket_execs AS (
SELECT e1.dir0 as “BasketDate”,
case
when foMisc9.message.body.FOMisc9 is not null then foMisc9.message.body.FOMisc9
else e1.message.body.BasketName
end as BasketName,
e1.message.body.OrderId, e1.message.body.ClOrdID, e1.message.body.TransactTime, e1.message.body.NoAllocs[0].AllocAccount as “AllocAccount”,
e1.message.body.SecurityDesc, e1.message.body.OrdType[1] as “OrdType”, e1.message.body.TimeInForce[1] as “TimeInForce”,
newOrder.message.body.Text as “Text”, e1.message.body.CumQty, e1.message.body.AvgPx, e1.message.body.Price, e1.message.body.Currency, e1.message.body.OrderQty, e1.message.body.OrdStatus[0] as “OrdStatus”,
e1.message.body.SecurityExchange,
ROW_NUMBER() OVER (PARTITION BY e1.message.body.OrderId ORDER BY Right(e1.message.body.ExecID,20) DESC) AS rn
FROM s3Bucket.execReportsByDateGrouped e1
LEFT JOIN s3Bucket.execReportsByDateGrouped foMisc9
ON e1.message.body.OrderId = foMisc9.message.body.OrderId
AND foMisc9.message.body.FoMisc9 is not null
LEFT JOIN s3Bucket.ordersByDateGrouped newOrder
ON TO_DATE(newOrder.dir0, ‘YYYYMMDD’, 0) >= CURRENT_DATE - INTERVAL ‘5’ DAY
and e1.message.body.ClOrdId = newOrder.message.body.ClOrdId
where TO_DATE(e1.dir0, ‘YYYYMMDD’, 0) >= CURRENT_DATE - INTERVAL ‘5’ DAY
and ((e1.message.body.BasketName is not null) or (foMisc9.message.body.FOMisc9 is not null))
)
SELECT * FROM ordered_basket_execs WHERE rn = 1
AND OrdStatus!=‘4’