Previously stable dataset has started failing with "One or more nodes lost connectivity during query"

mildlyth · December 29, 2023, 2:33pm

A dataset query/reflection that we have used for over a year, has started failing with this error. It is a relatively complex query but I see no reason why it has started to fail. I’ve added debug logging but do not see anything jumping out. I’ve also added further memory to the host and that has not made any difference - can you suggest any logging/investigation that I can do please? We only have one host, there are no separate worker nodes.

2023-12-29 14:20:55,804 [e2 - 1a712ab5-6287-fb08-5cbc-852f6947a000:frag:1:0] DEBUG com.dremio.sabot.driver.SmartOp - Closing operator: 1:0:5
2023-12-29 14:20:55,804 [Curator-ServiceCache-0] INFO c.d.s.c.TaskLeaderStatusListener - New Leader node for task MASTER dremioHost:45679 registered itself.
2023-12-29 14:20:55,804 [e2 - 1a712ab5-6287-fb08-5cbc-852f6947a000:frag:1:0] DEBUG com.dremio.sabot.driver.SmartOp - Closing operator: 1:0:6
2023-12-29 14:20:55,804 [e2 - 1a712ab5-6287-fb08-5cbc-852f6947a000:frag:1:0] DEBUG com.dremio.sabot.driver.SmartOp - Closing operator: 1:0:7
2023-12-29 14:20:55,807 [1a712ab5-6287-fb08-5cbc-852f6947a000:foreman] DEBUG c.d.exec.work.foreman.AttemptManager - 1a712ab5-6287-fb08-5cbc-852f6947a000: cleaning up.
2023-12-29 14:20:55,807 [1a712ab5-6287-fb08-5cbc-852f6947a000:foreman] ERROR c.d.exec.work.foreman.AttemptManager - ExecutionSetupException: One or more nodes lost connectivity during query. Identified nodes were [dremioHost:0].
com.dremio.common.exceptions.UserException: ExecutionSetupException: One or more nodes lost connectivity during query. Identified nodes were [dremioHost:0].
at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:885)
at com.dremio.exec.work.foreman.AttemptManager$AttemptResult.close(AttemptManager.java:680)
at com.dremio.exec.work.foreman.AttemptManager.moveToState(AttemptManager.java:874)
at com.dremio.exec.work.foreman.AttemptManager.access$2400(AttemptManager.java:96)
at com.dremio.exec.work.foreman.AttemptManager$StateSwitch.processEvent(AttemptManager.java:925)
at com.dremio.exec.work.foreman.AttemptManager$StateSwitch.processEvent(AttemptManager.java:917)
at com.dremio.common.EventProcessor.processEvents(EventProcessor.java:105)
at com.dremio.common.EventProcessor.sendEvent(EventProcessor.java:63)
at com.dremio.exec.work.foreman.AttemptManager$StateSwitch.addEvent(AttemptManager.java:920)
at com.dremio.exec.work.foreman.AttemptManager.addToEventQueue(AttemptManager.java:930)
at com.dremio.exec.work.foreman.AttemptManager$CompletionListenerImpl.failed(AttemptManager.java:231)
at com.dremio.exec.maestro.FragmentTracker.checkAndUpdateFirstError(FragmentTracker.java:307)
at com.dremio.exec.maestro.FragmentTracker.handleFailedNodes(FragmentTracker.java:203)
at com.dremio.exec.maestro.FragmentTracker$ExecutorNodeStatusListener.nodesUnregistered(FragmentTracker.java:348)
at com.dremio.service.coordinator.LocalExecutorSetService$1$1.nodesUnregistered(LocalExecutorSetService.java:98)
at com.dremio.service.coordinator.AbstractServiceSet.nodesUnregistered(AbstractServiceSet.java:39)
at com.dremio.service.coordinator.zk.ZKServiceSet.updateEndpoints(ZKServiceSet.java:127)
at com.dremio.service.coordinator.zk.ZKServiceSet.access$000(ZKServiceSet.java:39)
at com.dremio.service.coordinator.zk.ZKServiceSet$EndpointListener.cacheChanged(ZKServiceSet.java:53)
at org.apache.curator.x.discovery.details.ServiceCacheImpl$2.apply(ServiceCacheImpl.java:197)
at org.apache.curator.x.discovery.details.ServiceCacheImpl$2.apply(ServiceCacheImpl.java:193)
at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
at org.apache.curator.x.discovery.details.ServiceCacheImpl.childEvent(ServiceCacheImpl.java:191)
at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:538)
at org.apache.curator.framework.recipes.cache.PathChildrenCache$5.apply(PathChildrenCache.java:532)
at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:100)
at org.apache.curator.shaded.com.google.common.util.concurrent.DirectExecutor.execute(DirectExecutor.java:30)
at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:92)
at org.apache.curator.framework.recipes.cache.PathChildrenCache.callListeners(PathChildrenCache.java:530)
at org.apache.curator.framework.recipes.cache.EventOperation.invoke(EventOperation.java:35)
at org.apache.curator.framework.recipes.cache.PathChildrenCache$9.run(PathChildrenCache.java:808)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: com.dremio.common.exceptions.ExecutionSetupException: One or more nodes lost connectivity during query. Identified nodes were [dremioHost:0].
at com.dremio.exec.maestro.FragmentTracker.handleFailedNodes(FragmentTracker.java:204)
… 27 common frames omitted
2023-12-29 14:20:55,811 [e3 - 1a712ab5-6287-fb08-5cbc-852f6947a000:frag:3:1] DEBUG c.d.exec.compile.ClassTransformer - Done compiling (bytecode size=14.9 KiB, time:16 millis).
2023-12-29 14:20:55,820 [Curator-ServiceCache-0] DEBUG c.d.exec.maestro.FragmentTracker - node completed for query 1a712ab5-6287-fb08-5cbc-852f6947a000 from node dremioHost:45679, pending nodes
2023-12-29 14:20:55,829 [Curator-ServiceCache-0] DEBUG c.d.exec.work.foreman.AttemptManager - 1a712ab5-6287-fb08-5cbc-852f6947a000: State change requested FAILED → COMPLETED
2023-12-29 14:20:55,829 [Curator-ServiceCache-0] WARN c.d.exec.work.foreman.AttemptManager - Dropping request to move to COMPLETED state as query 1a712ab5-6287-fb08-5cbc-852f6947a000 is already at FAILED state (which is terminal).

The dataset queries a number of folders of JSON data on an S3 bucket. Other queries on those S3 folders work without issue.

WITH ordered_basket_execs AS (
SELECT e1.dir0 as “BasketDate”,
case
when foMisc9.message.body.FOMisc9 is not null then foMisc9.message.body.FOMisc9
else e1.message.body.BasketName
end as BasketName,
e1.message.body.OrderId, e1.message.body.ClOrdID, e1.message.body.TransactTime, e1.message.body.NoAllocs[0].AllocAccount as “AllocAccount”,
e1.message.body.SecurityDesc, e1.message.body.OrdType[1] as “OrdType”, e1.message.body.TimeInForce[1] as “TimeInForce”,
newOrder.message.body.Text as “Text”, e1.message.body.CumQty, e1.message.body.AvgPx, e1.message.body.Price, e1.message.body.Currency, e1.message.body.OrderQty, e1.message.body.OrdStatus[0] as “OrdStatus”,
e1.message.body.SecurityExchange,
ROW_NUMBER() OVER (PARTITION BY e1.message.body.OrderId ORDER BY Right(e1.message.body.ExecID,20) DESC) AS rn
FROM s3Bucket.execReportsByDateGrouped e1
LEFT JOIN s3Bucket.execReportsByDateGrouped foMisc9
ON e1.message.body.OrderId = foMisc9.message.body.OrderId
AND foMisc9.message.body.FoMisc9 is not null
LEFT JOIN s3Bucket.ordersByDateGrouped newOrder
ON TO_DATE(newOrder.dir0, ‘YYYYMMDD’, 0) >= CURRENT_DATE - INTERVAL ‘5’ DAY
and e1.message.body.ClOrdId = newOrder.message.body.ClOrdId
where TO_DATE(e1.dir0, ‘YYYYMMDD’, 0) >= CURRENT_DATE - INTERVAL ‘5’ DAY
and ((e1.message.body.BasketName is not null) or (foMisc9.message.body.FOMisc9 is not null))
)
SELECT * FROM ordered_basket_execs WHERE rn = 1
AND OrdStatus!=‘4’

balaji.ramaswamy · December 30, 2023, 11:38pm

@mildlyth I remember seeing an issue similar on JSON files. Can you please look at server.out and see if there an error pid file path printed like /some/path/hs_err_<pid> and if yes, can you please attach server.out and that error file?

mildlyth · December 31, 2023, 2:12pm

dremio_logs.zip (561.3 KB)

@balaji.ramaswamy thanks for the reply. No sign of a pid file. I’ve attached server.out and server.log. In this scenario I started the server at 13:53, and attempted to run a select on an s3 json data source - just a select * on the raw dataset, the query failed with the same error.

balaji.ramaswamy · January 1, 2024, 7:13pm

@mildlyth I do see the below

Dremio is exiting. Node lost its master status.
pure virtual method called
terminate called without an active exception

Can you please send me the below

GC logs when the executor went unavailable
/var/log/messages covering the time this query ran
server.log as the one attached only starts at 13:55 while Dremio went down at 13:53

server.log first entry

2023-12-31 13:55:28,977 [e3] DEBUG c.d.scheduler.cfs.MonitoredThreadMQ - Thread 0 got a request from thread 3 asking for work

Dremio start time

Sun Dec 31 13:53:55 UTC 2023 Starting dremio on dremiohost

I want to rule out GC pause, JVM crash and/or oom-kill

Thanks
Bali

mildlyth · January 2, 2024, 10:56am

Hi @balaji.ramaswamy I’ve recreated from scratch. Server starts at 10:44 and I attempt to run the reflection at 10:46:10, it fails after 1:22. I do not have permissions to access /var/log/messages , UPDATE: I have attached the relevant section of the messages file
dremio_logs2.zip (1.1 MB)
messages_sample.zip (781 Bytes)

balaji.ramaswamy · January 14, 2024, 1:10am

@mildlyth In the current logs you sent,

server.out does not have any errors
server.log has only one DEBUG message coming from turning on c.d.s.namespace.NamespaceServiceImpl , Do you have any other DEBUG’s turned on? Can we first diable all DEBUG’s?
server.gc does indicate a problem

2024-01-02T10:46:53.503+0000: 163.556: [Full GC (Allocation Failure) 2024-01-02T10:46:53.503+0000: 163.556: [Class Histogram (before full gc): 
 num     #instances         #bytes  class name
----------------------------------------------
   1:     366713428     5867414848  com.dremio.exec.expr.fn.FunctionErrorContextImpl
   2:        287897     5663089768  [Ljava.lang.Object;
   3:         86574      139835048  [I
   4:        568877       53492872  [B
   5:        294956       28379104  [C

There is a Full GC pause for 30 seconds that will cause the executor to restart

What version of Dremio is this?

mildlyth · February 7, 2024, 3:31pm

Apologies for delay - we were running dremio-community-20.1.0-202202061055110045-36733c65. We installed a new version which has not had this issue, but am now experiencing different problems (new post raised!)

balaji.ramaswamy · February 14, 2024, 11:59pm

@mildlyth

Thanks for the update, this issue is resolved in a later version, I will answer your new post shortly

Thanks
Bali

Topic		Replies	Views
Error - One or more nodes lost connectivity during query	1	950	August 26, 2021
Query failing because of ZK connection SUSPENDED, RECONNECT results in master killing the job	16	3714	November 11, 2020
ExecutionSetupException: One or more nodes lost connectivity during query during reflection creation	1	128	May 13, 2024
One or more nodes lost connectivity during query	13	3045	November 4, 2019
ExecutionSetupException: One or more nodes lost connectivity during query	4	1812	December 8, 2020

Previously stable dataset has started failing with "One or more nodes lost connectivity during query"

Related Topics