Dremio auto shutdown

Hi,
I am still struggling with Dremio stability: it shuts down on its own and I am not able to access the GUI.

2019-11-20 05:31:01,148 [222b3083-e945-9936-0b2d-c8ae5cddf000:foreman] INFO c.d.e.p.s.h.commands.FragmentStarter - User Error Occurred [ErrorId: a52310dd-4d25-4362-adc9-c647ce78d216]
com.dremio.common.exceptions.UserException: Exceeded timeout (5000) while waiting after sending work fragments to remote nodes. Sent 1 and only heard response back from 0 nodes.
at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:776) ~[dremio-common-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at com.dremio.exec.planner.sql.handlers.commands.FragmentStarter.startFragments(FragmentStarter.java:157) [dremio-sabot-kernel-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at com.dremio.exec.planner.sql.handlers.commands.FragmentStarter.start(FragmentStarter.java:80) [dremio-sabot-kernel-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at com.dremio.exec.planner.sql.handlers.commands.AsyncCommand.startFragments(AsyncCommand.java:98) [dremio-sabot-kernel-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at com.dremio.exec.work.foreman.AttemptManager.run(AttemptManager.java:332) [dremio-sabot-kernel-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_222]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_222]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_222]
2019-11-20 05:31:01,148 [222b3083-b764-ad0e-427f-5ad6b89c1200:foreman] INFO c.d.e.p.s.h.commands.FragmentStarter - User Error Occurred [ErrorId: 00d35199-3b9c-402d-a2ce-76c1931481bf]
com.dremio.common.exceptions.UserException: Exceeded timeout (5000) while waiting after sending work fragments to remote nodes. Sent 1 and only heard response back from 0 nodes.
at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:776) ~[dremio-common-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at com.dremio.exec.planner.sql.handlers.commands.FragmentStarter.startFragments(FragmentStarter.java:157) [dremio-sabot-kernel-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at com.dremio.exec.planner.sql.handlers.commands.FragmentStarter.start(FragmentStarter.java:80) [dremio-sabot-kernel-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at com.dremio.exec.planner.sql.handlers.commands.AsyncCommand.startFragments(AsyncCommand.java:98) [dremio-sabot-kernel-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at com.dremio.exec.work.foreman.AttemptManager.run(AttemptManager.java:332) [dremio-sabot-kernel-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_222]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_222]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_222]

But in Kubernetes, all pods are in the Running state.

After restarting the Dremio master node, it works.

I need a solution. I can't keep restarting the master pod, as this is a production system.

@Vikash_Singh

Please send us the complete server.log from the master coordinator. A few questions:

#1 Do you have a second coordinator other than the master? If yes, we do not recommend that.
#2 Have you by any chance provisioned all the OS RAM to Dremio? If yes, it could be the oom-killer that is killing Dremio. You need to leave at least 4 GB to the OS.
#3 Have you set heap on the coordinator explicitly? If you have set MAX, then everything minus 2 GB goes to heap, and you may end up in full GC cycles. Check the Dremio master pod log to see whether there are Full GCs.

Hello @balaji.ramaswamy,

  1. Do you have a second coordinator other than the master? --> No
  2. Have you by any chance provisioned all the OS RAM?
    I have 128 GB of RAM, of which only 116 GB is assigned to Dremio; the rest is free for the OS.
  3. Have you set heap on the coordinator explicitly?

Yes, these are the values in the Dremio env:

DREMIO_MAX_HEAP_MEMORY_SIZE_MB=10240
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=51200
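
For reference, the numbers above can be added up against the "leave at least 4 GB to the OS" guidance. This is only a sketch of the arithmetic: the function name `memory_summary` is made up for illustration, and it assumes the 116 GB figure is the total assigned to Dremio on the 128 GB host. A JVM also uses some memory beyond heap and direct memory (metaspace, thread stacks), so the real footprint is somewhat higher than the sum shown.

```shell
# Summarize a Dremio memory budget from the values quoted in the post.
# Arguments: heap MB, direct MB, total host RAM GB, RAM assigned to Dremio GB.
memory_summary() {
  heap_mb="$1"; direct_mb="$2"; total_gb="$3"; assigned_gb="$4"
  jvm_mb=$((heap_mb + direct_mb))
  echo "heap+direct: ${jvm_mb} MB ($((jvm_mb / 1024)) GB)"
  echo "left for OS: $((total_gb - assigned_gb)) GB"
}

# Values from the post above.
memory_summary 10240 51200 128 116
```

With these inputs, heap plus direct memory comes to 61440 MB (60 GB), and 12 GB is left outside the 116 GB assigned to Dremio.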

Please find attached the logs for the above error.

dremio-master-0.zip (1.1 MB)

@Vikash_Singh

The log file you sent has no Dremio startup or shutdown. Do these logs rollover when you restart Dremio?

Yes, they might. Since it's a production system, I can't stop the server and debug. I will provide the logs the next time I hit the same issue; I run into it roughly every week.

Can you also guide me on how to copy the full Dremio log from the Kubernetes pod to the local filesystem? The following command appears to hang for me:

kubectl logs -f dremio-master-0 > dremio-master-0.log
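
A likely reason the command appears to hang: `-f` tells `kubectl logs` to keep following the stream, so the redirect never finishes until it is interrupted. The sketch below wraps three standard kubectl invocations in small helper functions (the function names are made up for illustration). Whether `kubectl logs` contains everything that goes to server.log depends on how the deployment configures logging, and the in-pod file path for `kubectl cp` is not given in this thread, so it is left as a caller-supplied argument.

```shell
# Dump the current container log once. Without -f the command exits
# when it reaches the end of the log.
dump_pod_log() {
  kubectl logs "$1" > "$2"
}

# Dump the log of the previous container instance, which is the one
# that matters after the pod's container has been restarted.
dump_previous_pod_log() {
  kubectl logs --previous "$1" > "$2"
}

# Copy a file out of the pod. The remote path is an assumption the
# caller must supply; kubectl cp also needs `tar` inside the container.
copy_pod_file() {
  kubectl cp "$1:$2" "$3"
}
```

For example, `dump_pod_log dremio-master-0 dremio-master-0.log` writes the current log and returns, and `dump_previous_pod_log dremio-master-0 dremio-master-0.previous.log` captures output from before the last container restart, if one exists.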

@balaji.ramaswamy I am still seeing the Dremio UI shut down on its own, and the master pod shows status 0/1.
I am attaching the master log for this issue: dremio-master-0.zip (531.4 KB)

This is quite urgent for me. I am not able to run Dremio in production: every day I have to manually restart the Dremio master pod. Please suggest a permanent solution.

@Vikash_Singh

The attached logfile has no entries for Dremio shutting down or starting up. Would you be able to send the logfile that contains the keyword “KVstore” or “localhost”?

Thanks
@balaji.ramaswamy
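
One way to check a log file for these keywords before uploading it is a case-insensitive grep. This is a sketch, not an official procedure: the function name `find_lifecycle_lines` is made up, and "Full GC" is included only because it was mentioned earlier in the thread; whether GC events show up with that exact text depends on the JVM's GC logging settings.

```shell
# Print matching lines, prefixed with their line numbers, for the
# keywords mentioned in this thread. -i makes the match case-insensitive.
# grep returns a non-zero exit status when nothing matches.
find_lifecycle_lines() {
  grep -n -i -E 'KVstore|localhost|Full GC' "$1"
}
```

Running `find_lifecycle_lines server.log` on each rolled-over log file shows which one covers the startup or shutdown window.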

Hello, @balaji.ramaswamy

Is there some way to force stop/start worker nodes programmatically?

@Diego It depends on the deployment:

  • K8s and YARN should restart automatically
  • Standalone VMs would require a custom script to start on failure

Thanks
Bali
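
For the standalone VM case, a custom script could look roughly like the sketch below. It is deliberately generic: `restart_if_down` is a made-up helper that takes a health-check command and a restart command from the caller, and nothing in it is specific to Dremio.

```shell
# Run a health check; if it fails, run the restart command.
# Both commands are supplied by the caller as strings.
restart_if_down() {
  check_cmd="$1"; restart_cmd="$2"
  if sh -c "$check_cmd" >/dev/null 2>&1; then
    echo "up"
  else
    echo "down, restarting"
    sh -c "$restart_cmd"
  fi
}
```

A hypothetical use from cron would be `restart_if_down "curl -sf http://localhost:9047/" "systemctl restart dremio"`. The URL assumes the web UI is on the default port 9047, and the service name `dremio` is an assumption about how the node was installed; adjust both to the actual setup.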

@balaji.ramaswamy I use the Dremio AWS version. I’m interested in scheduling an hour to stop the nodes and an hour to start them again.

@balaji.ramaswamy, I noticed that when I stop the Dremio service, the executor nodes are also stopped after a few minutes. I set the nodes to auto start so that they come up together when the service is turned on. But the second engine does not come up.

@Diego The second engine is a preview engine that will be used only for queries run in preview mode. Try hitting preview instead of run and see if that helps.

@balaji.ramaswamy When I click on preview, do the nodes start up automatically? I guess not.

@Diego If you hit preview on a query, the engine should start up automatically.