Dremio auto shutdown

Hi,
I am still struggling with Dremio stability. It shuts down on its own and I am not able to access the GUI.

2019-11-20 05:31:01,148 [222b3083-e945-9936-0b2d-c8ae5cddf000:foreman] INFO c.d.e.p.s.h.commands.FragmentStarter - User Error Occurred [ErrorId: a52310dd-4d25-4362-adc9-c647ce78d216]
com.dremio.common.exceptions.UserException: Exceeded timeout (5000) while waiting after sending work fragments to remote nodes. Sent 1 and only heard response back from 0 nodes.
at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:776) ~[dremio-common-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at com.dremio.exec.planner.sql.handlers.commands.FragmentStarter.startFragments(FragmentStarter.java:157) [dremio-sabot-kernel-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at com.dremio.exec.planner.sql.handlers.commands.FragmentStarter.start(FragmentStarter.java:80) [dremio-sabot-kernel-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at com.dremio.exec.planner.sql.handlers.commands.AsyncCommand.startFragments(AsyncCommand.java:98) [dremio-sabot-kernel-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at com.dremio.exec.work.foreman.AttemptManager.run(AttemptManager.java:332) [dremio-sabot-kernel-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_222]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_222]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_222]
2019-11-20 05:31:01,148 [222b3083-b764-ad0e-427f-5ad6b89c1200:foreman] INFO c.d.e.p.s.h.commands.FragmentStarter - User Error Occurred [ErrorId: 00d35199-3b9c-402d-a2ce-76c1931481bf]
com.dremio.common.exceptions.UserException: Exceeded timeout (5000) while waiting after sending work fragments to remote nodes. Sent 1 and only heard response back from 0 nodes.
at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:776) ~[dremio-common-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at com.dremio.exec.planner.sql.handlers.commands.FragmentStarter.startFragments(FragmentStarter.java:157) [dremio-sabot-kernel-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at com.dremio.exec.planner.sql.handlers.commands.FragmentStarter.start(FragmentStarter.java:80) [dremio-sabot-kernel-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at com.dremio.exec.planner.sql.handlers.commands.AsyncCommand.startFragments(AsyncCommand.java:98) [dremio-sabot-kernel-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at com.dremio.exec.work.foreman.AttemptManager.run(AttemptManager.java:332) [dremio-sabot-kernel-4.0.2-201910020123580864-a98a0b9.jar:4.0.2-201910020123580864-a98a0b9]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_222]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_222]
at java.lang.Thread.run(Thread.java:748) [na:1.8.0_222]

But in Kubernetes, all the pods are in the Running state.

After restarting the Dremio master node, it works.

I need a solution. I can't always restart the master pods, as it is a production-ready system.

@Vikash_Singh

Send us the complete server.log from the master coordinator. A few questions:

#1 Do you have a second coordinator other than the master? If yes, we do not recommend that.
#2 Have you by any chance provisioned all the OS RAM to Dremio? If yes, it could be the oom-killer that is killing Dremio. You need to leave at least 4 GB to the OS.
#3 Have you set heap on the coordinator explicitly? If you have set MAX, then everything minus 2 GB goes to heap and may end up in full GC cycles. Check the Dremio master pod log to see if you see Full GCs.
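Questions #2 and #3 can be checked from outside the pod. The following is only a sketch under assumptions: the pod name `dremio-master-0` is taken from this thread, the `default` namespace is a guess, and it assumes GC messages are written to the container output (this depends on how GC logging is configured). The function names are made up for illustration.

```shell
# Hypothetical defaults; adjust the pod name and namespace to your deployment.
POD="${POD:-dremio-master-0}"
NAMESPACE="${NAMESPACE:-default}"

# Print why the previous container instance ended.
# Kubernetes reports "OOMKilled" here when the container was killed for memory.
last_exit_reason() {
  kubectl get pod "$POD" -n "$NAMESPACE" \
    -o jsonpath='{.status.containerStatuses[*].lastState.terminated.reason}'
}

# Count lines containing "Full GC" in the output of the previous container instance.
# ASSUMPTION: GC logging is enabled and goes to the container's stdout/stderr.
count_full_gc() {
  kubectl logs "$POD" -n "$NAMESPACE" --previous | grep -c "Full GC"
}
```

Running `last_exit_reason` right after a restart shows whether the last termination was memory related; `count_full_gc` gives a rough idea of whether the coordinator was stuck in full GC cycles before it went down.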

Hello @balaji.ramaswamy,

  1. Do you have a second coordinator other than the master? --> No
  2. Have you by any chance provisioned all the OS RAM?
    I have 128 GB of RAM, of which only 116 GB is assigned to Dremio. The rest is free for the OS.
  3. Have you set heap on the coordinator explicitly?

Yes, these are the values in the Dremio env:

DREMIO_MAX_HEAP_MEMORY_SIZE_MB=10240
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=51200

Please find the attached logs for the above error.

dremio-master-0.zip (1.1 MB)
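As a quick sanity check on the numbers in this reply, the arithmetic can be written out. This is only a sketch: it treats 1 GB as 1024 MB and takes the 128 GB and 116 GB figures at face value. It does not by itself rule out an oom-killer event, since the JVM also uses memory outside heap and direct memory.

```shell
# Values quoted from the reply above.
HEAP_MB=10240      # DREMIO_MAX_HEAP_MEMORY_SIZE_MB
DIRECT_MB=51200    # DREMIO_MAX_DIRECT_MEMORY_SIZE_MB
ASSIGNED_GB=116    # memory assigned to Dremio
TOTAL_GB=128       # RAM on the machine

JVM_MB=$((HEAP_MB + DIRECT_MB))     # explicit heap + direct memory limits
JVM_GB=$((JVM_MB / 1024))
OS_GB=$((TOTAL_GB - ASSIGNED_GB))   # what remains for the OS

echo "heap + direct = ${JVM_MB} MB (${JVM_GB} GB)"
echo "left for the OS = ${OS_GB} GB"
```

So the explicit heap and direct limits add up to 60 GB, and 12 GB remains for the OS, which is above the 4 GB minimum mentioned earlier in the thread.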

@Vikash_Singh

The log file you sent has no Dremio startup or shutdown. Do these logs rollover when you restart Dremio?

Yes, they might. Since it is a production system, I can't stop the server and debug. I will provide the logs the next time I get the same issue. I guess I face this type of issue about every week.

Can you also guide me on how to copy the full Dremio log from the Kubernetes pods to the local filesystem? The following command looks hung for me:

kubectl logs -f dremio-master-0 > dremio-master-0.log
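A likely reason the command above appears to hang is the `-f` flag: it follows the log stream and does not exit while the container keeps running. A minimal sketch of two alternatives follows. The namespace and the log directory inside the container are assumptions (check where your image writes `server.log`), and the function names are hypothetical.

```shell
# Hypothetical defaults; adjust to your deployment.
POD="${POD:-dremio-master-0}"
NAMESPACE="${NAMESPACE:-default}"
# ASSUMPTION: log directory inside the container; verify it for your image.
LOG_DIR="${LOG_DIR:-/var/log/dremio}"

save_pod_logs() {
  # Without -f, kubectl dumps what is available and exits.
  kubectl logs "$POD" -n "$NAMESPACE" > "$POD.log"
  # Output of the previous container instance, if the pod has restarted.
  kubectl logs "$POD" -n "$NAMESPACE" --previous > "$POD.previous.log" || true
}

copy_log_dir() {
  # Copies the whole log directory out of the pod (needs tar in the container).
  kubectl cp "$NAMESPACE/$POD:$LOG_DIR" "./$POD-logs"
}
```

Calling `save_pod_logs` captures the container output, and `copy_log_dir` pulls the log files themselves, which is closer to the complete server.log that was requested.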

@balaji.ramaswamy I am still facing the Dremio UI shutting down automatically, and I also see the master pod status as 0/1.
I am attaching the master log for the issue: dremio-master-0.zip (531.4 KB)

This is quite urgent for me. I am not able to run Dremio in production, and every day I have to manually restart the Dremio master pods. Please suggest a permanent solution.

@Vikash_Singh

The attached logfile has no entries on Dremio shutting down or starting up. Would you be able to send the logfile that has the keyword “KVstore” or “localhost”?

Thanks
@balaji.ramaswamy
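To find out which log file contains the requested keywords before attaching it, a simple search can help. This is only a sketch: the function name is made up, and the match is made case-insensitive as a precaution, since the exact capitalization in the log is not shown in this thread.

```shell
# Print matching lines, with line numbers, from the given log file.
# -i is an assumption to tolerate different capitalization of the keywords.
find_restart_lines() {
  grep -n -i -E 'KVstore|localhost' "$1"
}
```

For example, `find_restart_lines server.log` lists every line that mentions either keyword, so the file that covers a startup or shutdown can be identified and sent.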