Dremio Master Restarting Every 4-5 Hours

The Dremio master is restarting every 4–5 hours. I am noticing a lot of ZooKeeper-related messages in the logs when this occurs, but I am unsure what is happening when it goes into this state. It doesn't restart on its own: the Kubernetes readiness probe fails for long enough that Kubernetes eventually restarts the pod. Before that happens, we get a 503 when trying to access the Dremio UI. I am also noticing out-of-memory log entries.
Has anyone seen this type of issue before?

@Victor From the symptoms described, this looks like a Full GC pause. Are you able to send us the GC logs and server.log? If GC logging is not configured, the steps to enable it are below.

Kubernetes Deployment

V1 Helm charts (deprecated) - dremio-cloud-tools/charts/dremio at master · dremio/dremio-cloud-tools

Open dremio-master.yaml and dremio-executor.yaml under the templates directory. Add the following under the DREMIO_JAVA_EXTRA_OPTS section:

-Xloggc:/opt/dremio/data/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=5
-XX:GCLogFileSize=4000k
-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps
-XX:+PrintGCDateStamps
-XX:+PrintClassHistogramBeforeFullGC
-XX:+PrintClassHistogramAfterFullGC
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/opt/dremio/data
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:MaxGCPauseMillis=500
-XX:InitiatingHeapOccupancyPercent=25
-XX:+PrintAdaptiveSizePolicy
-XX:+PrintReferenceGC
-XX:ErrorFile=/opt/dremio/data/hs_err_pid%p.log

You must also change the Helm chart so that the following two values are added to the dremio-env file:

DREMIO_LOG_TO_CONSOLE=0
DREMIO_GC_LOG_TO_CONSOLE="no"

V2 Helm charts – dremio-cloud-tools/charts/dremio_v2 at master · dremio/dremio-cloud-tools

Open values.yaml and add the following under the appropriate section (executor and/or coordinator):

extraStartParams: >-
  -Xloggc:/opt/dremio/data/gc.log
  -XX:+UseGCLogFileRotation
  -XX:NumberOfGCLogFiles=5
  -XX:GCLogFileSize=4000k
  -XX:+PrintGCDetails
  -XX:+PrintGCTimeStamps
  -XX:+PrintGCDateStamps
  -XX:+PrintClassHistogramBeforeFullGC
  -XX:+PrintClassHistogramAfterFullGC
  -XX:+HeapDumpOnOutOfMemoryError
  -XX:HeapDumpPath=/opt/dremio/data
  -XX:+UseG1GC
  -XX:G1HeapRegionSize=32M
  -XX:MaxGCPauseMillis=500
  -XX:InitiatingHeapOccupancyPercent=25
  -XX:+PrintAdaptiveSizePolicy
  -XX:+PrintReferenceGC
  -XX:ErrorFile=/opt/dremio/data/hs_err_pid%p.log

You must also change the Helm chart so that the following two values are added to the dremio-env file:

DREMIO_LOG_TO_CONSOLE=0
DREMIO_GC_LOG_TO_CONSOLE="no"
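
Once GC logging is enabled and gc.log has been collected, a quick way to check the Full GC theory is to scan the log for long Full GC pauses and compare their timestamps with the times the pod became unresponsive. The following is only a rough sketch in Python, not a Dremio tool. It assumes the JDK 8 style line layout that these flags typically produce; the sample lines are made up for illustration, the function name and the 10-second threshold are arbitrary, and with the class histogram flags enabled a Full GC entry may be split across several lines, so the pattern may need adjusting for a real log.

```python
import re

# Assumed JDK 8 style layout for a Full GC entry, for example:
#   2021-03-01T10:15:42.123+0000: 15872.411: [Full GC (Allocation Failure)  15G->14G(16G), 48.7312456 secs]
# Real logs vary with JVM version and flags, so treat this pattern as an assumption.
FULL_GC = re.compile(
    r"^(?P<stamp>\S+): [\d.]+: \[Full GC.*?(?P<secs>\d+\.\d+) secs\]"
)


def long_full_gcs(lines, threshold_secs=10.0):
    """Return (timestamp, pause_seconds) for each Full GC at or above the threshold."""
    hits = []
    for line in lines:
        match = FULL_GC.search(line)
        if match is None:
            continue  # not a single-line Full GC entry
        secs = float(match.group("secs"))
        if secs >= threshold_secs:
            hits.append((match.group("stamp"), secs))
    return hits


# Illustrative sample lines (not taken from a real Dremio log).
sample = [
    "2021-03-01T10:15:40.001+0000: 15870.289: [GC pause (G1 Evacuation Pause) (young), 0.0312456 secs]",
    "2021-03-01T10:15:42.123+0000: 15872.411: [Full GC (Allocation Failure)  15G->14G(16G), 48.7312456 secs]",
    "2021-03-01T10:20:02.500+0000: 16132.788: [Full GC (Allocation Failure)  15G->15G(16G), 3.2000000 secs]",
]

for stamp, secs in long_full_gcs(sample):
    print(f"{stamp}  Full GC paused for {secs:.1f}s")
```

To run it against a real log, pass the open file instead of the sample list, for example `long_full_gcs(open("gc.log", errors="replace"))`. Pauses that line up with the 503s and the ZooKeeper messages would support the Full GC explanation.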