@kleinkauff Let us start with the profile of the job that failed with "channel closed". Then we need to see whether, on one of the servers, there was a ZK SUSPENDED as the master did not respond, or there was a Full GC.
Can you please send the server.log straight from the coordinator instead of the CSV file?
Also, it would be good to persist the server.log and GC.log by moving them to a persistent disk.
By default, Dremio will attempt to adjust the JVM according to some default rules. This applies to all deployment types. The default settings, while often sufficient for testing, may not fit your needs when moving into production.
Dremio calculates the JVM heap (-Xmx) and direct (-XX:MaxDirectMemorySize) memory based on the memory setting in values.yaml, found in the root of the GitHub repository. For example, you will find the following settings for both coordinator and executors:
# Dremio Coordinator
# CPU & Memory
# Memory allocated to each coordinator, expressed in MB.
# CPU allocated to each coordinator, expressed in CPU cores.
We can see here that there are 14 CPU cores and 100GB configured for this coordinator. For the coordinator, Dremio would typically assign 16GB to the JVM heap and the remaining 84GB to direct memory. For the executor, the JVM heap default is 8GB. It is important to note that if you configure the JVM max heap size, you must always set the direct memory too.
Otherwise Dremio will still try to configure the default direct memory, which can result in the pod attempting to over-allocate memory.
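As a rough sketch of what the coordinator block described above might look like: the `cpu` and `memory` key names and the exact MB figure are my assumptions here (100GB written as 100 * 1024 MB), so check them against your own values.yaml.

```yaml
# Illustrative only: key names and the MB value are assumptions,
# not copied from the actual chart.
coordinator:
  # CPU allocated to each coordinator, expressed in CPU cores.
  cpu: 14
  # Memory allocated to each coordinator, expressed in MB.
  memory: 102400   # assumed: 100GB = 100 * 1024 MB
  # With the defaults described above, this would split roughly as:
  #   heap   = 16GB
  #   direct = 100GB - 16GB = 84GB
```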
JVM settings can be applied under the extraStartParams section, for example:
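Something along these lines should work for the 100GB coordinator above. This is a sketch, assuming extraStartParams sits under the coordinator block and takes a folded string of JVM flags; the sizes simply restate the 16GB heap / 84GB direct split.

```yaml
# Sketch only: placement of extraStartParams under coordinator is an assumption.
coordinator:
  extraStartParams: >-
    -Xmx16g
    -XX:MaxDirectMemorySize=84g
```

Note that heap and direct memory are set together, so the two add up to the memory given to the pod.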
While you are configuring these settings, you may as well apply all the settings you need, for example GC logging settings.
Other JVM settings are discussed in another tech note (see links below).
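For GC logging, a minimal sketch might look like the following. The flags are the standard HotSpot JDK 8 GC logging options (on JDK 9+ they are replaced by `-Xlog:gc*`); the log path, file count, and file size are placeholders I picked, assuming /opt/dremio/data is where your persistent volume is mounted.

```yaml
# Sketch only: path and rotation sizes are assumed values, adjust to your setup.
coordinator:
  extraStartParams: >-
    -Xmx16g
    -XX:MaxDirectMemorySize=84g
    -Xloggc:/opt/dremio/data/gc.log
    -XX:+PrintGCDetails
    -XX:+PrintGCDateStamps
    -XX:+UseGCLogFileRotation
    -XX:NumberOfGCLogFiles=5
    -XX:GCLogFileSize=4000k
```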
Note: there is a common section for extraStartParams towards the end of values.yaml. If you duplicate settings, you may run into errors starting the pods.
The logs are typically not written to a persistent volume, so they are ephemeral. Often the first thing users do is redirect them, unless they are using a log aggregation endpoint. Again, using the same section as above, the most common form would be:
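A sketch of the log redirect, assuming the `dremio.log.path` system property controls the log directory and that /opt/dremio/data is the mount point of the persistent volume (both worth verifying against your deployment):

```yaml
# Sketch only: property name and target directory are assumptions.
coordinator:
  extraStartParams: >-
    -Ddremio.log.path=/opt/dremio/data/log
```

With this in place server.log should survive a pod restart, which is what we need for the investigation here.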
The same applies to heap dumps and JVM error files.
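For heap dumps and JVM error files, the standard HotSpot flags can go in the same section. Again a sketch: the target directory is an assumed persistent mount, and `%p` in the error file name expands to the process ID.

```yaml
# Sketch only: /opt/dremio/data is an assumed persistent volume mount.
coordinator:
  extraStartParams: >-
    -XX:+HeapDumpOnOutOfMemoryError
    -XX:HeapDumpPath=/opt/dremio/data
    -XX:ErrorFile=/opt/dremio/data/hs_err_pid%p.log
```

Heap dumps can be as large as the configured heap, so make sure the volume has room for at least one.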