ChannelClosedException - Channel Closed

Hey folks!!

I’m experimenting with Dremio OSS. My setup is as follows:

Dremio is deployed through Kubernetes. I have 1 pod for master, 1 for coordinator and 1 for executor.
I’m port-forwarding 31010 to my localhost, and trying to query it through Power BI Desktop (Dremio Software connector).
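
For context, the port-forward looks something like this (dremio-client is the client service the Helm chart creates by default; your service name may differ):

kubectl port-forward service/dremio-client 31010:31010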

My data is on S3 > Hive > Dremio. I can query it successfully through the web UI.

When querying it in Power BI, everything works fine initially, but as soon as I start adding new tables or querying something new, Dremio stops responding and I can’t tell why.

The logs show an ERROR c.d.exec.work.foreman.AttemptManager - ChannelClosedException: exception. I’m having a hard time figuring out why this happens; can you help me?

I’m attaching a few logs.
extract-2023-12-14T14_02_20.929Z.csv.zip (31.6 KB)

Folks, could this be related to the Apache Drill timeout problem? Looking around, I see the default value is 5 seconds. Can we increase this value by any chance?

When the error I described above happens, what I see in Power BI is the message below:

[screenshot: Power BI error message]

Would love any guidance here.

@kleinkauff Let’s start with the profile of the job that failed with the ChannelClosedException. Then we need to check whether one of the servers hit a ZK SUSPENDED (i.e., the master did not respond) or whether there was a Full GC.

Can you please send the server.log straight from the coordinator instead of the csv file?

It would also be good to persist the server.log and gc.log by moving them to a persistent disk.

JVM settings

By default, Dremio attempts to tune the JVM according to a set of default rules. This applies to all deployment types. These defaults, while often sufficient for testing, may not fit your needs when moving into production.

Dremio calculates the JVM heap (-Xmx) and direct memory (-XX:MaxDirectMemorySize) based on the memory setting in the values.yaml at the root of the GitHub repository. For example, you will find settings like the following for the coordinator (the executors have an equivalent block, shown further below):

# Dremio Coordinator
coordinator:
  # CPU & Memory
  # Memory allocated to each coordinator, expressed in MB.
  # CPU allocated to each coordinator, expressed in CPU cores.
  cpu: 14
  memory: 107374

We can see here that 14 CPU cores and 100GB of memory are configured for this coordinator. Dremio would typically assign 16GB to the JVM heap and the remaining 84GB to direct memory (for the executor, the default JVM heap is 8GB). It is important to note that if you wish to configure the JVM max heap size yourself, you must always set the direct memory too.

Otherwise Dremio will still apply its default direct memory calculation, which can result in the pod attempting to over-allocate memory.
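
For completeness, the executor block in the same values.yaml follows the same pattern (the values here are illustrative, not a recommendation):

# Dremio Executor
executor:
  # CPU & Memory
  # Memory allocated to each executor, expressed in MB.
  # CPU allocated to each executor, expressed in CPU cores.
  cpu: 14
  memory: 107374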

JVM settings can simply be applied under the extraStartParams section, for example:

  extraStartParams: >-
     -XX:+UseG1GC
     -XX:+AlwaysPreTouch
     -Xms31g
     -Xmx31g
     -XX:MaxDirectMemorySize=10g
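
As a rough sanity check on the sketch above: 31GB of heap plus 10GB of direct memory, plus metaspace and other JVM overhead, means the pod’s memory setting should comfortably exceed that total (around 45GB or more); otherwise you recreate the over-allocation problem described earlier.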

While you are configuring these settings, you may as well apply everything you need in one pass, for example the GC logging settings:

 -Xloggc:/opt/dremio/data/gc.log
 -XX:NumberOfGCLogFiles=20
 -XX:GCLogFileSize=100m
 -XX:+PrintGCDetails
 -XX:+PrintGCTimeStamps
 -XX:+PrintGCDateStamps
 -XX:+PrintAdaptiveSizePolicy
 -XX:+UseGCLogFileRotation
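
One caveat: the flags above are the JDK 8 GC logging options. If your Dremio image runs on JDK 11 or later, these flags were removed, and the unified logging equivalent would look something like:

 -Xlog:gc*:file=/opt/dremio/data/gc.log:time,uptime:filecount=20,filesize=100m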

Other JVM settings are discussed in another tech note (see links below).

Note: there is a common section for extraStartParams towards the end of the values.yaml. If you duplicate settings there, you may run into errors starting the pods.

Persisting logs

The logs are typically not written to a persistent volume, so they are ephemeral. Unless users are shipping logs to an aggregation endpoint, the first thing they often do is redirect them. Again, using the same section as above, the most common form is:

  -Ddremio.log.path=/opt/dremio/data/log
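
Putting it together, this flag goes in the same extraStartParams block as the JVM settings, for example:

  extraStartParams: >-
     -Xms31g
     -Xmx31g
     -XX:MaxDirectMemorySize=10g
     -Ddremio.log.path=/opt/dremio/data/log

The /opt/dremio/data path is used throughout this note because it is typically backed by the chart’s persistent volume, which is what lets the GC logs, heap dumps, and error files survive a pod restart.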

The same applies to heap dumps and JVM error files:

 -XX:HeapDumpPath=/opt/dremio/data
 -XX:ErrorFile=/opt/dremio/data/hs_err_pid%p.log
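
Note that -XX:HeapDumpPath only takes effect when a dump is actually triggered, so you will usually want to enable dumps on out-of-memory errors as well:

 -XX:+HeapDumpOnOutOfMemoryError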