Dremio master unhealthy

We are running Dremio 4.5 on Kubernetes. The master pod becomes unhealthy and does not restart automatically; a manual replace of the pod is required to get it back into the RUNNING state.

Can you suggest some ways to debug this?

gclogs.zip (2.9 MB) Attaching the GC logs taken after the restart.

@unni
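For the pod itself, it is worth capturing the basic Kubernetes state before replacing it. Something along these lines should show why the kubelet marks it unhealthy (the pod name, namespace and single-container layout are assumptions, adjust them to your chart):

  # Probe failures, OOMKills and restart counts
  kubectl describe pod dremio-master-0 -n dremio
  # Coordinator output from the previous container instance, if it restarted
  kubectl logs dremio-master-0 -n dremio --previous
  # Server log and GC logs written by the coordinator
  kubectl exec dremio-master-0 -n dremio -- ls -l /var/log/dremio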

Looking at the class histograms, it looks like there is one query plan that is filling up the coordinator heap. Is there a heap dump generated under “/var/log/dremio”? It should have an extension .hprof. If you do find one, load it into VisualVM and run the OQL below; it will give you the job IDs of the queries running at that time. Open those profiles and see if you can find one with a gigantic plan.

select map(heap.objects('com.dremio.exec.work.protector.EnterpriseForemenWorkManager$EnterpriseAttemptManager'),
  function(it) {
    desc = " queryId " + it.queryIdString.value.toString();
    desc = desc + " queryState " + it.state.name.toString();
    desc = desc + " querySql " + it.profileTracker.queryDescription.toString();
    return desc;
  }
)

2021-03-03T01:32:17.792+0000: 46663.642: [Full GC (Allocation Failure) 2021-03-03T01:32:17.792+0000: 46663.642: [Class Histogram (before full gc): 
 num     #instances         #bytes  class name
----------------------------------------------
   1:      36450549     7799446624  [C
   2:     108970076     6974084864  com.dremio.exec.planner.sql.handlers.CachingRelMetadataProvider$CacheEntry
   3:     108969903     6974073792  com.dremio.exec.planner.sql.handlers.CachingRelMetadataProvider$CacheKey
   4:     109041219     5233978512  java.util.concurrent.ConcurrentHashMap$Node
   5:           781     2149100208  [Ljava.util.concurrent.ConcurrentHashMap$Node;
   6:      36449855     1166395360  java.lang.String
   7:      24366094     1012287680  [Ljava.lang.Object;
   8:      17821567      855435216  org.apache.calcite.rex.RexCall
   9:       5928054      806215344  org.apache.calcite.rel.rules.MultiJoin
  10:       5931519      711782280  org.apache.calcite.rel.logical.LogicalJoin
  11:      12418499      596087952  org.apache.calcite.rex.RexInputRef
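If there is no .hprof under /var/log/dremio, a dump can usually be captured manually the next time the heap fills up. A rough sketch, assuming kubectl access to the master pod and that jps/jmap are available in the image:

  # Find the Dremio server PID inside the container
  kubectl exec dremio-master-0 -n dremio -- jps -l
  # Dump live objects to a path the pod can write to (the file can be several GB)
  kubectl exec dremio-master-0 -n dremio -- jmap -dump:live,format=b,file=/var/log/dremio/manual.hprof <pid>
  # Copy it out for analysis in VisualVM
  kubectl cp dremio/dremio-master-0:/var/log/dremio/manual.hprof ./manual.hprof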

No heap dump was generated.

We will monitor this and comb through the profiles for any such query plans.
We are continuously firing queries on Dremio; could this be one of the reasons?

Also, the job max age (days) setting is 1.

@unni

Job max age days should not cause this. Here are a few questions:

  • How many sources do you have?
  • What is the metadata refresh frequency on each?
  • How many concurrent queries are you firing?
  • How many cores on the coordinator?
  • Do your queries do an inline metadata refresh? Open a profile and see if a lot of time is spent in metadata, or open the planning tab and check whether time is spent in validation or convert to rel, or scroll down below “Final Physical Transformation” and look for PARTIAL_METADATA
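Also, if the coordinator heap starts climbing again before a dump is available, a live class histogram would confirm whether the same CachingRelMetadataProvider entries are growing. A sketch, assuming jcmd is in the image and <pid> is the Dremio server PID from jps:

  kubectl exec dremio-master-0 -n dremio -- jcmd <pid> GC.class_histogram | head -n 30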
How many sources do you have?
  -- 3 Postgres sources
What is the metadata refresh frequency on each?
  -- Fetch every one hour
How many concurrent queries are you firing?
  -- 30 concurrent requests, around 27,000 requests in total over a span of 2-3 hours
How many cores on the coordinator?
  -- 16 cores, 64 GB RAM (45 GB heap)
Do your queries do an inline metadata refresh?
  -- Planning time is not high. I cannot find the keyword PARTIAL_METADATA on the planning page