Dremio master unhealthy

We are running Dremio 4.5 on Kubernetes. The master pod becomes unhealthy and does not restart automatically; a manual replace of the pod is required to get it back into the RUNNING state.

Can you suggest some ways to debug this?

gclogs.zip (2.9 MB) Attaching the GC logs taken after the restart.

@unni
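For the pod itself, it is worth capturing the basic Kubernetes state before replacing it. Something along these lines should show why the kubelet marks it unhealthy (the pod name, namespace and single-container layout are assumptions, adjust them to your chart):

  # Probe failures, OOMKills and restart counts
  kubectl describe pod dremio-master-0 -n dremio
  # Coordinator output from the previous container instance, if it restarted
  kubectl logs dremio-master-0 -n dremio --previous
  # Server log and GC logs written by the coordinator
  kubectl exec dremio-master-0 -n dremio -- ls -l /var/log/dremio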

Looking at the class histograms, it looks like there is one query plan that is filling up the coordinator heap. Is there a heap dump generated under “/var/log/dremio”? It should have an extension .hprof. If you do find one, load it into VisualVM and run the OQL below; it will give you the job IDs of the queries running at that time. Open those profiles and see if you can find one with a gigantic plan.

select map(heap.objects('com.dremio.exec.work.protector.EnterpriseForemenWorkManager$EnterpriseAttemptManager'),
  function(it) {
    desc = " queryId " + it.queryIdString.value.toString();
    desc = desc + " queryState " + it.state.name.toString();
    desc = desc + " querySql " + it.profileTracker.queryDescription.toString();
    return desc;
  }
)

2021-03-03T01:32:17.792+0000: 46663.642: [Full GC (Allocation Failure) 2021-03-03T01:32:17.792+0000: 46663.642: [Class Histogram (before full gc): 
 num     #instances         #bytes  class name
----------------------------------------------
   1:      36450549     7799446624  [C
   2:     108970076     6974084864  com.dremio.exec.planner.sql.handlers.CachingRelMetadataProvider$CacheEntry
   3:     108969903     6974073792  com.dremio.exec.planner.sql.handlers.CachingRelMetadataProvider$CacheKey
   4:     109041219     5233978512  java.util.concurrent.ConcurrentHashMap$Node
   5:           781     2149100208  [Ljava.util.concurrent.ConcurrentHashMap$Node;
   6:      36449855     1166395360  java.lang.String
   7:      24366094     1012287680  [Ljava.lang.Object;
   8:      17821567      855435216  org.apache.calcite.rex.RexCall
   9:       5928054      806215344  org.apache.calcite.rel.rules.MultiJoin
  10:       5931519      711782280  org.apache.calcite.rel.logical.LogicalJoin
  11:      12418499      596087952  org.apache.calcite.rex.RexInputRef
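If there is no .hprof under /var/log/dremio, a dump can usually be captured manually the next time the heap fills up. A rough sketch, assuming kubectl access to the master pod and that jps/jmap are available in the image:

  # Find the Dremio server PID inside the container
  kubectl exec dremio-master-0 -n dremio -- jps -l
  # Dump live objects to a path the pod can write to (the file can be several GB)
  kubectl exec dremio-master-0 -n dremio -- jmap -dump:live,format=b,file=/var/log/dremio/manual.hprof <pid>
  # Copy it out for analysis in VisualVM
  kubectl cp dremio/dremio-master-0:/var/log/dremio/manual.hprof ./manual.hprof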

No heap dump was generated.

We will monitor this and comb through the profiles for any such query plans.
We are continuously firing queries on Dremio; could this be one of the reasons?

Also, the job max age (days) setting is 1.

@unni

Job max age days should not cause this. Here are a few questions:

  • How many sources do you have?
  • What is the metadata refresh frequency on each?
  • How many concurrent queries are you firing?
  • How many cores on the coordinator?
  • Do your queries do an inline metadata refresh? Open a profile and see if a lot of time is spent in metadata, or open the planning tab and check whether time is spent in validation or convert to rel, or scroll down below “Final Physical Transformation” and look for PARTIAL_METADATA
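Also, if the coordinator heap starts climbing again before a dump is available, a live class histogram would confirm whether the same CachingRelMetadataProvider entries are growing. A sketch, assuming jcmd is in the image and <pid> is the Dremio server PID from jps:

  kubectl exec dremio-master-0 -n dremio -- jcmd <pid> GC.class_histogram | head -n 30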
How many sources do you have?
  -- 3 Postgres sources
What is the metadata refresh frequency on each?
  -- Fetch every one hour
How many concurrent queries are you firing?
  -- 30 concurrent requests, around 27,000 requests in total over a span of 2-3 hours
How many cores on the coordinator?
  -- 16 cores, 64 GB RAM (45 GB heap)
Do your queries do an inline metadata refresh?
  -- Planning time is not high. I cannot find the keyword PARTIAL_METADATA on the planning page