We are running Dremio 4.5 on Kubernetes. The master pod becomes unhealthy and does not restart automatically; a manual replacement of the pod is required to get it back into the RUNNING state.
Can you suggest some ways to debug this?
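One place to start is on the Kubernetes side: confirm why the pod is being marked unhealthy and whether anything is even trying to restart it. A minimal sketch, assuming the default Helm chart pod name dremio-master-0 and namespace; adjust names to your deployment:

# Probe failures, container state and recent events for the master pod
kubectl describe pod dremio-master-0

# Events sorted by time often show repeated liveness/readiness probe failures
kubectl get events --sort-by=.lastTimestamp | grep dremio-master

# Coordinator logs; --previous shows the last terminated container, if Kubernetes did restart it
kubectl logs dremio-master-0 --tail=500
kubectl logs dremio-master-0 --previous

If the liveness probe is missing or only checks that the process exists, a hung or heavily GC-pausing JVM can stay in Running without ever being restarted, which would match the symptom described.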
Attaching the GC logs taken after the restart: gclogs.zip (2.9 MB).
Looking at the class histograms, it appears there is one query plan that is filling up the coordinator heap. Is there a heap dump generated under “/var/log/dremio”? It should have the extension .hprof. If you do find one, load it into VisualVM and run the OQL below; it will give you the job IDs of the queries that were running at the time. Open those profiles and see if you can find one with a gigantic plan.
select map(
    heap.objects('com.dremio.exec.work.protector.EnterpriseForemenWorkManager$EnterpriseAttemptManager'),
    function(it) {
        // one line per active query attempt: job ID, state and SQL description
        var desc = " queryId " + it.queryIdString.value.toString();
        desc = desc + " queryState " + it.state.name.toString();
        desc = desc + " querySql " + it.profileTracker.queryDescription.toString();
        return desc;
    }
)
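If no .hprof shows up on its own, one option is to capture a dump manually while the heap is filling up. A sketch, assuming shell access to the master pod and that the image ships the JDK tools (jps/jmap); the pod name, file path and <pid> placeholder are illustrative:

# Exec into the coordinator pod (pod name taken from the Helm chart defaults)
kubectl exec -it dremio-master-0 -- bash

# Find the Dremio server process, then write a heap dump of live objects
jps -l | grep -i dremio
jmap -dump:live,format=b,file=/var/log/dremio/coordinator.hprof <pid>

The resulting file can then be opened in VisualVM and the OQL above run from its OQL Console view.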
2021-03-03T01:32:17.792+0000: 46663.642: [Full GC (Allocation Failure) 2021-03-03T01:32:17.792+0000: 46663.642: [Class Histogram (before full gc):
 num     #instances          #bytes  class name
-----------------------------------------------
   1:      36450549      7799446624  [C
   2:     108970076      6974084864  com.dremio.exec.planner.sql.handlers.CachingRelMetadataProvider$CacheEntry
   3:     108969903      6974073792  com.dremio.exec.planner.sql.handlers.CachingRelMetadataProvider$CacheKey
   4:     109041219      5233978512  java.util.concurrent.ConcurrentHashMap$Node
   5:           781      2149100208  [Ljava.util.concurrent.ConcurrentHashMap$Node;
   6:      36449855      1166395360  java.lang.String
   7:      24366094      1012287680  [Ljava.lang.Object;
   8:      17821567       855435216  org.apache.calcite.rex.RexCall
   9:       5928054       806215344  org.apache.calcite.rel.rules.MultiJoin
  10:       5931519       711782280  org.apache.calcite.rel.logical.LogicalJoin
  11:      12418499       596087952  org.apache.calcite.rex.RexInputRef
No heap dump was generated.
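If a dump on the next occurrence would help, the standard HotSpot flags below make the JVM write one automatically on OutOfMemoryError. How the flags get passed to the coordinator (dremio-env, or extraStartParams in the Helm values) depends on your deployment, so treat that part as an assumption:

# write an .hprof to the Dremio log directory when the heap is exhausted
-XX:+HeapDumpOnOutOfMemoryError
-XX:HeapDumpPath=/var/log/dremio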
We will monitor this and comb for any such query plans.
We are continuously firing queries on Dremio; could this be one of the reasons?
Also, the job max age (days) setting is 1.
Job max age days should not cause this. Here are a few questions:
How many sources do you have?
-- 3 Postgres sources
What is the metadata refresh frequency on each?
-- Fetch every hour
How many concurrent queries are you firing?
-- 30 concurrent requests, about 27,000 requests in total over a span of 2-3 hours
How many cores on the coordinator?
-- 16 cores, 64 GB RAM (45 GB heap)
Do your queries do an inline metadata refresh? Open a profile and see if a lot of time is spent in metadata, or open the Planning tab and check whether time is spent on validation or convert to rel, or scroll down below “Final Physical Transformation” and look for PARTIAL_METADATA.
-- Planning time is not high. I cannot find the keyword PARTIAL_METADATA on the planning page.