Query exceeded memory limits

I’m having issues with some reflections crashing with the error: “Query was cancelled because it exceeded the memory limits set by the administrator.”

Can you please help me troubleshoot what is going on?

Attached query profiles:
tc_operator_work_history_rep.zip (145.5 KB)
mtms_service_request_items.zip (78.3 KB)
This one is very odd, since it is a VDS with only two columns and only 8 days’ worth of data, so I don’t understand how it can exceed the memory limit:
gt_latency_check.zip (101.6 KB)

Thank you very much for the help.

@martinocando

For the last query, there were a few other queries running on Dremio when this one went OOM. If you add up the actual values of all the query allocators, they should sum up to the actual on the ROOT allocator; if they don’t, add up the peak values instead. You can check the other two jobs the same way (see the parsing sketch after the dump below).

Allocator(ROOT) 0/23622266368/23636203136/23622320128 (res/actual/peak/limit) numChildAllocators:12

Allocator(query-1fff9cc4-46b1-4e26-ee5a-69fc6ad88800) 0/8953235264/9284486040/15032385536 (res/actual/peak/limit) numChildAllocators:1
Allocator(query-1fff9b9b-caaa-dac0-acc4-66042611f200) 0/5384800192/5388537472/15032385536 (res/actual/peak/limit) numChildAllocators:1
Allocator(query-1fff3681-14f6-0ee4-7535-023ce98d2b01) 0/5091755792/6536967236/15032385536 (res/actual/peak/limit) numChildAllocators:37
Allocator(query-1fff9bdd-55e8-412f-fca4-ed27da448500) 0/1969635200/2324823808/15032385536 (res/actual/peak/limit) numChildAllocators:5
Allocator(query-1ffe5d80-bb9d-515c-92b2-11ee7d38b600) 0/1222837104/1223059368/15032385536 (res/actual/peak/limit) numChildAllocators:7
Allocator(query-1ffe5d7f-e8a2-8927-c841-fc347c0e4400) 0/559598656/654668096/15032385536 (res/actual/peak/limit) numChildAllocators:4
Allocator(query-1fff2bf8-b471-3721-a6c6-6c9810845a00) 0/65536/623976312/15032385536 (res/actual/peak/limit) numChildAllocators:1
Allocator(query-1ffecbed-acf5-74c2-d4f0-4b1182fb0900) 0/65536/592038040/15032385536 (res/actual/peak/limit) numChildAllocators:1
Allocator(query-1fff16ec-3799-54d7-ce7e-8ae913c90400) 0/65536/592056144/15032385536 (res/actual/peak/limit) numChildAllocators:1
Allocator(query-1fff0d69-d734-9cb7-3753-0b89f817ae00) 0/49152/511220424/15032385536 (res/actual/peak/limit) numChildAllocators:1
Allocator(query-1ffe5d7f-6ce8-4d92-2df6-a4c2d96ace00) 0/334629056/334814400/15032385536 (res/actual/peak/limit) numChildAllocators:2
Allocator(query-1fff3644-eb34-831c-9fde-6edcc8060500) 0/32768/600452360/15032385536 (res/actual/peak/limit) numChildAllocators:2
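For reference, here is a minimal sketch (Python; the regex and field names are assumptions based on the dump format above, not a Dremio tool) that sums the per-query actual and peak values so you can compare them against the ROOT allocator:

```python
import re
import sys

# Sum per-query allocator usage from a dump like the one above and compare it
# to the ROOT allocator. Assumes each line looks like:
#   Allocator(name) res/actual/peak/limit (res/actual/peak/limit) numChildAllocators:N
LINE_RE = re.compile(
    r"Allocator\((?P<name>[^)]+)\)\s+"
    r"(?P<res>\d+)/(?P<actual>\d+)/(?P<peak>\d+)/(?P<limit>\d+)"
)

def summarize(dump_text: str) -> None:
    root = None
    children = []
    for m in LINE_RE.finditer(dump_text):
        entry = {
            "name": m.group("name"),
            "actual": int(m.group("actual")),
            "peak": int(m.group("peak")),
        }
        if entry["name"] == "ROOT":
            root = entry
        else:
            children.append(entry)

    actual_sum = sum(c["actual"] for c in children)
    peak_sum = sum(c["peak"] for c in children)
    print(f"ROOT actual           : {root['actual']:,}")
    print(f"sum of children actual: {actual_sum:,}")
    print(f"sum of children peak  : {peak_sum:,}")

if __name__ == "__main__":
    summarize(sys.stdin.read())
```

Usage would be something like `python summarize_allocators.py < allocator_dump.txt`, where `allocator_dump.txt` is a hypothetical file containing the lines pasted above.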

To see how much memory each node has consumed, go to the Jobs page and click on the job. In the right pane of the window, click on Profile, scroll down to Nodes, and expand the Nodes section.

Thanks, Balaji.
That is very useful information to help us troubleshoot our cluster. Overall, there were a lot of queries and reflections failing with the same error. Then one of the nodes got into a wedged state and had to be rebooted, and once that happened, all the jobs that had been failing started to run without issues.
Apparently we have some kind of memory leak.

Any advice on how we can troubleshoot this? Maybe look into some logs?

Thanks

@martinocando

It seems like you are running on 4 executors, so either reduce concurrency or increase the number of executors. One other thing you can do is upgrade to the latest version, and if the issue happens again, look at the server.gc logs to see why there was a Full GC. That will tell us whether this is just a capacity issue.
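As a starting point, here is a minimal sketch for scanning a GC log for Full GC events (the log file name, its location, and the exact “Full GC” wording depend on your JVM version and GC flags, so treat these as assumptions):

```python
import re
import sys

# Scan a JVM GC log and print the lines that record Full GC events, so you can
# see how often they happen. Pass the log file as the first argument (e.g. the
# server.gc file from the executor's log directory); the "Full GC" marker is
# the usual HotSpot wording, adjust the pattern if your GC flags differ.
FULL_GC_RE = re.compile(r"Full GC", re.IGNORECASE)

def report_full_gcs(path: str) -> None:
    count = 0
    with open(path, errors="replace") as gc_log:
        for line in gc_log:
            if FULL_GC_RE.search(line):
                count += 1
                print(line.rstrip())
    print(f"\n{count} Full GC event(s) found in {path}")

if __name__ == "__main__":
    report_full_gcs(sys.argv[1])
```

Run it against the executor’s GC log, e.g. `python full_gc_report.py server.gc`; frequent Full GCs that reclaim little memory would point at a capacity problem or a leak.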

Thanks
Bali

I had this same problem with part of my reflections, and in my case it was due to Elasticsearch field nesting. I compared them to the other reflections that were still working, and the difference was two nested fields in Elasticsearch (e.g. aaa.bbb) that were not present in the working reflections. I unnested these fields and that solved my problem.