Dremio joining 2 Parquet files always fails

Hi,

Dremio provisions its YARN executors on a MapR cluster. Each executor is given 32 GB of memory, and there are 4 executors in total. I am trying to create a reflection on 30 GB of Parquet files, but it always fails with the error below. Any ideas?

OUT_OF_MEMORY ERROR: One or more nodes ran out of memory while executing the query.

Memory failed due to not enough memory to sort even one batch of records.
Target Batch Size (in bytes) 1227600
Target Batch Size 1023
Batches spilled in failed run 0
Records spilled in failed run 0
Records to spill in current iteration of failed run 0
Records spilled in current iteration of failed run 0
Max batch size spilled in current iteration of failed run 0
Spilled Batch Schema of failed run null
Initial capacity for current iteration of failed run 0
Spill copy allocator-Name null
Spill copy allocator-Allocated memory 0
Spill copy allocator-Max allowed 0
Spill copy allocator-Init Reservation 0
Spill copy allocator-Peak allocated 0
Spill copy allocator-Head room 0
Disk runs 0
Spill count 0
Merge count 0
Max batch size amongst all disk runs 0
Disk run copy allocator-Name null
Disk run copy allocator-Allocated memory 0
Disk run copy allocator-Max allowed 0
Disk run copy allocator-Init Reservation 0
Disk run copy allocator-Peak allocated 0
Disk run copy allocator-Head room 0
Sort allocator-Name op:1:105:2:ExternalSort
Sort allocator-Allocated memory 196608
Sort allocator-Max allowed 63161283
Sort allocator-Init Reservation 20000000
Sort allocator-Peak allocated 88932352
Sort allocator-Head room 62964675
OOM Events Event: Failed to Reserve copy space for spill in MemoryRun failedInitReservation 63397888 failedMaxAllocation 63397888 previousAllocatedMemory 0 previousMaxAllocation 65536 previousInitReservation 65536 previousHeadRoom 65536
SqlOperatorImpl EXTERNAL_SORT
Location 1:105:2
Fragment 1:105

[Error Id: 8d15bc8e-631f-44fc-920e-e597bb923a40 on mtl-prep-mapr-03.preprod.stp.local:-1]

  (org.apache.arrow.memory.OutOfMemoryException) Memory failed due to not enough memory to sort even one batch of records.
    com.dremio.sabot.op.sort.external.ExternalSortOperator.rotateRuns():279
    com.dremio.sabot.op.sort.external.ExternalSortOperator.consumeData():172
    com.dremio.sabot.driver.SmartOp$SmartSingleInput.consumeData():229
    com.dremio.sabot.driver.StraightPipe.pump():59
    com.dremio.sabot.driver.Pipeline.doPump():82
    com.dremio.sabot.driver.Pipeline.pumpOnce():72
    com.dremio.sabot.exec.fragment.FragmentExecutor$DoAsPumper.run():288
    com.dremio.sabot.exec.fragment.FragmentExecutor$DoAsPumper.run():284
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1595
    com.dremio.sabot.exec.fragment.FragmentExecutor.run():243
    com.dremio.sabot.exec.fragment.FragmentExecutor.access$800():83
    com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl.run():577
    com.dremio.sabot.task.AsyncTaskWrapper.run():92
    com.dremio.sabot.task.slicing.SlicingThread.run():71
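
Reading the allocator figures in the error itself (a rough sketch that only restates the numbers printed above): the external sort operator is capped at roughly 60 MiB, presumably because the per-node query memory is divided across many parallel fragments (the failing operator sits in fragment 1:105), and the copy space it has to reserve before it can spill is slightly larger than that entire cap, so the reservation fails before even one batch can be sorted or spilled.

```python
# Values copied from the OOM error above (all in bytes).
sort_max_allowed   = 63_161_283   # Sort allocator-Max allowed
sort_allocated     = 196_608      # Sort allocator-Allocated memory
sort_head_room     = 62_964_675   # Sort allocator-Head room
spill_reservation  = 63_397_888   # failedInitReservation for the spill copy space
target_batch_bytes = 1_227_600    # Target Batch Size (in bytes)

MiB = 2 ** 20
print(f"sort operator budget : {sort_max_allowed / MiB:.1f} MiB")
print(f"spill copy to reserve: {spill_reservation / MiB:.1f} MiB")

# The spill copy reservation alone exceeds the sort operator's entire budget,
# so it cannot succeed no matter how few batches have been consumed so far.
print(f"reservation minus budget  : {spill_reservation - sort_max_allowed} bytes over")
print(f"reservation minus headroom: {spill_reservation - sort_head_room} bytes over")
print(f"batches that would fit in the headroom: {sort_head_room // target_batch_bytes}")
```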

@ravi.eze could you share the profile for the failed reflection creation job?

Here is the profile:
db7dc6cb-7e02-4152-9a5c-b015d2d32002.zip (87.0 KB)

Does this get fixed by restarting the NodeManager, since the job runs on YARN?

In another case, I tried joining two Parquet files (30 GB and 109 GB). My Dremio server has 8 cores and 32 GB of memory. The join did not complete even after 7 hours, and throughout that time the load average on the Dremio box was 56. It looked like the Arrow writer was taking all the time, so I assumed that all the data is being brought to the Dremio box and the join happens there. I also have Drill installed on the MapR cluster. Is there a way to keep Dremio light and let it use the Drill that is available on the MapR cluster? The profile is attached below. This case had 4 executors provisioned via YARN, each with 4 cores and 32 GB of memory.

3acedce0-c0b4-41cb-a708-1109bcd8e8cc.zip (2.9 MB)
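
Purely as back-of-the-envelope arithmetic on the figures quoted above (a sketch, not a diagnosis of where the join actually ran):

```python
# Figures from the post above.
left_gb, right_gb   = 30, 109   # sizes of the two Parquet inputs
executors           = 4
mem_per_executor_gb = 32
coordinator_cores   = 8
load_average        = 56        # observed on the coordinator box

total_input_gb    = left_gb + right_gb
total_executor_gb = executors * mem_per_executor_gb

# The raw inputs already exceed the combined executor memory, before the
# Parquet data is even decompressed into Arrow buffers, so spilling to disk
# during the join/sort would not be surprising wherever it executes.
print(f"combined input size   : {total_input_gb} GB")
print(f"combined executor RAM : {total_executor_gb} GB")

# A load average of 56 on an 8-core box is roughly 7 runnable (or I/O-blocked)
# tasks per core, i.e. the coordinator itself was heavily oversubscribed.
print(f"load per coordinator core: {load_average / coordinator_cores:.1f}")
```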

Were my answers relevant?

Hey @ravi.eze, since we're already tracking this on the Support Portal, our engineers are taking a look. We'll follow up there.