MERGE INTO failing

I am trying to use a MERGE INTO statement to merge 2.2 million records into a table with 39 million records.
The source dataset is in Parquet format and the target dataset is Iceberg. After running for ~7 minutes, I get the following error:

SYSTEM ERROR: IllegalStateException: All in memory entries belong to same group. Replay never converge. Failing the query

SqlOperatorImpl HASH_JOIN
Location 5:0:4
Fragment 5:0

[Error Id: 78dfffcd-b910-4158-b6ba-839a0b7a15ba on coordinator-server2:0]

(java.lang.IllegalStateException) All in memory entries belong to same group. Replay never converge. Failing the query
com.google.common.base.Preconditions.checkState():502
com.dremio.sabot.op.join.vhash.spill.partition.MultiPartition.reset():500
com.dremio.sabot.op.join.vhash.spill.replay.JoinReplayer.reset():268
com.dremio.sabot.op.join.vhash.spill.replay.JoinReplayer.run():155
com.dremio.sabot.op.join.vhash.spill.replay.JoinRecursiveReplayer.run():50
com.dremio.sabot.op.join.vhash.spill.VectorizedSpillingHashJoinOperator.outputData():630
com.dremio.sabot.driver.SmartOp$SmartDualInput.outputData():416
com.dremio.sabot.driver.StraightPipe.pump():56
com.dremio.sabot.driver.Pipeline.doPump():124
com.dremio.sabot.driver.Pipeline.pumpOnce():114
com.dremio.sabot.exec.fragment.FragmentExecutor$DoAsPumper.run():544
com.dremio.sabot.exec.fragment.FragmentExecutor.run():472
com.dremio.sabot.exec.fragment.FragmentExecutor.access$1700():106
com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl.run():978
com.dremio.sabot.task.AsyncTaskWrapper.run():121
com.dremio.sabot.task.slicing.SlicingThread.mainExecutionLoop():249
com.dremio.sabot.task.slicing.SlicingThread.run():171
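For context, the statement has roughly the following shape (all identifiers below are placeholders, not the actual table or column names):

```sql
-- Sketch of the failing statement; names and join key are hypothetical
MERGE INTO warehouse.target_table AS t   -- Iceberg target, ~39 million rows
USING staging."updates" AS s             -- Parquet source, ~2.2 million rows
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET val = s.val
WHEN NOT MATCHED THEN
  INSERT VALUES (s.id, s.val)
```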

Hi @sgoldsmith

The error looks like a hash join spill bug fixed in v24. Which version are you on? Can you provide a profile?

@Benny_Chow We are using 24.0.0. Could a bad dremio.conf file be a source of the issue? We are currently using NAS but didn't properly change the dremio.conf file's paths.dist from pdfs://"${paths.local}"/pdfs to file://"${paths.local}"/pdfs, which we are in the middle of fixing.
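For reference, the change we are making looks roughly like this in dremio.conf (the local path is illustrative; the ${paths.local} substitution is standard HOCON):

```
paths: {
  local: "/var/lib/dremio"  # illustrative local path
  # old value (wrong for NAS): dist: "pdfs://"${paths.local}"/pdfs"
  dist: "file://"${paths.local}"/pdfs"
}
```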

I doubt this is related to dremio.conf settings.

The query executors are running out of memory during the hash join and spilling to disk. Whether the hash join can still complete successfully depends on factors such as total cluster memory size, the number of operators in the query, and current query concurrency.

Then it must have been a timing issue around available resources. Queries that were previously failing are now working as intended.

Hi @sgoldsmith, in v24 a new version of hash join was introduced which has spill capabilities. The above exception is seen only when exec.op.join.spill is set to true (by default the value is false). Was this setting changed, given that the query is passing now?


exec.op.join.spill was set to true when I first got the error; then I set it to false and the query ran successfully, and then I set it back to true and it ran successfully again. I will leave the value as false while I conduct more testing.
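For anyone following along, assuming the standard support-key syntax, the value can be toggled with SQL:

```sql
-- Disable the spilling hash join cluster-wide (admin privileges required)
ALTER SYSTEM SET "exec.op.join.spill" = false;

-- Or only for the current session while testing
ALTER SESSION SET "exec.op.join.spill" = false;
```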

Could you please share the profiles for all three of the above runs? It would help in analyzing the issue. Thanks!