MERGE INTO failing

I am trying to use a MERGE INTO statement to merge 2.2 million records into a table with 39 million records.
The source dataset is in Parquet format and the target dataset is Iceberg. After running for ~7 minutes, I get the following error:

SYSTEM ERROR: IllegalStateException: All in memory entries belong to same group. Replay never converge. Failing the query

SqlOperatorImpl HASH_JOIN
Location 5:0:4
Fragment 5:0

[Error Id: 78dfffcd-b910-4158-b6ba-839a0b7a15ba on coordinator-server2:0]

(java.lang.IllegalStateException) All in memory entries belong to same group. Replay never converge. Failing the query
com.google.common.base.Preconditions.checkState():502
com.dremio.sabot.op.join.vhash.spill.partition.MultiPartition.reset():500
com.dremio.sabot.op.join.vhash.spill.replay.JoinReplayer.reset():268
com.dremio.sabot.op.join.vhash.spill.replay.JoinReplayer.run():155
com.dremio.sabot.op.join.vhash.spill.replay.JoinRecursiveReplayer.run():50
com.dremio.sabot.op.join.vhash.spill.VectorizedSpillingHashJoinOperator.outputData():630
com.dremio.sabot.driver.SmartOp$SmartDualInput.outputData():416
com.dremio.sabot.driver.StraightPipe.pump():56
com.dremio.sabot.driver.Pipeline.doPump():124
com.dremio.sabot.driver.Pipeline.pumpOnce():114
com.dremio.sabot.exec.fragment.FragmentExecutor$DoAsPumper.run():544
com.dremio.sabot.exec.fragment.FragmentExecutor.run():472
com.dremio.sabot.exec.fragment.FragmentExecutor.access$1700():106
com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl.run():978
com.dremio.sabot.task.AsyncTaskWrapper.run():121
com.dremio.sabot.task.slicing.SlicingThread.mainExecutionLoop():249
com.dremio.sabot.task.slicing.SlicingThread.run():171
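For context, the statement has roughly the following shape (all identifiers below are placeholders, not the actual table or column names):

```sql
-- Sketch of the failing statement; names and join key are hypothetical
MERGE INTO warehouse.target_table AS t   -- Iceberg target, ~39 million rows
USING staging."updates" AS s             -- Parquet source, ~2.2 million rows
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET val = s.val
WHEN NOT MATCHED THEN
  INSERT VALUES (s.id, s.val)
```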

Hi @sgoldsmith

The error looks like a hash join spill bug fixed in v24. Which version are you on? Can you provide a profile?

@Benny_Chow We are using 24.0.0. Could a bad dremio.conf file be a source of the issue? We are currently using NAS but didn't properly change the dremio.conf file's paths.dist from pdfs://"${paths.local}"/pdfs to file://"${paths.local}"/pdfs, which we are in the middle of fixing.
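For reference, the change we are making looks roughly like this in dremio.conf (the local path is illustrative; the ${paths.local} substitution is standard HOCON):

```
paths: {
  local: "/var/lib/dremio"  # illustrative local path
  # old value (wrong for NAS): dist: "pdfs://"${paths.local}"/pdfs"
  dist: "file://"${paths.local}"/pdfs"
}
```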

I doubt this is related to dremio.conf settings.

The query executors are running out of memory during the hash join and spilling to disk. Whether the hash join can still complete successfully depends on factors such as total cluster memory size, the number of operators in the query, and current query concurrency.

Then it must have been a timing issue around available resources. Queries that were previously failing are now working as intended.

Hi @sgoldsmith, in v24 a new version of hash join was introduced which has spill capabilities. The above exception is seen only when exec.op.join.spill is set to true (by default the value is false). Was this setting changed, given that the query is passing now?


exec.op.join.spill was set to true when I first got the error; then I set it to false and the query ran successfully, and then I set it back to true and it ran successfully again. I will leave the value as false while I conduct more testing.
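For anyone following along, assuming the standard support-key syntax, the value can be toggled with SQL:

```sql
-- Disable the spilling hash join cluster-wide (admin privileges required)
ALTER SYSTEM SET "exec.op.join.spill" = false;

-- Or only for the current session while testing
ALTER SESSION SET "exec.op.join.spill" = false;
```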

Could you please share the profiles for all three of the above runs? It would help in analyzing the issue. Thanks!