After upgrading to 25.1.0, Dremio crashes (SIGSEGV)

Hello,

After updating Dremio from 24.3.2 to 25.1.0, we are experiencing crashes (SIGSEGV) when running queries that ran fine before.

We are running Dremio on K8s, installed with the “official” Helm chart.

The Dremio executor crashes with this error:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f1a3b472881, pid=1, tid=268
#
# JRE version: OpenJDK Runtime Environment Temurin-11.0.24+8 (11.0.24+8) (build 11.0.24+8)
# Java VM: OpenJDK 64-Bit Server VM Temurin-11.0.24+8 (11.0.24+8, mixed mode, sharing, tiered, compressed oops, g1 gc, linux-amd64)
# Problematic frame:
# C  [libjoust.so+0x72881]  Joust::HashTable::LTable<16, false>::Find(unsigned short const*, int, int, unsigned char const*, unsigned char const*, int const*, int*, Joust::HashTable::NullMaskType, Joust::HashTable::NullMaskValue)+0x191
#
# Core dump will be written. Default location: Core dumps may be processed with "/usr/share/apport/apport -p%p -s%s -c%c -d%d -P%P -u%u -g%g -- %E" (or dumping to /opt/dremio/core.1)
#
# An error report file with more information is saved as:
# /tmp/hs_err_pid1.log
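
For anyone triaging a similar crash, the header above already names the signal, the JRE build, and the native frame that faulted. Below is a minimal Python sketch that pulls those fields out of an hs_err header so they can be compared between versions. The function name `summarize_hs_err` and the dictionary keys are illustrative assumptions, not part of Dremio or the JDK, and the regular expressions only cover the header layout shown in this post.

```python
import re

# Header lines copied from the crash report in this post.
SAMPLE_HEADER = """\
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f1a3b472881, pid=1, tid=268
#
# JRE version: OpenJDK Runtime Environment Temurin-11.0.24+8 (11.0.24+8) (build 11.0.24+8)
# Problematic frame:
# C  [libjoust.so+0x72881]  Joust::HashTable::LTable<16, false>::Find(unsigned short const*, int, int, unsigned char const*, unsigned char const*, int const*, int*, Joust::HashTable::NullMaskType, Joust::HashTable::NullMaskValue)+0x191
#
"""

SIGNAL_RE = re.compile(
    r"(SIG[A-Z]+) \((0x[0-9a-fA-F]+)\) at pc=(0x[0-9a-fA-F]+), pid=(\d+), tid=(\d+)"
)
FRAME_RE = re.compile(r"^(\w)\s+\[(.+?)\+(0x[0-9a-fA-F]+)\]\s*(.*)$")


def summarize_hs_err(text):
    """Return a dict with the signal, JRE version, and problematic frame.

    Hypothetical helper: it assumes the header layout shown above and
    silently skips any field it cannot find.
    """
    summary = {}
    lines = text.splitlines()
    for i, line in enumerate(lines):
        body = line.lstrip("#").strip()  # drop the leading '#' and padding
        signal = SIGNAL_RE.match(body)
        if signal:
            summary["signal"] = signal.group(1)
            summary["pc"] = signal.group(3)
            summary["pid"] = int(signal.group(4))
            summary["tid"] = int(signal.group(5))
        elif body.startswith("JRE version:"):
            summary["jre"] = body[len("JRE version:"):].strip()
        elif body.startswith("Problematic frame:") and i + 1 < len(lines):
            # The frame itself is on the line after the label.
            frame = FRAME_RE.match(lines[i + 1].lstrip("#").strip())
            if frame:
                summary["frame_type"] = frame.group(1)  # 'C' means native code
                summary["library"] = frame.group(2)
                summary["offset"] = frame.group(3)
                summary["symbol"] = frame.group(4)
    return summary


if __name__ == "__main__":
    for key, value in summarize_hs_err(SAMPLE_HEADER).items():
        print(f"{key}: {value}")
```

Running the same function over the hs_err file from a later version makes it easy to see whether the faulting library and symbol are unchanged.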

Logs from the master at the same time:

2024-09-10 14:33:26,294 [FABRIC-rpc-event-queue] INFO  c.d.sabot.exec.FragmentExecutors - Received remote fragment start instruction for 191fa749-540c-9a0c-cc51-9991a070c300:0:0 with assigned weight 1 and scheduling weight 1
2024-09-10 14:33:26,298 [e9 - 191fa749-540c-9a0c-cc51-9991a070c300:frag:0:0] INFO  c.d.exec.expr.ExpressionSplitter - Named expression: FunctionHolderExpression [args=[ValueVectorReadExpression [fieldId=TypedFieldId [fieldIds=[0], remainder=null]], FunctionHolderExpression [args=[FunctionHolderExpression [args=[FunctionHolderExpression [args=[ValueExpression[quoted_string=.], ValueVectorReadExpression [fieldId=TypedFieldId [fieldIds=[0], remainder=null]]], name=position, returnType=int32, isRandom=false], ValueExpression[int=1]], name=add, returnType=int32, isRandom=false]], name=castBIGINT, returnType=int64, isRandom=false]], name=substring, returnType=varchar, isRandom=false]
2024-09-10 14:33:26,310 [e9 - 191fa749-540c-9a0c-cc51-9991a070c300:frag:0:0] ERROR com.dremio.sabot.driver.SmartOp - StatusRuntimeException: CANCELLED: Server sendMessage() failed with Error
com.dremio.common.exceptions.UserException: StatusRuntimeException: CANCELLED: Server sendMessage() failed with Error
        at com.dremio.common.exceptions.UserException$Builder.build(UserException.java:984)
        at com.dremio.sabot.driver.SmartOp.contextualize(SmartOp.java:203)
        at com.dremio.sabot.driver.SmartOp$SmartProducer.outputData(SmartOp.java:599)
        at com.dremio.sabot.driver.StraightPipe.pump(StraightPipe.java:55)
        at com.dremio.sabot.driver.Pipeline.doPump(Pipeline.java:134)
        at com.dremio.sabot.driver.Pipeline.pumpOnce(Pipeline.java:124)
        at com.dremio.sabot.exec.fragment.FragmentExecutor$DoAsPumper.run(FragmentExecutor.java:655)
        at com.dremio.sabot.exec.fragment.FragmentExecutor.run(FragmentExecutor.java:560)
        at com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl.run(FragmentExecutor.java:1234)
        at com.dremio.sabot.task.AsyncTaskWrapper.run(AsyncTaskWrapper.java:130)
        at com.dremio.sabot.task.slicing.SlicingThread.mainExecutionLoop(SlicingThread.java:279)
        at com.dremio.sabot.task.slicing.SlicingThread.run(SlicingThread.java:186)
Caused by: io.grpc.StatusRuntimeException: CANCELLED: Server sendMessage() failed with Error
        at io.grpc.Status.asRuntimeException(Status.java:533)
        at io.grpc.stub.ClientCalls$BlockingResponseStream.hasNext(ClientCalls.java:631)
        at com.dremio.exec.store.ischema.writers.TableWriter.write(TableWriter.java:77)
        at com.dremio.exec.store.ischema.InformationSchemaRecordReader.next(InformationSchemaRecordReader.java:102)
        at com.dremio.sabot.op.scan.ScanOperator.outputData(ScanOperator.java:418)
        at com.dremio.sabot.driver.SmartOp$SmartProducer.outputData(SmartOp.java:595)
        ... 9 common frames omitted
2024-09-10 14:33:36,524 [FABRIC-rpc-event-queue] INFO  com.dremio.sabot.exec.MaestroProxy - All queries on executor are active on coordinator. No queries to cancel.

We have now switched back to Dremio 24.3.2 and the queries are working again without any problems.

Is anyone else experiencing these crashes? Any idea how to fix this?

Thank you in advance and best regards,
Nico

Hi, sorry you had to roll back. Can you please upload the file /tmp/hs_err_pid1.log? That will help us get to the bottom of the issue.

Hi @balaji.ramaswamy,

I’ve sent you the log by PM.

Thanks in advance for taking a look!

Best regards,
Nico

@rybnico Can we please have the profile for job ID 19152ff2-1d8f-507a-272e-8c37d9a74300?

Hi @balaji.ramaswamy,

I’ve sent you the query profile by PM.

Thanks again and best regards,
Nico

@rybnico Thanks, we suspect a known issue that Engineering is looking into.

Hi @balaji.ramaswamy,

I saw that Dremio 25.1.1 has been released, also as an OSS version. Do you know if this version fixes the problem?

Thanks and best regards,
Nico

@rybnico Yes, 25.1.1 should fix the issue.

Hello @balaji.ramaswamy,
Unfortunately, Dremio 25.1.1 still crashes with our query. I will send you the query profiles and error log by PM.

Best regards,
Nico

@rybnico Sorry to hear that. Can you please send the job profile and the hs_err_pid file from the new version? I would like to validate whether it is still the same issue.

Hi @balaji.ramaswamy,

I have already sent it to you via PM.

Best regards,
Nico