JVM crash (thread dump) during Gandiva execution

Hi Dremio team,

I’m using Dremio 4.7.0 with Oracle JDK 1.8.0_261.

When I execute a Gandiva-based query, the executor node fails with the error “ExecutionSetupException: One or more nodes lost connectivity during query. Identified nodes were”.

Checking the executor log, I see a JVM crash dump, and the crash makes the Dremio process restart:

dremio[18563]: Creating gandiva cache with capacity: 500

A fatal error has been detected by the Java Runtime Environment:

SIGILL (0x4) at pc=0x00007fb4036290c0, pid=18563, tid=0x00007fb3f10d0700

JRE version: Java™ SE Runtime Environment (8.0_261-b12) (build 1.8.0_261-b12)

Java VM: Java HotSpot™ 64-Bit Server VM (25.261-b12 mixed mode linux-amd64 compressed oops)

Problematic frame:

C 0x00007fb4036290c0

Core dump written. Default location: //core or core.18563

An error report file with more information is saved as:

/tmp/hs_err_pid18563.log

If you would like to submit a bug report, please visit:

http://bugreport.java.com/bugreport/crash.jsp

The crash happened outside the Java Virtual Machine in native code.

See problematic frame for where to report the bug
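For reference, the hs_err file records the raw instruction bytes around the crash PC, which is what tells you whether the faulting instruction is one the CPU does not support. A minimal sketch of pulling that section out (the file contents below are an illustrative mock-up, not my actual report; a real report lives at a path like /tmp/hs_err_pid18563.log):

```shell
# Write an illustrative hs_err fragment (mock-up, not the real crash file)
cat > /tmp/hs_err_sample.log <<'EOF'
siginfo: si_signo: 4 (SIGILL), si_code: 2 (ILL_ILLOPN)

Instructions: (pc=0x00007fb4036290c0)
0x00007fb4036290a0:   62 f1 7d 48 10 07 c5 f8
EOF

# Show the instruction bytes recorded around the crash PC;
# a leading 0x62 byte is the EVEX prefix used by AVX-512 encodings
grep -A 1 'Instructions:' /tmp/hs_err_sample.log
```

If the bytes at the crash PC decode to an AVX-512 instruction, that points at a CPU-feature mismatch rather than a logic bug.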

Please help me, thanks very much team.

hs_err_pid10254.log.zip (37.3 KB)

@flykent

We have fixed this issue in Dremio 4.7.2/4.7.3; please download, upgrade, and validate.

Thanks
Bali

hi @balaji.ramaswamy

I have upgraded to version 4.7.3, but the error still occurs when running the query with Gandiva. :((

Screen Shot 2020-08-31 at 18.56.25

Hi,

Can you please provide the query profile?

Thanks
Bali

hi @balaji.ramaswamy

I am currently using an Intel® Xeon® Silver 4216 CPU, which supports the AVX-512 instruction flags. My guess is that the JVM crash dump is caused by an incompatibility between the AVX-512 flags and Gandiva.
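One quick way to check that guess from inside the guest (a sketch; the exact flag list depends both on the physical CPU and on what the KVM hypervisor passes through):

```shell
# List the distinct AVX-512 feature flags the (guest) CPU advertises;
# empty output means the hypervisor exposes no AVX-512 to this machine.
grep -o 'avx512[a-z_]*' /proc/cpuinfo 2>/dev/null | sort -u
```

On a Xeon Silver 4216 this should list flags such as avx512f and avx512cd, unless the hypervisor masks them.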

ea6358f4-a544-4ee4-872f-19a8a37e8b29.zip (12.7 KB)

hi team , @balaji.ramaswamy

Has this problem been resolved, or is it a Dremio bug?

Thanks

@flykent

Sorry for the delay, the profile you sent shows 4.7.0?

  • Dremio Version:4.7.0-202008140043000133-4b5158db
  • Submission Version:4.7.0-202008140043000133-4b5158db

hi @balaji.ramaswamy

Yes! The profiles from both versions (4.7.0 & 4.7.3) have the same content; the problem is that both Dremio versions (4.7.0 & 4.7.3) fail in the same way.

I have debugged further and noticed that Oracle JDK 1.8 only supports a MaxVectorSize of 32 bytes (256-bit vectors), but Gandiva’s Arrow execution uses 512-bit AVX-512 instructions. I suspect this incompatibility is what crashes the JVM.

I need a practical solution for this error, but both versions show the same hardware incompatibility, so this Dremio bug is blocking our upgrade to the new Intel hardware.

Thanks.

20b21ef8-15d1-ba63-55ba-1e159c7ccd00-4.7.3.zip (15.2 KB)
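In the meantime, one mitigation we can try on the JVM side (a sketch, assuming the standard `conf/dremio-env` file and its `DREMIO_JAVA_SERVER_EXTRA_OPTS` variable; note this caps only HotSpot’s own JIT-generated code at AVX2, so if the SIGILL actually comes from Gandiva’s LLVM-compiled native kernels, as the native “Problematic frame” suggests, it will not help):

```shell
# conf/dremio-env (config fragment, assumed standard location):
# cap the JIT at AVX2 so HotSpot itself emits no AVX-512 instructions.
DREMIO_JAVA_SERVER_EXTRA_OPTS="-XX:UseAVX=2"
```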

@flykent

Let me look into this and get back to you

Thanks
Bali

Thanks for the response; looking forward to the team’s results. Dremio has great potential for public service in our corporation, and this is the only problem holding it back.

@balaji.ramaswamy

thanks

@balaji.ramaswamy

Can the Dremio team solve this error? A week has passed!

@flykent

Sorry about the delay, we will look at this issue this week

Thanks
Bali

@flykent

One more request, would you also be able to upload the hs_err_pid file?

Thanks
Bali

hi @balaji.ramaswamy

I’m sending you the JVM crash dump file. Hoping for your support.

Thanks.

hs_err_pid10254.log.zip (37.3 KB)

Hello @flykent

Thanks for uploading the hs_err_pid file, will update in the next couple of days

Thanks
Bali

@flykent,

Can you please answer a couple of questions?

  1. What kind of deployment is your Dremio environment: Kubernetes or VMs?
  2. Can you run the query again removing the following part and see if query runs successfully?
    ndv(cast("rundate" as date)) as "count_distinct_rundate"

hi @Ye_Li @balaji.ramaswamy

My Dremio deployment model looks like this:

We have two clusters running Dremio, one on Kubernetes and one on VMs; both run on a KVM hypervisor on Intel hardware (Intel® Xeon® Silver 4216 CPU).

I have now re-run the failing Gandiva query with the ndv(cast("rundate" as date)) as "count_distinct_rundate" part removed, per your instructions, on both Kubernetes and the VMs; the result is still a JVM crash dump.

I’m attaching the JVM crash dump and the failed job profile below:

9e3e36b6-4721-4c0f-9651-e4668c7701bd.zip (14.8 KB)

hs_err_pid6367.log.zip (32.8 KB)

Thanks.