New Instance Failure

Hey Guys,

I just installed Dremio on 3 VMs, running version 12.1.0-202101041749050132-55c827cb.

One coordinator and two executors.

I’ve connected it up to a couple of databases and run a query, but the executors keep dying with the following message:

ExecutionSetupException: One or more nodes lost connectivity during query. Identified nodes were [vdw-executor01:0]

I’ve got another cluster running version 4.0, and the same queries work on it without any issues.

I’ve got no idea where to start looking; I’ve tried everything from using the basic config values to defining all the values. Hopefully someone is willing to help out.

-------------- vdw-coordinator01 --------------------

```
paths: {
  local: "/opt/dremio/data"
  dist: "pdfs://"${paths.local}"/pdfs"
  db: ${paths.local}/db,
  spilling: [${paths.local}/spill]
  accelerator: ${paths.dist}/accelerator
  downloads: ${paths.dist}/downloads
  uploads: ${paths.dist}/uploads
  results: ${paths.dist}/results
  scratch: ${paths.dist}/scratch
}

services: {
  coordinator: {
    enabled: true,
    auto-upgrade: false,

    master: {
      enabled: true,
      embedded-zookeeper: {
        enabled: true,
        port: 2181,
        path: ${paths.local}/zk
      }
    },

    web: {
      enabled: true,
      port: 9047,
      auth: {
        type: "internal"
      }
      ui: {
        # Configuration for Intercom
        intercom: {
          enabled: true
          appid: "@dremio.ui.intercom.appid@"
        }
      }
      tokens: {
        cache: {
          size: 100
          expiration_minutes: 5
        }
      }
    },

    client-endpoint: {
      port: 31010
    },

    scheduler: {
      threads: 24
    },

    command-pool: {
      enabled: true
      size: 0 # 0 defaults to the machine's number of cores
    }
  },

  executor: {
    enabled: false
  },

  flight: {
    enabled: true,
    port: 32010,
    auth.mode: "arrow.flight.auth2"
  },

  fabric: {
    port: 45678,

    memory: {
      reservation: 100M
    }
  },

  web-admin: {
    enabled: true,
    # Port, bound to loopback interface, on which the daemon responds to liveness HTTP requests (0 == auto-allocated)
    port: 0
  }
}

zookeeper: "localhost:"${services.coordinator.master.embedded-zookeeper.port}
zk.client.session.timeout: 1800000
registration.publish-host: "vdw-coordinator01"
```

-------------- vdw-executor01 --------------------

```
paths: {
  local: "/opt/dremio/data"
  dist: "pdfs://"${paths.local}"/pdfs"
  db: ${paths.local}/db,
  spilling: [${paths.local}/spill]
  accelerator: ${paths.dist}/accelerator
  downloads: ${paths.dist}/downloads
  uploads: ${paths.dist}/uploads
  results: ${paths.dist}/results
  scratch: ${paths.dist}/scratch
}

services: {
  coordinator: {
    enabled: false,
    auto-upgrade: false

    master: {
      enabled: false
    },

    web: {
      enabled: false
    },

    client-endpoint: {
      port: 31010
    },

    scheduler: {
      threads: 24
    },

    command-pool: {
      enabled: true
      size: 0 # 0 defaults to the machine's number of cores
    }
  },

  executor: {
    enabled: true
  },

  flight: {
    enabled: true,
    port: 32010,
    auth.mode: "arrow.flight.auth2"
  },

  fabric: {
    port: 45678,

    memory: {
      reservation: 100M
    }
  },

  web-admin: {
    enabled: false,
  }
}

zookeeper: "vdw-coordinator01:2181"
zk.client.session.timeout: 1800000
```

-------------- vdw-executor02 --------------------

```
paths: {
  local: "/opt/dremio/data"
  dist: "pdfs://"${paths.local}"/pdfs"
  db: ${paths.local}/db,
  spilling: [${paths.local}/spill]
  accelerator: ${paths.dist}/accelerator
  downloads: ${paths.dist}/downloads
  uploads: ${paths.dist}/uploads
  results: ${paths.dist}/results
  scratch: ${paths.dist}/scratch
}

services: {
  coordinator: {
    enabled: false,
    auto-upgrade: false

    master: {
      enabled: false
    },

    web: {
      enabled: false
    },

    client-endpoint: {
      port: 31010
    },

    scheduler: {
      threads: 24
    },

    command-pool: {
      enabled: true
      size: 0 # 0 defaults to the machine's number of cores
    }
  },

  executor: {
    enabled: true
  },

  flight: {
    enabled: true,
    port: 32010,
    auth.mode: "arrow.flight.auth2"
  },

  fabric: {
    port: 45678,

    memory: {
      reservation: 100M
    }
  },

  web-admin: {
    enabled: false,
  }
}

zookeeper: "vdw-coordinator01:2181"
zk.client.session.timeout: 1800000
```
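
Since the error message is about lost connectivity, one basic sanity check is whether the fabric port from the configs above (45678) is reachable between the nodes. A minimal sketch, assuming `nc` is installed and the hostnames resolve:

```
# From the coordinator, check that each executor's fabric port accepts connections.
nc -zv vdw-executor01 45678
nc -zv vdw-executor02 45678
```

The same check in the other direction (executors back to the coordinator's fabric and ZooKeeper ports) would rule out one-way firewall rules.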

@kylevorster

`ExecutionSetupException: One or more nodes lost connectivity during query. Identified nodes were [vdw-executor01:0]` usually means vdw-executor01 went unresponsive. That is usually memory related, most probably your executor did a Full GC; if not, it would be CPU.
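
One quick way to confirm the Full GC theory is to grep the executor's GC log; the path below assumes the default `-Xloggc:/var/log/dremio/server.gc` setting (the same one visible in the `ps` output later in this thread):

```
# Show the most recent stop-the-world Full GC events on the executor, if any.
grep -n "Full GC" /var/log/dremio/server.gc | tail -5
```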

Are the 2 Dremio clusters sized the same in terms of CPU and memory? Send us the profiles of the one that worked and the one that failed.

The new cluster has a bit more CPU and memory.

The strange thing is, if I run the same query on the old cluster it works without any problems. Like I said, this is a clean install on a brand-new cluster, no queries saved, just running a single query. I’ve tried multiple queries, even just a `select count(*) from table limit 100`, and it’s the same thing.

Really strange. I usually don’t ask for help on forums, but I’ve tried everything I could think of.

I’ll send the profiles tomorrow morning when I get to the office. Thanks.

Here are the two profiles for the same query on the new and old clusters:

Old Cluster:
datastore-coordinator.zip (20.1 KB)

New Cluster:
vdw-coordinator01.zip (21.7 KB)

@kylevorster

From the new cluster, can you please send us the output of `ps -ef | grep dremio` from vdw-executor02.local?

Also, when you fire the query, how high do CPU and memory go on the Dremio node activity screen?

CPU shows 0% on all nodes and memory jumps to 1.22%

vdw-coordinator01

```
[root@vdw-coordinator01 ~]# ps -ef | grep dremio
dremio 44672 1 4 09:57 ? 00:02:22 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.275.b01-0.el7_9.x86_64/jre/bin/java -Djava.util.logging.config.class=org.slf4j.bridge.SLF4JBridgeHandler -Djava.library.path=/opt/dremio/lib -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/dremio/server.gc -Ddremio.log.path=/var/log/dremio -Ddremio.plugins.path=/opt/dremio/plugins -Xmx4096m -XX:MaxDirectMemorySize=8192m -XX:+PrintClassHistogramAfterFullGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dremio -Dio.netty.maxDirectMemory=0 -Dio.netty.tryReflectionSetAccessible=true -DMAPR_IMPALA_RA_THROTTLE -DMAPR_MAX_RA_STREAMS=400 -XX:+UseG1GC -cp /opt/dremio/conf:/opt/dremio/jars/*:/opt/dremio/jars/ext/*:/opt/dremio/jars/3rdparty/* com.dremio.dac.daemon.DremioDaemon
root 46527 44641 0 10:52 pts/0 00:00:00 grep --color=auto dremio
```

vdw-executor01

```
[root@vdw-executor01 ~]# ps -ef | grep dremio
dremio 44992 1 71 10:52 ? 00:00:07 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.275.b01-0.el7_9.x86_64/jre/bin/java -Djava.util.logging.config.class=org.slf4j.bridge.SLF4JBridgeHandler -Djava.library.path=/opt/dremio/lib -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/dremio/server.gc -Ddremio.log.path=/var/log/dremio -Ddremio.plugins.path=/opt/dremio/plugins -Xmx4096m -XX:MaxDirectMemorySize=8192m -XX:+PrintClassHistogramAfterFullGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dremio -Dio.netty.maxDirectMemory=0 -Dio.netty.tryReflectionSetAccessible=true -DMAPR_IMPALA_RA_THROTTLE -DMAPR_MAX_RA_STREAMS=400 -XX:+UseG1GC -cp /opt/dremio/conf:/opt/dremio/jars/*:/opt/dremio/jars/ext/*:/opt/dremio/jars/3rdparty/* com.dremio.dac.daemon.DremioDaemon
root 45098 44110 0 10:52 pts/0 00:00:00 grep --color=auto dremio
```

vdw-executor02

```
[root@vdw-executor02 ~]# ps -ef | grep dremio
dremio 44041 1 3 10:01 ? 00:01:38 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.275.b01-0.el7_9.x86_64/jre/bin/java -Djava.util.logging.config.class=org.slf4j.bridge.SLF4JBridgeHandler -Djava.library.path=/opt/dremio/lib -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/dremio/server.gc -Ddremio.log.path=/var/log/dremio -Ddremio.plugins.path=/opt/dremio/plugins -Xmx4096m -XX:MaxDirectMemorySize=8192m -XX:+PrintClassHistogramAfterFullGC -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dremio -Dio.netty.maxDirectMemory=0 -Dio.netty.tryReflectionSetAccessible=true -DMAPR_IMPALA_RA_THROTTLE -DMAPR_MAX_RA_STREAMS=400 -XX:+UseG1GC -cp /opt/dremio/conf:/opt/dremio/jars/*:/opt/dremio/jars/ext/*:/opt/dremio/jars/3rdparty/* com.dremio.dac.daemon.DremioDaemon
root 44824 43583 0 10:52 pts/0 00:00:00 grep --color=auto dremio
```

Here’s a snapshot of CPU and memory when running a query.

ExecutionSetupException: One or more nodes lost connectivity during query. Identified nodes were [vdw-executor01:0]

@kylevorster

That helps. Can you please send the `ps -ef | grep dremio` output from one of the executors in the old cluster?

Also, on the new cluster, would you be able to reproduce the problem and grab the files server.gc, server.gc.1, server.gc.2, and server.log from the 2 executors?

Thanks
Bali
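
For reference, a minimal sketch for collecting those files on each executor; the `/var/log/dremio` path is taken from the `ps` output above, so adjust it if your install differs:

```
# Bundle the GC logs and server.log for upload, tagged with the hostname.
tar czf /tmp/$(hostname)-dremio-logs.tar.gz \
    /var/log/dremio/server.gc* /var/log/dremio/server.log*
```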

Hey,

#datastore-coordinator

```
[root@datastore-coordinator ~]# ps -ef | grep dremio
dremio 8482 1 5 Jan14 ? 15:58:55 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64/jre/bin/java -Djava.util.logging.config.class=org.slf4j.bridge.SLF4JBridgeHandler -Djava.library.path=/opt/dremio/lib -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/dremio/server.gc -Ddremio.log.path=/var/log/dremio -Ddremio.plugins.path=/opt/dremio/plugins -Xmx4096m -XX:MaxDirectMemorySize=8192m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dremio -Dio.netty.maxDirectMemory=0 -DMAPR_IMPALA_RA_THROTTLE -DMAPR_MAX_RA_STREAMS=400 -cp /etc/dremio:/opt/dremio/jars/*:/opt/dremio/jars/ext/*:/opt/dremio/jars/3rdparty/* com.dremio.dac.daemon.DremioDaemon
```

#datastore-executor01

```
[root@datastore-executor01 ~]# ps -ef | grep dremio
dremio 6458 1 23 Jan14 ? 2-19:51:15 /usr/lib/jvm/java-1.8.0-openjdk-1.8.0.262.b10-0.el7_8.x86_64/jre/bin/java -Djava.util.logging.config.class=org.slf4j.bridge.SLF4JBridgeHandler -Djava.library.path=/opt/dremio/lib -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/dremio/server.gc -Ddremio.log.path=/var/log/dremio -Ddremio.plugins.path=/opt/dremio/plugins -Xmx4096m -XX:MaxDirectMemorySize=8192m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/log/dremio -Dio.netty.maxDirectMemory=0 -DMAPR_IMPALA_RA_THROTTLE -DMAPR_MAX_RA_STREAMS=400 -cp /etc/dremio:/opt/dremio/jars/*:/opt/dremio/jars/ext/*:/opt/dremio/jars/3rdparty/* com.dremio.dac.daemon.DremioDaemon
```

I’m sending you the logs directly; I don’t want the data to be public.

@kylevorster

Sorry if I asked this before: is this the only job running? Will check the logs and respond.

Yup, it’s the only query on the new cluster; no one else has access to it.

@kylevorster

That is really strange. Can you check one last thing? Look in /var/log/messages for anything suspicious. BTW, did you send me the server.log from the executor after the error happened?

Yeah, I deleted all those logs, then rebooted and ran the query again and got the error. Will check the messages log now.

I noticed these entries in /var/log/messages:

```
Jan 28, 2021 @ 10:50:23.000 # A fatal error has been detected by the Java Runtime Environment:
Jan 28, 2021 @ 10:50:23.000 #
Jan 28, 2021 @ 10:50:23.000 # SIGILL (0x4) at pc=0x00007fad041aeb40, pid=7482, tid=0x00007facf4071700
Jan 28, 2021 @ 10:50:23.000 #
Jan 28, 2021 @ 10:50:23.000 # JRE version: OpenJDK Runtime Environment (8.0_275-b01) (build 1.8.0_275-b01)
Jan 28, 2021 @ 10:50:23.000 # Java VM: OpenJDK 64-Bit Server VM (25.275-b01 mixed mode linux-amd64 compressed oops)
Jan 28, 2021 @ 10:50:23.000 # Problematic frame:
Jan 28, 2021 @ 10:50:23.000 # C 0x00007fad041aeb40
Jan 28, 2021 @ 10:50:23.000 #
Jan 28, 2021 @ 10:50:23.000 # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
Jan 28, 2021 @ 10:50:23.000 #
Jan 28, 2021 @ 10:50:23.000 # An error report file with more information is saved as:
Jan 28, 2021 @ 10:50:23.000 # /tmp/hs_err_pid7482.log
```

Will send you that dump file via PM.
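
A side note on that crash: SIGILL means the process executed an instruction the CPU refused to run, which on virtualized hosts often points at natively compiled code (Dremio compiles some expressions natively via Gandiva) assuming SIMD features the VM doesn't expose. A minimal check sketch; the flag list is illustrative, not exhaustive:

```
# List which SIMD feature flags the executor's (virtual) CPU advertises.
grep -o 'sse4_2\|avx2\|avx512[a-z]*\|avx' /proc/cpuinfo | sort -u

# The hs_err report also notes core dumps are disabled; raise the limit before
# starting Dremio to capture one on the next crash (or set LimitCORE=infinity
# in the systemd unit, if Dremio runs under systemd).
ulimit -c unlimited
```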

It looks like it’s my case statement (h.billingcycle is a VARCHAR):

```
case
    when h.billingcycle = 'Monthly' then 30
    when h.billingcycle = 'Semi-Annually' then 182
    when h.billingcycle = 'Quarterly' then 91
    when h.billingcycle = 'onetime' then 365
    when h.billingcycle = 'One Time' then 365
    when h.billingcycle = 'Free Account' then 0
    when h.billingcycle = 'semiannually' then 182
    when h.billingcycle = 'Annually' then 365
    when h.billingcycle = 'Biennially' then 730
    when h.billingcycle = 'Triennially' then 1095
end as billing_days,
```

Thanks for that information @kylevorster, we will try to reproduce it in-house and get back to you.

Thanks @balaji.ramaswamy

I did another test to see if it’s the case condition. I exported the data from the old cluster using a query without any filters (`select * from table`) and downloaded the CSV data for all three tables I’m joining.

I uploaded that to a NAS share mounted on all three nodes of the new cluster and built the same query, but using the CSV files as data sources. This worked fine without any issues.

I did another test doing the same CSV download from the new cluster by running `select * from table`, then uploading all the CSV files, and that worked as well.

I’ll do some more testing on different data sources. The one giving issues at the moment is MySQL. I’ve got MSSQL, PostgreSQL and other MySQL sources.

Will let you know. But for now, everything I’ve tried using this data source with case conditions returns the error reported in this thread.

I just tried MSSQL and PostgreSQL, and both connections give me the same error.

If I query a table on those sources with just a plain `select * from table`, it works without any issues, but any query that’s a bit more complicated (joins, conditions, casting) gives the same error.

This is super strange. I’ve tried everything I could think of.

  • Install a new cluster and copy the old cluster’s data over, then run queries - same error
  • Install a new cluster and set up just the connections/sources, then run queries - same error

Strange thing is, I tried upgrading my old cluster before setting up this new one, and after upgrading from 4.1.0 it started giving me that error.

@balaji.ramaswamy any news for me? Don’t want to sound pushy.

@kylevorster

Not yet; we would have to first identify the cause and then post a patch. Will update here when we fix it and it is available as part of a release.

Thanks
Bali