Zookeeper failing to start

Hello community,

I am trying to deploy Dremio using Helm on an EKS cluster, but ZooKeeper fails to start most of the time. Below are my configs.
Zookeeper replicas = 3
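For context, I am installing it roughly like this (the release name, namespace, and the zookeeper.count value key are from my own setup, so treat this as a sketch of my install rather than the chart defaults):

git clone https://github.com/dremio/dremio-cloud-tools.git
# zookeeper.count is the key my values file uses for the ZK replica count; adjust to your values.yaml.
helm install dremio ./dremio-cloud-tools/charts/dremio_v2 --namespace dremio --create-namespace --set zookeeper.count=3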
Zookeeper logs:
zk-0.log:
2020-12-12 09:20:32,948 [myid:] - WARN [main:QuorumPeer$QuorumServer@173] - Failed to resolve address: zk-2.zk-hs.dremio.svc.cluster.local
java.net.UnknownHostException: zk-2.zk-hs.dremio.svc.cluster.local: Temporary failure in name resolution
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)
at org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:166)
at org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.<init>(QuorumPeer.java:151)
at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parseProperties(QuorumPeerConfig.java:238)
at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parse(QuorumPeerConfig.java:150)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:101)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
zk-1.log:
2020-12-12 09:21:04,659 [myid:2] - WARN [WorkerSender[myid=2]:QuorumPeer$QuorumServer@173] - Failed to resolve address: zk-2.zk-hs.dremio.svc.cluster.local
java.net.UnknownHostException: zk-2.zk-hs.dremio.svc.cluster.local: Temporary failure in name resolution
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)
at org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:166)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:595)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:538)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:452)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:433)
at java.lang.Thread.run(Thread.java:748)
zk-2.log:
2020-12-12 09:22:12,837 [myid:3] - WARN [WorkerSender[myid=3]:QuorumPeer$QuorumServer@173] - Failed to resolve address: zk-1.zk-hs.dremio.svc.cluster.local
java.net.UnknownHostException: zk-1.zk-hs.dremio.svc.cluster.local: Temporary failure in name resolution
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)
at org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:166)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:595)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:538)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:452)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:433)
at java.lang.Thread.run(Thread.java:748)

And the interesting part is that sometimes ZooKeeper starts without any issues.

Hi @shibily

I see you are running into a name resolution issue; see the error below. Are we sure this is not a DNS issue? Are the IP and hostname of the ZK hosts resolvable from the Dremio coordinator?

Try using the IP addresses of the ZooKeeper hosts in dremio.conf.

2020-12-12 09:20:32,948 [myid:] - WARN [main:QuorumPeer$QuorumServer@173] - Failed to resolve address: zk-2.zk-hs.dremio.svc.cluster.local
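For example, something like this to find the current ZK pod IPs to put into dremio.conf (the IPs in the comment are placeholders, not values from your cluster):

# List the ZooKeeper pods with their IPs (no label assumptions, just grep the pod names).
kubectl -n dremio get pods -o wide | grep '^zk-'
# Then, as a rough example, point the zookeeper setting in dremio.conf at those IPs instead of the hostnames:
#   zookeeper: "10.0.1.11:2181,10.0.1.12:2181,10.0.1.13:2181"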

Hi buddy,

I am trying to deploy this using the official Dremio helm charts (which I believe are stable), and I don't really want to get into those configs as I am a beginner here.

Confirming again, is the helm chart stable?

@shibily

The helm charts are stable. The reason it is failing is that Dremio is unable to resolve “zk-2.zk-hs.dremio.svc.cluster.local”.

Are you able to ping “zk-2.zk-hs.dremio.svc.cluster.local” from the Dremio coordinator pod?
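Something along these lines from the coordinator pod would confirm it (the pod name here is just an example, use whatever your coordinator pod is actually called):

# Check name resolution and reachability from inside the coordinator pod.
kubectl -n dremio exec -it dremio-master-0 -- getent hosts zk-2.zk-hs.dremio.svc.cluster.local
# If ping is available in the image:
kubectl -n dremio exec -it dremio-master-0 -- ping -c 3 zk-2.zk-hs.dremio.svc.cluster.local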

@balaji.ramaswamy No, pinging “zk-2.zk-hs.dremio.svc.cluster.local” fails. It’s strange that ZooKeeper starts without any problems 6 out of 10 times I provision it.

Also, coordinator replicas are set to 0 by default in the helm chart; I have linked the helm chart URL below.

@shibily

Dremio is unable to resolve the ZK hostname. Is there an associated IP you are able to ping? Maybe try that instead of the hostname.
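To see which IPs should be behind that hostname, something like this should work:

# Pod IPs of the ZooKeeper StatefulSet members.
kubectl -n dremio get pods -o wide | grep '^zk-'
# Endpoints registered behind the headless service the per-pod hostnames resolve through.
kubectl -n dremio get endpoints zk-hs -o wide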

@balaji.ramaswamy No, still the same issue.

PS: This is my new account as I had changed my organization.

@shibilys

Can you please do an nslookup on zk-2.zk-hs.dremio.svc.cluster.local, zk-0.zk-hs.dremio.svc.cluster.local and zk-1.zk-hs.dremio.svc.cluster.local and see if they return the same result? I am wondering why you are getting:

2020-12-12 09:20:32,948 [myid:] - WARN [main:QuorumPeer$QuorumServer@173] - Failed to resolve address: zk-2.zk-hs.dremio.svc.cluster.local
java.net.UnknownHostException: zk-2.zk-hs.dremio.svc.cluster.local: Temporary failure in name resolution
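For example, from a throwaway pod in the same namespace (busybox:1.28 is just a convenient image whose nslookup behaves well in Kubernetes):

# Run the three lookups from a temporary pod inside the dremio namespace, then clean it up.
kubectl -n dremio run dns-test --rm -it --restart=Never --image=busybox:1.28 -- sh -c 'for h in zk-0 zk-1 zk-2; do nslookup $h.zk-hs.dremio.svc.cluster.local; done'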

@balaji.ramaswamy It is returning the same result. Can you try to reproduce it by deploying the helm chart? dremio-cloud-tools/charts/dremio_v2 at master · dremio/dremio-cloud-tools · GitHub

@shibilys We have many customers using our helm charts, so let’s do this: can we start with your helm charts? Are you able to provide us with yours?
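If it helps, the values your release was actually installed with can be exported with something like this (release name and namespace are guesses on my side):

# User-supplied values only:
helm -n dremio get values dremio
# Or include the chart defaults as well:
helm -n dremio get values dremio --all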

@balaji.ramaswamy Thank you for your support. I figured out that the pods on my worker nodes had no access outside the node because of an AWS security group rule. Everything is working normally now.
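For anyone who hits the same thing: cluster DNS lookups go from the pod to CoreDNS, which may be running on another node, so the worker-node security group has to allow that traffic. Roughly how you can check (the security group ID is a placeholder):

# Verify DNS works from inside the namespace once the rule is fixed.
kubectl -n dremio run dns-test --rm -it --restart=Never --image=busybox:1.28 -- nslookup zk-2.zk-hs.dremio.svc.cluster.local
# Review the worker-node security group for missing inter-node rules (DNS on 53 tcp/udp, ZK on 2181/2888/3888).
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0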

Thanks again <3