Zookeeper failing to start

Hello community,

I am trying to deploy Dremio using Helm on an EKS cluster, but ZooKeeper fails to start most of the time. Below are my configs.
Zookeeper replicas = 3
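For context, I am installing it roughly like this (the release name, namespace, and the zookeeper.count value key are from my own setup, so treat this as a sketch of my install rather than the chart defaults):

git clone https://github.com/dremio/dremio-cloud-tools.git
# zookeeper.count is the key my values file uses for the ZK replica count; adjust to your values.yaml.
helm install dremio ./dremio-cloud-tools/charts/dremio_v2 --namespace dremio --create-namespace --set zookeeper.count=3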
Zookeeper logs:
zk-0.log:
2020-12-12 09:20:32,948 [myid:] - WARN [main:QuorumPeer$QuorumServer@173] - Failed to resolve address: zk-2.zk-hs.dremio.svc.cluster.local
java.net.UnknownHostException: zk-2.zk-hs.dremio.svc.cluster.local: Temporary failure in name resolution
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)
at org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:166)
at org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.<init>(QuorumPeer.java:151)
at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parseProperties(QuorumPeerConfig.java:238)
at org.apache.zookeeper.server.quorum.QuorumPeerConfig.parse(QuorumPeerConfig.java:150)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.initializeAndRun(QuorumPeerMain.java:101)
at org.apache.zookeeper.server.quorum.QuorumPeerMain.main(QuorumPeerMain.java:78)
zk-1.log:
2020-12-12 09:21:04,659 [myid:2] - WARN [WorkerSender[myid=2]:QuorumPeer$QuorumServer@173] - Failed to resolve address: zk-2.zk-hs.dremio.svc.cluster.local
java.net.UnknownHostException: zk-2.zk-hs.dremio.svc.cluster.local: Temporary failure in name resolution
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)
at org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:166)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:595)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:538)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:452)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:433)
at java.lang.Thread.run(Thread.java:748)
zk-2.log:
2020-12-12 09:22:12,837 [myid:3] - WARN [WorkerSender[myid=3]:QuorumPeer$QuorumServer@173] - Failed to resolve address: zk-1.zk-hs.dremio.svc.cluster.local
java.net.UnknownHostException: zk-1.zk-hs.dremio.svc.cluster.local: Temporary failure in name resolution
at java.net.Inet4AddressImpl.lookupAllHostAddr(Native Method)
at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:928)
at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1323)
at java.net.InetAddress.getAllByName0(InetAddress.java:1276)
at java.net.InetAddress.getAllByName(InetAddress.java:1192)
at java.net.InetAddress.getAllByName(InetAddress.java:1126)
at java.net.InetAddress.getByName(InetAddress.java:1076)
at org.apache.zookeeper.server.quorum.QuorumPeer$QuorumServer.recreateSocketAddresses(QuorumPeer.java:166)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:595)
at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:538)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:452)
at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:433)
at java.lang.Thread.run(Thread.java:748)

And the interesting part is that sometimes ZooKeeper starts without any issues.

Hi @shibily

I see you are running into a name resolution issue; see the error below. Are we sure this is not a DNS issue? Are the IP and hostname of the ZK hosts resolvable from the Dremio coordinator?

Try using the IP addresses of the ZooKeeper hosts in dremio.conf.

2020-12-12 09:20:32,948 [myid:] - WARN [main:QuorumPeer$QuorumServer@173] - Failed to resolve address: zk-2.zk-hs.dremio.svc.cluster.local
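For example, something like this to find the current ZK pod IPs to put into dremio.conf (the IPs in the comment are placeholders, not values from your cluster):

# List the ZooKeeper pods with their IPs (no label assumptions, just grep the pod names).
kubectl -n dremio get pods -o wide | grep '^zk-'
# Then, as a rough example, point the zookeeper setting in dremio.conf at those IPs instead of the hostnames:
#   zookeeper: "10.0.1.11:2181,10.0.1.12:2181,10.0.1.13:2181"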

Hi buddy,

I am trying to deploy this using the official Dremio helm charts (which I believe are stable), and I don't really want to get into those configs as I am a beginner here.

Confirming again, is the helm chart stable?

@shibily

The helm charts are stable. The reason it is failing is that Dremio is unable to resolve “zk-2.zk-hs.dremio.svc.cluster.local”.

Are you able to ping “zk-2.zk-hs.dremio.svc.cluster.local” from the Dremio coordinator pod?
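Something along these lines from the coordinator pod would confirm it (the pod name here is just an example, use whatever your coordinator pod is actually called):

# Check name resolution and reachability from inside the coordinator pod.
kubectl -n dremio exec -it dremio-master-0 -- getent hosts zk-2.zk-hs.dremio.svc.cluster.local
# If ping is available in the image:
kubectl -n dremio exec -it dremio-master-0 -- ping -c 3 zk-2.zk-hs.dremio.svc.cluster.local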

@balaji.ramaswamy No, pinging “zk-2.zk-hs.dremio.svc.cluster.local” fails. It’s strange that ZooKeeper starts without any problems 6 out of 10 times I provision it.

Also, coordinator replicas are set to 0 by default in the helm chart; I have linked the helm chart URL below.

@shibily

Dremio is unable to resolve the ZK hostname. Is there an associated IP you are able to ping? Maybe try that instead of the hostname.
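To see which IPs should be behind that hostname, something like this should work:

# Pod IPs of the ZooKeeper StatefulSet members.
kubectl -n dremio get pods -o wide | grep '^zk-'
# Endpoints registered behind the headless service the per-pod hostnames resolve through.
kubectl -n dremio get endpoints zk-hs -o wide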

@balaji.ramaswamy No, still the same issue.

PS: This is my new account as I had changed my organization.

@shibilys

Can you please do an nslookup on zk-2.zk-hs.dremio.svc.cluster.local, zk-0.zk-hs.dremio.svc.cluster.local and zk-1.zk-hs.dremio.svc.cluster.local and see if they return the same result? I am wondering why you are getting:

2020-12-12 09:20:32,948 [myid:] - WARN [main:QuorumPeer$QuorumServer@173] - Failed to resolve address: zk-2.zk-hs.dremio.svc.cluster.local
java.net.UnknownHostException: zk-2.zk-hs.dremio.svc.cluster.local: Temporary failure in name resolution
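For example, from a throwaway pod in the same namespace (busybox:1.28 is just a convenient image whose nslookup behaves well in Kubernetes):

# Run the three lookups from a temporary pod inside the dremio namespace, then clean it up.
kubectl -n dremio run dns-test --rm -it --restart=Never --image=busybox:1.28 -- sh -c 'for h in zk-0 zk-1 zk-2; do nslookup $h.zk-hs.dremio.svc.cluster.local; done'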

@balaji.ramaswamy It is returning the same result. Can you try to reproduce it by deploying the helm chart? dremio-cloud-tools/charts/dremio_v2 at master · dremio/dremio-cloud-tools · GitHub

@shibilys We have many customers using our helm charts, so let’s do this: can we start with your helm charts? Are you able to provide us with yours?
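If it helps, the values your release was actually installed with can be exported with something like this (release name and namespace are guesses on my side):

# User-supplied values only:
helm -n dremio get values dremio
# Or include the chart defaults as well:
helm -n dremio get values dremio --all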

@balaji.ramaswamy Thank you for your support. I figured out that the pods on my worker nodes had no access outside the node because of an AWS security group rule. Everything is working normally now.
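For anyone who hits the same thing: cluster DNS lookups go from the pod to CoreDNS, which may be running on another node, so the worker-node security group has to allow that traffic. Roughly how you can check (the security group ID is a placeholder):

# Verify DNS works from inside the namespace once the rule is fixed.
kubectl -n dremio run dns-test --rm -it --restart=Never --image=busybox:1.28 -- nslookup zk-2.zk-hs.dremio.svc.cluster.local
# Review the worker-node security group for missing inter-node rules (DNS on 53 tcp/udp, ZK on 2181/2888/3888).
aws ec2 describe-security-groups --group-ids sg-0123456789abcdef0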

Thanks again <3