Restarting the Dremio service makes the new master election fail

Hello there,

I have two Dremio master nodes for high availability, but when I restart the service on the primary master node, the second master node (in standby) fails while trying to take over as the new coordinator.

dremio.conf:

paths: {
  # the local path for dremio to store data.
  local: "/mnt/dremio-metadata"

  # the distributed path Dremio data including job results, downloads, uploads, etc
  dist: "s3a://vlr-dremio4-prd/dremio-storage/"
  #dist: "pdfs://"${paths.local}"/pdfs"
}

services: {
  coordinator.enabled: true,
  coordinator.master.enabled: true,
  executor.enabled: false,
  coordinator.master.embedded-zookeeper.enabled: true
}

Server log:
server.log.zip (4.9 KB)

@caiounderscore

I see that your secondary coordinator is unable to talk to ZooKeeper (see the log below). Are you able to ping the ZooKeeper host and telnet to it on the configured port from the secondary coordinator?

2019-11-21 05:12:48,406 [zk-curator-2] INFO c.d.s.coordinator.zk.ZKClusterClient - Not able to get election status in 60000ms. Cancelling election…
2019-11-21 05:12:48,407 [zk-curator-2] ERROR ROOT - Dremio is exiting. Node lost its master status.
2019-11-21 05:12:55,713 [main] INFO com.dremio.common.config.SabotConfig - Configuration and plugin file(s) identified in 73ms.
Base Configuration:

@balaji.ramaswamy

I don’t use an external ZooKeeper; I use the embedded ZooKeeper. In other words, my secondary coordinator will itself be the ZooKeeper.

After the secondary coordinator fails while trying to take over as the new coordinator, the first coordinator becomes master again. However, if I stop the Dremio service on the first coordinator or reboot its machine, the secondary master takes over as the new coordinator without problems.

The main problem is that if I restart the Dremio service on the primary master, the secondary master fails when trying to take over as the new coordinator.

@caiounderscore

An external ZooKeeper (ZK) is a requirement for HA.

https://docs.dremio.com/advanced-administration/high-availability.html
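
As a rough sketch only, the services section from the dremio.conf posted above could be changed along these lines on both coordinator nodes to point at an external ZooKeeper quorum. The zk1/zk2/zk3 hostnames and port 2181 are placeholders, not values from this thread, and the key names should be checked against the HA documentation linked above for your Dremio version.

```
services: {
  coordinator.enabled: true,
  coordinator.master.enabled: true,
  executor.enabled: false,
  # turn off the ZooKeeper embedded in the master coordinator
  coordinator.master.embedded-zookeeper.enabled: false
}

# external ZooKeeper quorum (placeholder hosts, replace with your own)
zookeeper: "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181"
```

With the embedded ZooKeeper disabled, the election state lives outside the master process, so restarting the primary should no longer take ZooKeeper down with it.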

Thanks