Dremio service script not working

Hi,
I am trying to enable dremio to start at boot.
Followed the steps in https://docs.dremio.com/deployment/standalone/standalone-tarball.html
I did a tar ball installation. My setup is to use 1 coordinator and 3 worker nodes.
I can start frm the dremio script under /bi and start the application in each node. And it works fine. However while using service dremio start, its not starting the service.
Here is my config. Note that i am trying to use root user. This is from /etc/systemd/system/dremio.service
[Unit]
Description=Dremio Daemon Server
Documentation=
Wants=network-online.target
After=network-online.target

[Service]
User=root
Group=root
RuntimeDirectory=dremio
ExecStart=/opt/dremio/bin/dremio start
Restart=on-failure

Set maximum number of file descriptors Dremio can open.

Dremio requires a large number to work correctly.

LimitNOFILE=65536

[Install]
WantedBy=multi-user.target

And my dremio.conf is

services: {
coordinator.enabled: false,
coordinator.master.enabled: false,
executor.enabled: true,
coordinator.master.embedded-zookeeper.enabled: false
}

zookeeper: “xxx:2181,xxx:2181,xxx:2181,xxx:2181”

Is it same for all nodes ?

service dremio status output
Redirecting to /bin/systemctl status dremio.service
● dremio.service - Dremio Daemon Server
Loaded: loaded (/etc/systemd/system/dremio.service; enabled; vendor preset: disabled)
Active: inactive (dead) since Mon 2019-09-09 11:51:05 EDT; 17min ago
Docs:
Main PID: 2402 (code=exited, status=0/SUCCESS)

Sep 09 11:51:04 [1]: Started Dremio Daemon Server.
Sep 09 11:51:04 [2402]: Starting dremio, logging to /opt/dremio/log/server.out
Sep 09 11:51:05 dremio[2402]: Check the log file to make sure dremio started successfully and did not run into any issues.

and my server.out
15:59:19,974 |-INFO in ch.qos.logback.core.rolling.RollingFileAppender[null] - Active log file name: /opt/dremio/log/hive.deprecated.function.warning.log
15:59:19,974 |-INFO in ch.qos.logback.core.rolling.RollingFileAppender[null] - File property is set to [/opt/dremio/log/hive.deprecated.function.warning.log]
15:59:19,974 |-INFO in ch.qos.logback.classic.jul.LevelChangePropagator@26e356f0 - Propagating ERROR level on Logger[ROOT] onto the JUL framework
15:59:19,974 |-INFO in ch.qos.logback.classic.joran.action.LevelAction - ROOT level set to ERROR
15:59:19,977 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [text] to Logger[ROOT]
15:59:19,977 |-INFO in ch.qos.logback.core.joran.action.AppenderRefAction - Attaching appender named [json] to Logger[ROOT]
15:59:19,977 |-INFO in ch.qos.logback.classic.joran.action.ConfigurationAction - End of configuration.
15:59:19,979 |-INFO in ch.qos.logback.classic.joran.JoranConfigurator@2dc54ad4 - Registering current configuration as safe fallback point

Can anyone help me with this?

@mmohan

If you cd to the bin folder and try “./dremio start” , does it start.? Also in the above case have you actually validated logging on to the UI?

What about the server.log? Does it have any errors or it says “Started on localhost”?

Hi Balaji,
Yes from the ./dremio it starts. I was able to make some progress on this. i added pid file location in file /etc/systemd/system/dremio.service as PIDFile=/opt/dremio/run/dremio.pid after which service script started showing result. Next issue i faced was. when i start fomr service script, ./dremio stop from bin directory was not working. its not ale to find the pid file location. I haven’t made any change in any config files or the scripts.
Next i added startup script to /etc/init.d. I can see only the edge node (coordinator) start ups. other nodes are not starting.
After restart i am seeing that service dremio is not working. not showing any output and freeze execution.

Thanks
Mahesh

@mmohan

Need more clarity.

Are you saying your coordinator is up and executors are failing due to PID. Have you checked the permissions on the PID folder. What is the exact error you are getting?

Hi Balaji,
I made to start the service as systemctl start dremio. I can see an error in log as
Dremio is exiting. Failure while starting services.
com.dremio.service.namespace.RemoteNamespaceException: dataset listing failed: com.dremio.common.exceptions.UserRemoteException: SYSTEM ERROR: NullPointerException: Unknown protocol 57

[Error Id: acb80465-7b85-414f-ab36-43d10711c323 on :0]

(com.dremio.exec.rpc.UserRpcException) java.lang.NullPointerException: Unknown protocol 57
com.dremio.exec.rpc.RpcBus$RequestEvent.run():466
com.dremio.common.SerializedExecutor$RunnableProcessor.run():96
com.dremio.exec.rpc.RpcBus$SameExecutor.execute():328
com.dremio.common.SerializedExecutor.execute():129
com.dremio.exec.rpc.RpcBus$InboundHandler.decode():362
com.dremio.exec.rpc.RpcBus$InboundHandler.decode():333
io.netty.handler.codec.MessageToMessageDecoder.channelRead():88
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead():362
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead():348
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead():340
io.netty.handler.timeout.IdleStateHandler.channelRead():286
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead():362
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead():348
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead():340
io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead():310
io.netty.handler.codec.ByteToMessageDecoder.channelRead():284
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead():362
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead():348
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead():340
io.netty.channel.ChannelInboundHandlerAdapter.channelRead():86
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead():362
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead():348
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead():340
io.netty.channel.nio.NioEventLoop.processSelectedKey():645
io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized():580
io.netty.channel.nio.NioEventLoop.processSelectedKeys():497
io.netty.channel.nio.NioEventLoop.run():459
io.netty.util.concurrent.SingleThreadEventExecutor$5.run():886
java.lang.Thread.run():748
Caused By (java.lang.NullPointerException) Unknown protocol 57
com.google.common.base.Preconditions.checkNotNull():787
com.dremio.services.fabric.FabricMessageHandler.getProtocol():130
com.dremio.services.fabric.FabricMessageHandler.handle():56
com.dremio.services.fabric.FabricServer.handle():84
com.dremio.services.fabric.FabricServer.handle():39
com.dremio.exec.rpc.RpcBus$RequestEvent.run():462
com.dremio.common.SerializedExecutor$RunnableProcessor.run():96
io.netty.channel.nio.NioEventLoop.run():459
io.netty.util.concurrent.SingleThreadEventExecutor$5.run():886
java.lang.Thread.run():748

    at com.dremio.service.listing.DatasetListingInvoker.getSources(DatasetListingInvoker.java:321)
    at com.dremio.exec.catalog.PluginsManager.start(PluginsManager.java:148)
    at com.dremio.exec.catalog.CatalogServiceImpl.start(CatalogServiceImpl.java:162)
    at com.dremio.service.SingletonRegistry$AbstractServiceReference.start(SingletonRegistry.java:137)
    at com.dremio.dac.daemon.NonMasterSingletonRegistry.start(NonMasterSingletonRegistry.java:54)
    at com.dremio.dac.daemon.DACDaemon.startServices(DACDaemon.java:184)
    at com.dremio.dac.daemon.DACDaemon.init(DACDaemon.java:190)
    at com.dremio.dac.daemon.DremioDaemon.main(DremioDaemon.java:104)

@mmohan

Usually this happens when the executor and coordinator are out of sync, lets do this

Shutdown the executors
Shutdown the coordinator
On the coordinator do a ps -ef | grep dremio, should not return anything (please paste output in reply)
On the executor do a ps -ef | grep dremio, should not return anything (please paste output in reply)

If you see any orphan dremio process, kill them

Start coordinator
Start Executors

@balaji.ramaswamy, i tried the above steps and there were no orphan process after killing existing one. In fact the executors were not started at all. I guess the startup process may cause issue. I am starting cluster from a jenkins job and it starts all nodes at same time. In case if dremio startup in each node is executing in parallel and executor is not finding coordinator, will it be a problem?

The executors can come up in any order, but coordinator needs to be up first

Ok Thanks, Let me try adding a sleep in the executor startup script and see how it goes.

Hi @balaji.ramaswamy, Issue is solved now after adding a loop ro check for zookeeper. Here are the things i did.
added a sleep step and restarted cluster. This is to make sure the coordinator is up before workers start. This didnt help. I was using the already installed zookeeper. I tried adding a infinite loop to wait until zookeeper is up and this solved the issue. Thanks for all suggestions and guidance.