New Dremio 4.7.2 installation not starting up

I’m trying to build a new Dremio installation in our DEV environment on OpenShift and it is failing to pass the dremio upgrade step…

I started a dremio-admin pod and I’m seeing the following errors:

dremio@dremio-admin:/opt/dremio/bin$ ./dremio-admin clean
Lock file to RocksDB is currently held by another process.  Stop other process before retrying.
Failed to complete cleanup.
org.rocksdb.RocksDBException: While lock file: /opt/dremio/data/db/catalog/LOCK: No locks available
        Suppressed: java.lang.IllegalStateException: #start was not invoked, so metadataManager is not available

Digging deeper I’m seeing RocksDB issues:

dremio@dremio-admin:/opt/dremio/bin$ cat /opt/dremio/data/db/catalog/LOG
2020/09/24-21:44:55.014406 7fc7dfddb700 RocksDB version: 5.14.2
2020/09/24-21:44:55.014470 7fc7dfddb700 Git sha rocksdb_build_git_sha:
2020/09/24-21:44:55.014475 7fc7dfddb700 Compile date Jul  6 2018
2020/09/24-21:44:55.014483 7fc7dfddb700 DB SUMMARY
2020/09/24-21:44:55.014958 7fc7dfddb700 SST files in /opt/dremio/data/db/catalog dir, Total Num: 0, files:
2020/09/24-21:44:55.014968 7fc7dfddb700 Write Ahead Log file in /opt/dremio/data/db/catalog:
2020/09/24-21:44:55.014972 7fc7dfddb700                         Options.error_if_exists: 0
2020/09/24-21:44:55.014977 7fc7dfddb700                       Options.create_if_missing: 1
2020/09/24-21:44:55.014979 7fc7dfddb700                         Options.paranoid_checks: 1
2020/09/24-21:44:55.014981 7fc7dfddb700                                     Options.env: 0x7fc7c37fb060
2020/09/24-21:44:55.014983 7fc7dfddb700                                Options.info_log: 0x7fc7da110ab0
2020/09/24-21:44:55.014984 7fc7dfddb700                Options.max_file_opening_threads: 16
2020/09/24-21:44:55.014986 7fc7dfddb700                              Options.statistics: (nil)
2020/09/24-21:44:55.014988 7fc7dfddb700                               Options.use_fsync: 0
2020/09/24-21:44:55.014989 7fc7dfddb700                       Options.max_log_file_size: 0
2020/09/24-21:44:55.014991 7fc7dfddb700                  Options.max_manifest_file_size: 18446744073709551615
2020/09/24-21:44:55.014992 7fc7dfddb700                   Options.log_file_time_to_roll: 0
2020/09/24-21:44:55.014994 7fc7dfddb700                       Options.keep_log_file_num: 1000
2020/09/24-21:44:55.014995 7fc7dfddb700                    Options.recycle_log_file_num: 0
2020/09/24-21:44:55.014997 7fc7dfddb700                         Options.allow_fallocate: 1
2020/09/24-21:44:55.014998 7fc7dfddb700                        Options.allow_mmap_reads: 0
2020/09/24-21:44:55.015000 7fc7dfddb700                       Options.allow_mmap_writes: 0
2020/09/24-21:44:55.015001 7fc7dfddb700                        Options.use_direct_reads: 0
2020/09/24-21:44:55.015002 7fc7dfddb700                        Options.use_direct_io_for_flush_and_compaction: 0
2020/09/24-21:44:55.015004 7fc7dfddb700          Options.create_missing_column_families: 0
2020/09/24-21:44:55.015005 7fc7dfddb700                              Options.db_log_dir:
2020/09/24-21:44:55.015007 7fc7dfddb700                                 Options.wal_dir: /opt/dremio/data/db/catalog
2020/09/24-21:44:55.015008 7fc7dfddb700                Options.table_cache_numshardbits: 6
2020/09/24-21:44:55.015010 7fc7dfddb700                      Options.max_subcompactions: 1
2020/09/24-21:44:55.015011 7fc7dfddb700                  Options.max_background_flushes: -1
2020/09/24-21:44:55.015013 7fc7dfddb700                         Options.WAL_ttl_seconds: 300
2020/09/24-21:44:55.015014 7fc7dfddb700                       Options.WAL_size_limit_MB: 0
2020/09/24-21:44:55.015016 7fc7dfddb700             Options.manifest_preallocation_size: 4194304
2020/09/24-21:44:55.015017 7fc7dfddb700                     Options.is_fd_close_on_exec: 1
2020/09/24-21:44:55.015019 7fc7dfddb700                   Options.advise_random_on_open: 1
2020/09/24-21:44:55.015020 7fc7dfddb700                    Options.db_write_buffer_size: 0
2020/09/24-21:44:55.015022 7fc7dfddb700                    Options.write_buffer_manager: 0x7fc7da10fdc0
2020/09/24-21:44:55.015023 7fc7dfddb700         Options.access_hint_on_compaction_start: 1
2020/09/24-21:44:55.015025 7fc7dfddb700  Options.new_table_reader_for_compaction_inputs: 0
2020/09/24-21:44:55.015026 7fc7dfddb700           Options.random_access_max_buffer_size: 1048576
2020/09/24-21:44:55.015030 7fc7dfddb700                      Options.use_adaptive_mutex: 0
2020/09/24-21:44:55.015032 7fc7dfddb700                            Options.rate_limiter: (nil)
2020/09/24-21:44:55.015035 7fc7dfddb700     Options.sst_file_manager.rate_bytes_per_sec: 0
2020/09/24-21:44:55.015036 7fc7dfddb700                       Options.wal_recovery_mode: 2
2020/09/24-21:44:55.015042 7fc7dfddb700                  Options.enable_thread_tracking: 0
2020/09/24-21:44:55.015044 7fc7dfddb700                  Options.enable_pipelined_write: 0
2020/09/24-21:44:55.015045 7fc7dfddb700         Options.allow_concurrent_memtable_write: 1
2020/09/24-21:44:55.015047 7fc7dfddb700      Options.enable_write_thread_adaptive_yield: 1
2020/09/24-21:44:55.015048 7fc7dfddb700             Options.write_thread_max_yield_usec: 100
2020/09/24-21:44:55.015050 7fc7dfddb700            Options.write_thread_slow_yield_usec: 3
2020/09/24-21:44:55.015051 7fc7dfddb700                               Options.row_cache: None
2020/09/24-21:44:55.015053 7fc7dfddb700                              Options.wal_filter: None
2020/09/24-21:44:55.015054 7fc7dfddb700             Options.avoid_flush_during_recovery: 0
2020/09/24-21:44:55.015056 7fc7dfddb700             Options.allow_ingest_behind: 0
2020/09/24-21:44:55.015058 7fc7dfddb700             Options.preserve_deletes: 0
2020/09/24-21:44:55.015059 7fc7dfddb700             Options.two_write_queues: 0
2020/09/24-21:44:55.015061 7fc7dfddb700             Options.manual_wal_flush: 0
2020/09/24-21:44:55.015063 7fc7dfddb700             Options.max_background_jobs: 2
2020/09/24-21:44:55.015065 7fc7dfddb700             Options.max_background_compactions: -1
2020/09/24-21:44:55.015066 7fc7dfddb700             Options.avoid_flush_during_shutdown: 0
2020/09/24-21:44:55.015068 7fc7dfddb700           Options.writable_file_max_buffer_size: 1048576
2020/09/24-21:44:55.015069 7fc7dfddb700             Options.delayed_write_rate : 16777216
2020/09/24-21:44:55.015071 7fc7dfddb700             Options.max_total_wal_size: 0
2020/09/24-21:44:55.015072 7fc7dfddb700             Options.delete_obsolete_files_period_micros: 21600000000
2020/09/24-21:44:55.015074 7fc7dfddb700                   Options.stats_dump_period_sec: 600
2020/09/24-21:44:55.015075 7fc7dfddb700                          Options.max_open_files: -1
2020/09/24-21:44:55.015077 7fc7dfddb700                          Options.bytes_per_sync: 0
2020/09/24-21:44:55.015078 7fc7dfddb700                      Options.wal_bytes_per_sync: 0
2020/09/24-21:44:55.015079 7fc7dfddb700       Options.compaction_readahead_size: 0
2020/09/24-21:44:55.015081 7fc7dfddb700 Compression algorithms supported:
2020/09/24-21:44:55.015083 7fc7dfddb700         kZSTDNotFinalCompression supported: 1
2020/09/24-21:44:55.015089 7fc7dfddb700         kZSTD supported: 1
2020/09/24-21:44:55.015091 7fc7dfddb700         kXpressCompression supported: 0
2020/09/24-21:44:55.015093 7fc7dfddb700         kLZ4HCCompression supported: 1
2020/09/24-21:44:55.015095 7fc7dfddb700         kLZ4Compression supported: 1
2020/09/24-21:44:55.015096 7fc7dfddb700         kBZip2Compression supported: 1
2020/09/24-21:44:55.015098 7fc7dfddb700         kZlibCompression supported: 1
2020/09/24-21:44:55.015100 7fc7dfddb700         kSnappyCompression supported: 1
2020/09/24-21:44:55.015106 7fc7dfddb700 Fast CRC32 supported: Not supported on x86
2020/09/24-21:44:55.049247 7fc7dfddb700 [db/db_impl.cc:252] Shutdown: canceling all background work
2020/09/24-21:44:55.049368 7fc7dfddb700 [db/db_impl.cc:398] Shutdown complete
dremio@dremio-admin:/opt/dremio/bin$                                                                                              

Seems like Fast CRS32 issue? I tried to manually delete the LOCK file, but dremio-admin just creates a new LOCK file and complains RocksDB is locked.

@david.lee

Have you tried to scale down completely, make sure no services are running, then bring up Dremio on 4.7.2. Once that is done you can again bring up admin pod and try the “clean”

What is the reason you are doing “clean” during the upgrade?

I haven’t been able to bring up Dremio even for the first time, so I started an Admin Pod to look at the error logs, etc… generated…

I tried running dremio-admin upgrade manually from the Admin Pod and got this:

dremio@dremio-admin:/opt/dremio/bin$ rm /opt/dremio/data/db/catalog/LOCK
dremio@dremio-admin:/opt/dremio/bin$ ./dremio-admin upgrade
Lock file to RocksDB is currently held by another process.  Stop other process before retrying.

Upgrade failed
org.rocksdb.RocksDBException: While lock file: /opt/dremio/data/db/catalog/LOCK: No locks available
        Suppressed: java.lang.IllegalStateException: #start was not invoked, so metadataManager is not available

I did get the initContrainer for the dremio-upgrade task working by deleting the entire /opt/dremio/data/db directory, but now I see this error when dremio starts up:

starting dremio
2020-09-25T17:11:55.405+0000: [GC pause (G1 Evacuation Pause) (young), 0.0081882 secs]
   [Parallel Time: 6.7 ms, GC Workers: 1]
      [GC Worker Start (ms):  617.3]
      [Ext Root Scanning (ms):  1.5]
.....
.....
.....
2020-09-25 17:12:13,969 [main] INFO  c.d.s.c.TaskLeaderStatusListener - TaskLeaderStatusListener for: MASTER is up
2020-09-25 17:12:13,970 [main] INFO  c.d.s.fabric.FabricServiceImpl - fabric service has 104857600 bytes reserved
2020-09-25 17:12:14,371 [main] INFO  c.d.services.fabric.FabricServer - [FABRIC]: Server started on port 45678.
2020-09-25 17:12:14,898 [main] INFO  c.d.s.conduit.server.ConduitServer - ConduitServer is up. Listening on port '45679'
2020-09-25 17:12:14,898 [main] INFO  c.d.datastore.LocalKVStoreProvider - Starting LocalKVStoreProvider
2020-09-25 17:12:15,114 [main] INFO  c.dremio.datastore.ByteStoreManager - Lock file to RocksDB is currently held by another process. Will wait until lock is freed.
Lock file to RocksDB is currently held by another process. Will wait until lock is freed.

There is nothing running other than the Master Coordinator Pod… I’ve got the executors set with 0 replicas…

It looks like it is just trying to restart and running into the RocksDB lock issue everytime…

-----/db/catalog [DEV]> ls -alt | head -10
total 35176
-rw-r--r-- 1 999 root   7191 Sep 25 13:24 LOG
drwxr-xr-x 3 999 root 339968 Sep 25 13:24 ./
-rw-r--r-- 1 999 root   7191 Sep 25 13:24 LOG.old.1601054641084924
-rw-r--r-- 1 999 root   7191 Sep 25 13:24 LOG.old.1601054640914144
-rw-r--r-- 1 999 root   7191 Sep 25 13:24 LOG.old.1601054640740870
-rw-r--r-- 1 999 root   7191 Sep 25 13:24 LOG.old.1601054640569059
-rw-r--r-- 1 999 root   7191 Sep 25 13:24 LOG.old.1601054640394098
-rw-r--r-- 1 999 root   7191 Sep 25 13:23 LOG.old.1601054640220657
-rw-r--r-- 1 999 root   7191 Sep 25 13:23 LOG.old.1601054640047373

etc.

@david.lee

How did we check that there was no Dremio process running? Did we do a “ps -ef | grep -i dremio”, When you got this “Lock file to RocksDB is currently held by another process. Stop other process before retrying”, we should not remove the lock file

Do you backup the “db” folder, Can we provision new pods and restore the db folder and start Dremio?

This is a new install with no existing db so no need to back it up.

The only pod that is spinning up is the master pod with no executors. I think the master pod is locking itself somehow…

I already ran ps -ef | grep -i dremio and nothing was found

@david.lee

When this happens please send us the kubectl logs from the master

Been busy lately so I tried to do a fresh install of version 11…

Same issue as 4.7.2. Will send over the Master Logs once I get them from the admins.

I started a dremio-admin pod and manually tried to perform the upgrade. I even tried deleting the catalog directory and it just Locks itself.

$ ls ../data/db/catalog
LOCK  LOG  archive
$ rm -r ../data/db/catalog
$ ./dremio-admin upgrade
Lock file to RocksDB is currently held by another process.  Stop other process before retrying.

Upgrade failed
org.rocksdb.RocksDBException: While lock file: /opt/dremio/data/db/catalog/LOCK: No locks available
        Suppressed: java.lang.IllegalStateException: #start was not invoked, so metadataManager is not available

$ ls ../data/db/catalog
LOCK  LOG  archive

@david.lee

You deleted the catalog folder? That may cause additional issues. Do you have a backup or is this a fresh start?

It’s a Fresh Start… Still working on getting logs. It’s a bit slow right now with the holiday…

I’m trying to install the Dremio helm chart with a couple changes like not running the Pods as ROOT which is a security flaw…

Couple things I’ve done…

Removed the CHMOD step and manually ran CHMOD on mounted storage to 999 for Dremio and 1000 for Zookeeper

Problem 1: The BusyBox image version of PING is ROOT executable only. Tried using NGINX UTILS image which is Debian, but for some reason it doesn’t like to run the same shell command:

command: [“sh”, “-c”, “until ping -c 1 -W 1 zk-hs > /dev/null; do echo Waiting for Zookeeper to be ready.; sleep 2; done;”]

/bin/sh: 3: Syntax error: “do” unexpected

I temporarily changed it to just sleep for 30 seconds…

Problem 2: This RocksDB Lock files gets recreated running any type of dremio-admin or dremio start command and Dremio locks itself…

Waiting on Logs…