Hello Dremio Community,
Hope to get some insights and help here.
I am new to Dremio, but I am looking to dig deep as the DevOps engineer who maintains the Dremio OSS instance for our data analyst team, who are the ones actually using it.
(Pardon if this is too much of a story)
A few months back, one of our junior teammates created a custom Kubernetes setup of Dremio (with loose manifest scripts). It does not have the master-coordinator-executor layout; it is just a single instance (no ZooKeeper).
This works fine and now has a lot of data and queries set up in it. I need to perform a Dremio version upgrade and also move this to a Helm-based installation in a new K8s cluster (AKS).
So I followed the routine: I ran kubectl exec -it into the single-instance Dremio pod and ran a backup using dremio-admin
(note: the service was running when I did this, but according to the docs that seems to be OK).
I then created a .tar.gz on the Dremio instance itself, so that I only needed to copy a single file, and used kubectl cp to copy it to my Windows machine.
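For reference, the backup steps above can be sketched roughly as follows. This is a minimal sketch, not the exact commands that were run: the namespace, pod name, paths, the DREMIO_USER / DREMIO_PASSWORD variables and the /opt/dremio/bin/dremio-admin location are all placeholders or assumptions.

```shell
# Sketch only: every value below is a placeholder, not taken from the real setup.
NAMESPACE="${NAMESPACE:-default}"
OLD_POD="${OLD_POD:-dremio-0}"                      # hypothetical pod name
BACKUP_DIR="${BACKUP_DIR:-/tmp/dremio-backup}"      # backup target inside the pod
ARCHIVE="${ARCHIVE:-/tmp/dremio-backup.tar.gz}"     # single file to copy out

backup_and_fetch() {
  # 1. Take the backup inside the running pod (assumes dremio-admin lives in /opt/dremio/bin).
  kubectl exec -n "$NAMESPACE" "$OLD_POD" -- \
    /opt/dremio/bin/dremio-admin backup -u "$DREMIO_USER" -p "$DREMIO_PASSWORD" -d "$BACKUP_DIR" &&
  # 2. Pack the backup folder into one archive so only a single file has to be copied.
  kubectl exec -n "$NAMESPACE" "$OLD_POD" -- \
    tar -czf "$ARCHIVE" -C "$(dirname "$BACKUP_DIR")" "$(basename "$BACKUP_DIR")" &&
  # 3. Copy the archive to the current directory on the local machine.
  kubectl cp "$NAMESPACE/$OLD_POD:$ARCHIVE" "./$(basename "$ARCHIVE")"
}
```

Calling backup_and_fetch with DREMIO_USER and DREMIO_PASSWORD set runs the three steps in order and stops at the first failure. A relative local path is used for kubectl cp because a Windows drive letter such as C: can be misread as a pod name.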
Next, I set up a Helm installation of Dremio. I am using the AzureFile storage class on AKS, and the fresh installation works fine, but obviously it does not yet have the data, tables and queries from our old Dremio.
So I followed the Helm switch --set DremioAdmin=true, ran kubectl exec -it into the admin pod, and again used kubectl cp to copy the backup tar.gz to the AzureFile master PVC. I extracted the tar to a folder and tried running ./dremio-admin restore -d /somepath -r
This did not go well: it reported that the restore happened, but also raised an error related to "operation not permitted".
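The restore attempt above can be sketched in the same way. Again this is only an illustration under assumptions: the admin pod name, the archive name, the target directory on the PVC and the /opt/dremio/bin/dremio-admin location are placeholders, and RESTORE_PATH stands in for the /somepath used above.

```shell
# Sketch only: pod name, archive name and paths are placeholders.
NAMESPACE="${NAMESPACE:-default}"
ADMIN_POD="${ADMIN_POD:-dremio-admin}"              # hypothetical admin pod name
LOCAL_ARCHIVE="${LOCAL_ARCHIVE:-dremio-backup.tar.gz}"
REMOTE_DIR="${REMOTE_DIR:-/opt/dremio/data}"        # assumed mount of the master PVC
RESTORE_PATH="${RESTORE_PATH:-$REMOTE_DIR/dremio-backup}"

restore_from_archive() {
  # 1. Copy the archive from the local machine onto the PVC mounted in the admin pod.
  kubectl cp "./$LOCAL_ARCHIVE" "$NAMESPACE/$ADMIN_POD:$REMOTE_DIR/$LOCAL_ARCHIVE" &&
  # 2. Extract it next to where it was copied.
  kubectl exec -n "$NAMESPACE" "$ADMIN_POD" -- \
    tar -xzf "$REMOTE_DIR/$LOCAL_ARCHIVE" -C "$REMOTE_DIR" &&
  # 3. Run the restore with the same flags as in the post (-d <path> -r).
  kubectl exec -n "$NAMESPACE" "$ADMIN_POD" -- \
    /opt/dremio/bin/dremio-admin restore -d "$RESTORE_PATH" -r
}
```

Whichever user the admin pod runs as when step 3 executes ends up owning the restored files, which matters for the ownership error that appears later in this post.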
So I started editing the Helm _v2 folder, updated the dremio-admin manifest and added
securityContext:
  allowPrivilegeEscalation: false
  runAsUser: 0
and ran helm upgrade. Now the restore worked well.
I also ran dremio-admin clean with -c, -o and -i, and then dremio-admin upgrade; all of these went well too.
I then switched back with --set DremioAdmin=false, and now dremio-master-0 fails. After a bit of digging I found the following error:
2020-11-05T04:42:43.411+0000: [GC (Allocation Failure) [PSYoungGen: 136808K->8187K(260096K)] 136912K->17255K(432128K), 0.0148694 secs] [Times: user=0.03 sys=0.01, real=0.01 secs]
2020-11-05T04:42:43.887+0000: [GC (Allocation Failure) [PSYoungGen: 260091K->14307K(266240K)] 269159K->40859K(438272K), 0.0357166 secs] [Times: user=0.09 sys=0.01, real=0.04 secs]
2020-11-05T04:42:44.408+0000: [GC (Allocation Failure) [PSYoungGen: 266211K->25573K(523264K)] 292763K->55072K(695296K), 0.0183565 secs] [Times: user=0.03 sys=0.01, real=0.01 secs]
2020-11-05 04:42:44,691 [main] INFO c.d.common.scanner.ClassPathScanner - Scanning packages [com.dremio.sabot.task.slicing.SlicingTaskPool, com.dremio.dac, com.dremio.dac.support.SupportService, com.dremio.service.cachemanager, com.dremio.plugins.azure, com.dremio.extra.exec.store.dfs, com.dremio.exec.planner.acceleration.substitution, com.dremio.options, com.dremio.telemetry.api, com.dremio.service.namespace, com.dremio.plugins.adl.store, com.dremio.plugins.mongo, com.dremio.telemetry.utils, com.dremio.plugins.s3.store, com.dremio.provision.yarn.service, com.dremio.service.jobtelemetry.server.store, com.dremio.service.users, com.dremio.exec.fn.hive, com.dremio.service.accelerator, com.dremio.service.reflection, com.dremio.service.voting, com.dremio.exec.store.jdbc, com.dremio.exec.store.hive, com.dremio.exec.store.dfs, com.dremio.resource, com.dremio.resource.basic, com.dremio.exec.store.mock, com.dremio.common.logical, com.dremio.exec.store.dfs, com.dremio.exec.server.options, com.dremio.exec.store.hive, com.dremio.exec.store.hive.exec, com.dremio.dac, com.dremio.dac.support.SupportService, com.dremio.dac.cmd, com.dremio.dac.cmd.upgrade, com.dremio.extras.plugins.elastic, com.dremio.provision, com.dremio.services.configuration, com.dremio.services.configuration.ConfigurationStore, com.dremio.exec.store.jdbc, com.dremio.exec.store.dfs, com.dremio.joust.geo, com.dremio.exec.ExecConstants, com.dremio.exec.catalog, com.dremio.exec.compile, com.dremio.exec.expr, com.dremio.exec.physical, com.dremio.exec.planner.physical, com.dremio.exec.server.options, com.dremio.exec.store, com.dremio.exec.store.dfs.implicit.ImplicitFilesystemColumnFinder, com.dremio.exec.rpc.user.security, com.dremio.sabot, com.dremio.sabot.op.aggregate.vectorized, com.dremio.sabot.rpc.user, com.dremio.service.jobs, com.dremio.plugins.mongo, com.dremio.service.execselector.ExecutorSelectionService, com.dremio.datastore, com.dremio.exec.store.hive, com.dremio.plugins.elastic, com.dremio.exec.store, 
org.apache.hadoop.hive] in locations [jar:file:/opt/dremio/jars/dremio-ce-services-cachemanager-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-azure-storage-plugin-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-ce-sabot-kernel-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-services-options-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-services-telemetry-api-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-client-base-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-services-namespace-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-adls-plugin-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-ce-mongo-plugin-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-services-telemetry-utils-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-s3-plugin-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-yarn-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-services-users-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-ce-jdbc-plugin-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-hive2-plugin-launcher-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-hdfs-plugin-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-services-resourcescheduler-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-hive-plugin-common-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-dac-daemon-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-ce-elasticsearch-plugin-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-provision-common-4.3.0-202005130340290582-323f05d9.jar!/, 
jar:file:/opt/dremio/jars/dremio-services-configuration-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-jdbc-plugin-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-nas-plugin-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-ce-sabot-joust-java-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-services-coordinator-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-mongo-plugin-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-services-execselector-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-hive3-plugin-launcher-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-ce-hive2-plugin-launcher-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-ce-hive3-plugin-launcher-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-elasticsearch-plugin-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/dremio-pdfs-plugin-4.3.0-202005130340290582-323f05d9.jar!/, jar:file:/opt/dremio/jars/3rdparty/dremio-hive2-exec-shaded-4.3.0-202005130340290582-323f05d9.jar!/] took 1551ms
2020-11-05 04:42:44,802 [main] INFO c.d.d.a.LegacyKVStoreProviderAdapter - Starting LegacyKVStoreProviderAdapter.
2020-11-05 04:42:44,803 [main] INFO c.d.d.a.LegacyKVStoreProviderAdapter - Starting underlying KVStoreProvider.
2020-11-05 04:42:44,803 [main] INFO c.d.datastore.LocalKVStoreProvider - Starting LocalKVStoreProvider
2020-11-05 04:42:44,818 [main] INFO c.d.d.a.LegacyKVStoreProviderAdapter - Stopping LegacyKVStoreProviderAdapter by stopping underlying KVStoreProvider.
2020-11-05 04:42:44,818 [main] INFO c.d.datastore.LocalKVStoreProvider - Stopping LocalKVStoreProvider
2020-11-05 04:42:44,822 [main] ERROR ROOT - Dremio is exiting. Failure while starting services.
com.dremio.datastore.DatastoreException: Process user (dremio) doesn't match local catalog db owner (root). Please run process as root.
at com.dremio.datastore.ByteStoreManager.verifyDBOwner(ByteStoreManager.java:165)
at com.dremio.datastore.ByteStoreManager.start(ByteStoreManager.java:192)
at com.dremio.datastore.CoreStoreProviderImpl.start(CoreStoreProviderImpl.java:171)
at com.dremio.datastore.LocalKVStoreProvider.start(LocalKVStoreProvider.java:151)
at com.dremio.datastore.adapter.LegacyKVStoreProviderAdapter.start(LegacyKVStoreProviderAdapter.java:65)
at com.dremio.dac.cmd.upgrade.Upgrade.run(Upgrade.java:181)
at com.dremio.dac.daemon.DremioDaemon.main(DremioDaemon.java:141)
Suppressed: java.lang.IllegalStateException: #start was not invoked, so metadataManager is not available
at com.google.common.base.Preconditions.checkState(Preconditions.java:508)
at com.dremio.datastore.ByteStoreManager.getMetadataManager(ByteStoreManager.java:424)
at com.dremio.datastore.ByteStoreManager.close(ByteStoreManager.java:431)
at com.dremio.common.AutoCloseables.close(AutoCloseables.java:126)
at com.dremio.common.AutoCloseables.close(AutoCloseables.java:76)
at com.dremio.datastore.CoreStoreProviderImpl.close(CoreStoreProviderImpl.java:258)
at com.dremio.datastore.LocalKVStoreProvider.close(LocalKVStoreProvider.java:195)
at com.dremio.datastore.adapter.LegacyKVStoreProviderAdapter.close(LegacyKVStoreProviderAdapter.java:84)
at com.dremio.dac.cmd.upgrade.Upgrade.run(Upgrade.java:184)
... 1 common frames omitted
So I assume this is happening because the newly restored DB folder is owned by root, while Dremio runs as dremio:dremio (user:group).
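That assumption can be checked directly by comparing the numeric owner of the catalog folder with the uid of the process user. The helper below is a minimal sketch: the function name check_db_owner is made up for illustration, and it relies on GNU or BusyBox stat (the -c '%u' format), which I assume is available in the container image.

```shell
# Compare the owner uid of a directory with an expected uid
# (defaults to the uid of the user running this function).
check_db_owner() {
  db_dir="$1"
  expected_uid="${2:-$(id -u)}"
  # stat -c '%u' prints the numeric owner; return 2 if the path cannot be read.
  owner_uid="$(stat -c '%u' "$db_dir")" || return 2
  if [ "$owner_uid" = "$expected_uid" ]; then
    echo "ok: $db_dir is owned by uid $owner_uid"
  else
    echo "mismatch: $db_dir is owned by uid $owner_uid, expected uid $expected_uid"
    return 1
  fi
}
```

Inside a pod that mounts the master PVC this could be called as check_db_owner /opt/dremio/data/db, where that path is an assumption about where the chart keeps the local catalog. A mismatch reporting owner uid 0 would be consistent with the "db owner (root)" message in the log above.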
I tried multiple things, all in vain, and the master would not start. In one of these attempts I set runAsUser: 0 (i.e. run as root) for just the master pod. That did bring the master up, but all operations in the Dremio portal were failing with NativeIOException: Operation not permitted. So I went into the Helm _v2 folder and added the runAsUser: 0 part to each and every container that spins up as part of this Helm installation.
Now everything works like a charm, which is great: the backup was restored and works well, so we didn't lose any work, and we have also upgraded and moved to the new AKS cluster using Helm.
My concerns are:
- Running with root privileges in a container is not good practice. Thoughts?
- Because this only works through lots of edits to the Helm package, I will have to repeat them for any future Helm package update and maintain this custom solution going forward, which defeats the purpose of using a Helm package in the first place.
I need some help here so that I can go back to the unedited Helm release and still keep the restored backup/DB content intact.