Migrate Dremio config to new cluster

I’ve been working on migrating our Dremio deployment from one AWS EKS cluster to another. Most issues are resolved except for copying the config from the old cluster to the new one. I’ve tried both bin/dremio-admin backup and bin/dremio-admin restore.

No problem getting into my master pod on the old cluster and running bin/dremio-admin backup, capturing that in a compressed tar file, then transferring it to the new cluster and unpacking the tar file. However, I’ve run into some showstoppers when trying to restore this backup to Dremio running in the new cluster.
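For reference, the backup-and-transfer steps above look roughly like this (pod name and paths are the ones from this thread; switching your kubectl context between the two clusters is assumed):

```shell
# On the OLD cluster: run the backup inside the master pod, tar it, pull it out.
kubectl exec dremio-master-0 -- bin/dremio-admin backup -u <user> -p <pwd> -d /tmp/dremio_bk
kubectl exec dremio-master-0 -- tar czf /tmp/dremio_bk.tar.gz -C /tmp dremio_bk
kubectl cp dremio-master-0:/tmp/dremio_bk.tar.gz ./dremio_bk.tar.gz

# Switch kubectl context to the NEW cluster, then push and unpack the backup.
kubectl cp ./dremio_bk.tar.gz dremio-master-0:/tmp/dremio_bk.tar.gz
kubectl exec dremio-master-0 -- tar xzf /tmp/dremio_bk.tar.gz -C /tmp
```

This is just a sketch; add `-n <namespace>` / `--context` flags as your setup requires.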

The restore instructions state that Dremio must be stopped, which is where the problems started. Although I am able to connect to the cluster via web browser, bin/dremio status reports that Dremio is stopped on all nodes (master, coordinators and executors). I can issue bin/dremio restart and get status to report that Dremio is started, then issue bin/dremio stop.

When I attempt to stop Dremio, the command pauses for a long time, printing a trail of dots, before finally reporting that it’s trying a kill -9 <PID>. Even after bin/dremio stop tries its kill -9, I can see from ps aux | grep dremio that I’ve still got some Dremio processes running.

I’ve also tried restoring my backup at this point, but I get an error that RocksDB is locked, and the bin/dremio-admin restore process hangs for a long time waiting for the lock to be released.

After many cycles of trying and waiting for the DB lock to be removed, I brute-force removed the lock file mentioned in the warning and proceeded with bin/dremio-admin restore, which ran.

However, after the restore ran, my master pod was pretty hosed. I couldn’t reach it via the web interface after multiple restart attempts and ended up tearing down my deployment in k8s and rebuilding.

What is the best way to copy config from one cluster to another? I assume others have needed to do this for reasons similar to mine. Are there clear instructions on how to do this anywhere?

I also wanted to mention that I’ve been trying to migrate users, connections and reflections using dremio-clone, but have had issues with that as well. So far, all I can get it to do is create empty folders for the two Dremio Spaces that exist on my old cluster. I can see in my (verbose) logs that all of the child directories and views appear as DEBUG entries, but no error is given. I really don’t know why nothing is being created on the target cluster.

Any help would be greatly appreciated.

For k8s clusters, you’ll need to enable DremioAdmin mode to do a restore. DremioAdmin mode ensures that Dremio is shut down properly for a restore.

helm upgrade --wait <chart release name> . --set DremioAdmin=true
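Once the restore completes, flipping the same value back should take the cluster out of admin mode (assuming the standard Dremio Helm chart convention; substitute your own release name):

```shell
helm upgrade --wait <chart release name> . --set DremioAdmin=false
```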


PS: Make sure you’re restoring to the same Dremio version you backed up from.

Thank you @lenoyjacob ! I’ll give this a shot and be in touch!

Thanks again for your help @lenoyjacob . We enabled DremioAdmin and attempted a restore from our other cluster (which required rebuilding the new cluster to start clean), but got the following errors in the terminal when running the restore:

Restored from backup at /tmp/dremio_bk/dremio_backup_2025-08-04_19.57, 41 dremio tables, 17 files.

Per file exceptions:
Restore failed for the 'embedded_pointers' table backup
Restore failed for the 'configuration' table backup
Restore failed for the 'dac-namespace' table backup
Restore failed for the 'catalog-source-data' table backup

I’m also seeing the following Java errors in dremio-master-0 logs:

java.lang.RuntimeException: Failed to retrieve JWK keystore entry password reference from configuration store
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.getPrivateKeyPassword(SystemJWKSetManager.java:542)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.getKeyEntryAttributes(SystemJWKSetManager.java:394)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.lambda$loadKeysFromKeystore$2(SystemJWKSetManager.java:317)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
    at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.loadKeysFromKeystore(SystemJWKSetManager.java:360)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.getRotatingKey(SystemJWKSetManager.java:236)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.enforceConsistentJwksState(SystemJWKSetManager.java:592)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.start(SystemJWKSetManager.java:190)
    at com.dremio.service.SingletonRegistry$AbstractServiceReference.start(SingletonRegistry.java:166)
    at com.dremio.service.ServiceRegistry.start(ServiceRegistry.java:90)
    at com.dremio.service.SingletonRegistry.start(SingletonRegistry.java:47)
    at com.dremio.dac.daemon.DACDaemon.startServices(DACDaemon.java:214)
    at com.dremio.dac.daemon.DACDaemon.init(DACDaemon.java:220)
    at com.dremio.dac.daemon.DremioDaemon.main(DremioDaemon.java:125)
Dremio is exiting. Failure while starting services.
java.lang.RuntimeException: Failed to retrieve JWK keystore entry password reference from configuration store
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.getPrivateKeyPassword(SystemJWKSetManager.java:542)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.getKeyEntryAttributes(SystemJWKSetManager.java:394)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.lambda$loadKeysFromKeystore$2(SystemJWKSetManager.java:317)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
    at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.loadKeysFromKeystore(SystemJWKSetManager.java:360)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.getRotatingKey(SystemJWKSetManager.java:236)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.enforceConsistentJwksState(SystemJWKSetManager.java:592)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.start(SystemJWKSetManager.java:190)
    at com.dremio.service.SingletonRegistry$AbstractServiceReference.start(SingletonRegistry.java:166)
    at com.dremio.service.ServiceRegistry.start(ServiceRegistry.java:90)
    at com.dremio.service.SingletonRegistry.start(SingletonRegistry.java:47)
    at com.dremio.dac.daemon.DACDaemon.startServices(DACDaemon.java:214)
    at com.dremio.dac.daemon.DACDaemon.init(DACDaemon.java:220)
    at com.dremio.dac.daemon.DremioDaemon.main(DremioDaemon.java:125)

Enable verbose logging for the restore and try again.

Also, what’s in the data/security/ directory of the restored cluster?

dremio@dremio-master-0:/opt/dremio/data/security$ ls -lh
total 8.0K
-rw------- 1 dremio dremio  411 Apr 10  2024 system.p12
-rw------- 1 dremio dremio 1.5K Jul 29 22:19 token-manager-keystore.p12

I’ll need to tear down and rebuild my target/new cluster to give you output from verbose logging. It didn’t come back up after the last restore attempt.

I’ve torn down, rebuilt and re-ran the restore with verbose logging enabled:

dremio@dremio-admin:/opt/dremio$ export DREMIO_ADMIN_LOG_VERBOSITY=INFO
dremio@dremio-admin:/opt/dremio$ echo $DREMIO_ADMIN_LOG_VERBOSITY
INFO

Restore was run as follows with attached errors:

dremio@dremio-admin:/opt/dremio$ bin/dremio-admin restore -d /tmp/dremio_bk/dremio_backup_2025-08-04_19.57
Restored from backup at /tmp/dremio_bk/dremio_backup_2025-08-04_19.57, 41 dremio tables, 17 files.

Per file exceptions:
Restore failed for the 'embedded_pointers' table backup
Restore failed for the 'configuration' table backup
Restore failed for the 'dac-namespace' table backup
Restore failed for the 'catalog-source-data' table backup

Can we set verbosity to something higher than INFO? DEBUG will work. Also ensure DREMIO_ADMIN_LOG_DIR is set (see the same doc page).
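For example, something like this before re-running the restore (the log directory path here is just an illustration):

```shell
# Raise dremio-admin verbosity and point its logs at a writable directory.
export DREMIO_ADMIN_LOG_VERBOSITY=DEBUG
export DREMIO_ADMIN_LOG_DIR=/tmp/dremio-admin-logs
mkdir -p "$DREMIO_ADMIN_LOG_DIR"
# then re-run: bin/dremio-admin restore -d <backup dir>
```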

Thanks for the tip. DREMIO_ADMIN_LOG_VERBOSITY is set to DEBUG and I captured logs. The log file was over 161k lines, but here is a capture of the end, where I began to see Java errors:

2025-08-12 00:14:54,767 [shutdown-hook-0] DEBUG c.d.plugins.util.CloseableResource - Closing resource AmazonS3Client
2025-08-12 00:14:54,767 [shutdown-hook-0] DEBUG o.a.h.i.c.PoolingHttpClientConnectionManager - Connection manager is shutting down
2025-08-12 00:14:54,767 [shutdown-hook-0] DEBUG o.a.h.i.c.PoolingHttpClientConnectionManager - Connection manager shut down
2025-08-12 00:14:54,767 [shutdown-hook-0] DEBUG o.a.h.f.s.AWSCredentialProviderList - Closing AWSCredentialProviderList[refcount= 0: [SimpleAWSCredentialsProvider] last provider: SimpleAWSCredentialsProvider
2025-08-12 00:14:54,767 [java-sdk-http-connection-reaper] DEBUG c.a.http.IdleConnectionReaper - Reaper thread: 
java.lang.InterruptedException: sleep interrupted
        at java.base/java.lang.Thread.sleep(Native Method)
        at com.amazonaws.http.IdleConnectionReaper.run(IdleConnectionReaper.java:188)
2025-08-12 00:14:54,767 [java-sdk-http-connection-reaper] DEBUG c.a.http.IdleConnectionReaper - Shutting down reaper thread.
2025-08-12 00:14:54,767 [Thread-2] DEBUG o.a.hadoop.util.ShutdownHookManager - Completed shutdown in 0.007 seconds; Timeouts: 0
2025-08-12 00:14:54,768 [files-delete-on-exit] DEBUG c.a.h.c.ClientConnectionManagerFactory - 
java.lang.reflect.InvocationTargetException: null
        at jdk.internal.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at com.amazonaws.http.conn.ClientConnectionManagerFactory$Handler.invoke(ClientConnectionManagerFactory.java:76)
        at com.amazonaws.http.conn.$Proxy35.connect(Unknown Source)
        at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
        at com.amazonaws.http.apache.client.impl.SdkHttpClient.execute(SdkHttpClient.java:72)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1346)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5558)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5505)
        at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1403)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getObjectMetadata$11(S3AFileSystem.java:2676)
        at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
        at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:431)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2664)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2644)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3735)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3663)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteWithoutCloseCheck(S3AFileSystem.java:3265)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.delete(S3AFileSystem.java:3239)
        at com.dremio.plugins.util.ContainerFileSystem.delete(ContainerFileSystem.java:399)
        at com.dremio.exec.hadoop.HadoopFileSystem.delete(HadoopFileSystem.java:408)
        at com.dremio.io.file.FileSystemUtils$1.run(FileSystemUtils.java:60)
Caused by: java.io.InterruptedIOException: Connection already shutdown
        at org.apache.http.impl.conn.DefaultManagedHttpClientConnection.bind(DefaultManagedHttpClientConnection.java:118)
        at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:135)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:376)
        ... 36 common frames omitted
2025-08-12 00:14:54,768 [files-delete-on-exit] DEBUG o.a.h.i.c.DefaultManagedHttpClientConnection - http-outgoing-40: Shutdown connection
2025-08-12 00:14:54,768 [files-delete-on-exit] DEBUG o.a.h.impl.execchain.MainClientExec - Connection discarded
2025-08-12 00:14:54,768 [files-delete-on-exit] DEBUG o.a.h.i.c.PoolingHttpClientConnectionManager - Connection released: [id: 40][route: {s}->https://<bucket>.s3.us-east-2.amazonaws.com:443][total available: 0; route allocated: 0 of 96; total allocated: 0 of 96]
2025-08-12 00:14:54,768 [files-delete-on-exit] DEBUG com.amazonaws.http.AmazonHttpClient - Unable to execute HTTP request: Connection already shutdown Request will be retried.
2025-08-12 00:14:54,768 [files-delete-on-exit] DEBUG com.amazonaws.request - Retrying Request: HEAD https://<bucket>.s3.us-east-2.amazonaws.com /dremio/uploads/_staging.dremio-admin
2025-08-12 00:14:54,768 [files-delete-on-exit] WARN  o.a.h.f.s.AWSCredentialProviderList - Credentials requested after provider list was closed
2025-08-12 00:14:54,769 [files-delete-on-exit] DEBUG com.amazonaws.latency - ServiceName=[Amazon S3], ServiceEndpoint=[https://<bucket>.s3.us-east-2.amazonaws.com], Exception=[java.io.InterruptedIOException: Connection already shutdown, org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: Credentials requested after provider list was closed], RequestType=[GetObjectMetadataRequest], AWSRequestID=[null], HttpClientPoolPendingCount=0, RetryCapacityConsumed=0, HttpClientPoolAvailableCount=0, RequestCount=2, Exception=2, HttpClientPoolLeasedCount=0, ClientExecuteTime=[7.669], HttpRequestTime=[6.842], ApiCallLatency=[7.345], RequestSigningTime=[0.161], CredentialsRequestTime=[0.118, 0.001, 0.073], 
2025-08-12 00:14:54,769 [files-delete-on-exit] WARN  com.dremio.io.file.FileSystemUtils - Could not delete path <bucket>/dremio/uploads/_staging.dremio-admin
java.nio.file.AccessDeniedException: <bucket>/dremio/uploads/_staging.dremio-admin: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: Credentials requested after provider list was closed
        at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:215)
        at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:174)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3760)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3663)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.deleteWithoutCloseCheck(S3AFileSystem.java:3265)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.delete(S3AFileSystem.java:3239)
        at com.dremio.plugins.util.ContainerFileSystem.delete(ContainerFileSystem.java:399)
        at com.dremio.exec.hadoop.HadoopFileSystem.delete(HadoopFileSystem.java:408)
        at com.dremio.io.file.FileSystemUtils$1.run(FileSystemUtils.java:60)
Caused by: org.apache.hadoop.fs.s3a.auth.NoAuthWithAWSException: Credentials requested after provider list was closed
        at org.apache.hadoop.fs.s3a.AWSCredentialProviderList.getCredentials(AWSCredentialProviderList.java:166)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.getCredentialsFromContext(AmazonHttpClient.java:1269)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeOneRequest(AmazonHttpClient.java:1290)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeHelper(AmazonHttpClient.java:1157)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.doExecute(AmazonHttpClient.java:814)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.executeWithTimer(AmazonHttpClient.java:781)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.execute(AmazonHttpClient.java:755)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutor.access$500(AmazonHttpClient.java:715)
        at com.amazonaws.http.AmazonHttpClient$RequestExecutionBuilderImpl.execute(AmazonHttpClient.java:697)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:561)
        at com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:541)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5558)
        at com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:5505)
        at com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:1403)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.lambda$getObjectMetadata$11(S3AFileSystem.java:2676)
        at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:468)
        at org.apache.hadoop.fs.s3a.Invoker.retryUntranslated(Invoker.java:431)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2664)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.getObjectMetadata(S3AFileSystem.java:2644)
        at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3735)
        ... 6 common frames omitted
2025-08-12 00:14:54,777 [Thread-2] DEBUG o.a.hadoop.util.ShutdownHookManager - ShutdownHookManager completed shutdown.

A few hundred lines up I also see this:

Restore failed for the 'embedded_pointers' table backup 
Restore failed for the 'configuration' table backup 
Restore failed for the 'dac-namespace' table backup 
Restore failed for the 'catalog-source-data' table backup 

2025-08-12 00:14:54,760 [files-delete-on-exit] DEBUG o.apache.hadoop.fs.s3a.S3AFileSystem - Getting path status for s3a://<bucket>/dremio/uploads/_staging.dremio-admin  (dremio/uploads/_staging.dremio-admin); needEmptyDirectory=true
2025-08-12 00:14:54,760 [files-delete-on-exit] DEBUG o.apache.hadoop.fs.s3a.S3AFileSystem - S3GetFileStatus s3a://<bucket>/dremio/uploads/_staging.dremio-admin
2025-08-12 00:14:54,761 [files-delete-on-exit] DEBUG o.apache.hadoop.fs.s3a.S3AFileSystem - HEAD dremio/uploads/_staging.dremio-admin with change tracker null
2025-08-12 00:14:54,761 [files-delete-on-exit] DEBUG o.a.h.f.s.audit.impl.LoggingAuditor - [91] fe45a022-ae11-45bf-917f-ad92c9a61231-00000098 Executing op_delete with {action_http_head_request 'dremio/uploads/_staging.dremio-admin' size=0, mutating=false}; https://audit.example.org/hadoop/1/op_delete/fe45a022-ae11-45bf-917f-ad92c9a61231-00000098/?op=op_delete&p1=s3a://<bucket>/dremio/uploads/_staging.dremio-admin&pr=dremio&ps=9d90e23c-0f1d-4997-8f95-5068bfc70344&id=fe45a022-ae11-45bf-917f-ad92c9a61231-00000098&t0=91&fs=fe45a022-ae11-45bf-917f-ad92c9a61231&t1=91&ts=1754957694760
2025-08-12 00:14:54,761 [shutdown-hook-0] DEBUG org.apache.hadoop.fs.FileSystem - FileSystem.close() by method: org.apache.hadoop.fs.FilterFileSystem.close(FilterFileSystem.java:529)); Key: (dremio (auth:SIMPLE))@file://; URI: file:///; Object Identity Hash: 4fda7e81
2025-08-12 00:14:54,761 [files-delete-on-exit] DEBUG com.amazonaws.request - Sending Request: HEAD https://<bucket>.s3.us-east-2.amazonaws.com /dremio/uploads/_staging.dremio-admin
2025-08-12 00:14:54,761 [shutdown-hook-0] DEBUG org.apache.hadoop.fs.FileSystem - FileSystem.close() by method: org.apache.hadoop.fs.RawLocalFileSystem.close(RawLocalFileSystem.java:895)); Key: null; URI: file:///; Object Identity Hash: 411d990a
2025-08-12 00:14:54,761 [files-delete-on-exit] DEBUG com.amazonaws.auth.AWS4Signer - AWS4 Canonical Request: '"HEAD 
/dremio/uploads/_staging.dremio-admin

Note that I’ve replaced my bucket name with <bucket> in the logs. I noticed mention of a folder in S3 at path dremio/uploads/_staging.dremio-admin quite a few times. Not sure if that’s important, but I did check S3 and don’t see a folder with that name at that path.

Took a stab in the dark and manually created an empty directory at dremio/uploads/_staging.dremio-admin, then tore down/rebuilt our deployment, set the env vars and re-ran the restore. I’m getting the same failures in the restore: embedded_pointers, configuration, dac-namespace and catalog-source-data all continue to fail.

I rebooted the cluster to disable admin mode and we’re still not getting Dremio to start. I see the following Java errors in the dremio-master-0 logs:

java.lang.RuntimeException: Failed to retrieve JWK keystore entry password reference from configuration store
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.getPrivateKeyPassword(SystemJWKSetManager.java:542)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.getKeyEntryAttributes(SystemJWKSetManager.java:394)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.lambda$loadKeysFromKeystore$2(SystemJWKSetManager.java:317)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
    at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.loadKeysFromKeystore(SystemJWKSetManager.java:360)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.getRotatingKey(SystemJWKSetManager.java:236)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.enforceConsistentJwksState(SystemJWKSetManager.java:592)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.start(SystemJWKSetManager.java:190)
    at com.dremio.service.SingletonRegistry$AbstractServiceReference.start(SingletonRegistry.java:166)
    at com.dremio.service.ServiceRegistry.start(ServiceRegistry.java:90)
    at com.dremio.service.SingletonRegistry.start(SingletonRegistry.java:47)
    at com.dremio.dac.daemon.DACDaemon.startServices(DACDaemon.java:214)
    at com.dremio.dac.daemon.DACDaemon.init(DACDaemon.java:220)
    at com.dremio.dac.daemon.DremioDaemon.main(DremioDaemon.java:125)
Dremio is exiting. Failure while starting services.
java.lang.RuntimeException: Failed to retrieve JWK keystore entry password reference from configuration store
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.getPrivateKeyPassword(SystemJWKSetManager.java:542)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.getKeyEntryAttributes(SystemJWKSetManager.java:394)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.lambda$loadKeysFromKeystore$2(SystemJWKSetManager.java:317)
    at java.base/java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:195)
    at java.base/java.util.Iterator.forEachRemaining(Iterator.java:133)
    at java.base/java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
    at java.base/java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:484)
    at java.base/java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:474)
    at java.base/java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:913)
    at java.base/java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.base/java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:578)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.loadKeysFromKeystore(SystemJWKSetManager.java:360)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.getRotatingKey(SystemJWKSetManager.java:236)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.enforceConsistentJwksState(SystemJWKSetManager.java:592)
    at com.dremio.service.tokens.jwks.SystemJWKSetManager.start(SystemJWKSetManager.java:190)
    at com.dremio.service.SingletonRegistry$AbstractServiceReference.start(SingletonRegistry.java:166)
    at com.dremio.service.ServiceRegistry.start(ServiceRegistry.java:90)
    at com.dremio.service.SingletonRegistry.start(SingletonRegistry.java:47)
    at com.dremio.dac.daemon.DACDaemon.startServices(DACDaemon.java:214)
    at com.dremio.dac.daemon.DACDaemon.init(DACDaemon.java:220)
    at com.dremio.dac.daemon.DremioDaemon.main(DremioDaemon.java:125)

Not sure how we fix the issue with Dremio not finding or setting its keystore properly after we do a restore. Would Dremio access its keystore locally (somewhere under /opt/dremio) or from its S3 bucket?

What happens when you delete this file & restart on the restored cluster?

/opt/dremio/data/security/token-manager-keystore.p12

It started!

Looks to be healthy and I went to the endpoint, logged in with my username and password (from the old cluster, didn’t exist previously on the new cluster) and I can see all the user accounts were created.

None of the connections or views were migrated, though.

Step in the right direction in any case!

I ran another test after checking md5sums on the token-manager-keystore.p12 file. I found that the file on my target (new) cluster changed to match the checksum of the file on the cluster I’m migrating away from.
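The check I ran was essentially the comparison below. This is a stand-in demo with temp files; on the real pods you’d point md5sum at /opt/dremio/data/security/token-manager-keystore.p12 on each cluster:

```shell
# Stand-in demo of the md5sum comparison between the source and target keystores.
tmp=$(mktemp -d)
printf 'old-cluster-keystore' > "$tmp/source.p12"
printf 'old-cluster-keystore' > "$tmp/target.p12"   # the restore copied the source keystore over
src=$(md5sum "$tmp/source.p12" | awk '{print $1}')
tgt=$(md5sum "$tmp/target.p12" | awk '{print $1}')
if [ "$src" = "$tgt" ]; then
  echo "target keystore matches source cluster"
else
  echo "target keystore unchanged"
fi
rm -rf "$tmp"
```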

I backed up the original token-manager-keystore.p12 on the target cluster. Then, after running my restore, I tried copying the original keystore file back to /opt/dremio/data/security and rebooted to non-admin mode. Unfortunately, the cluster doesn’t start with its original keystore file.

The only success I’ve had in getting Dremio to start after a restore is to delete /opt/dremio/data/security/token-manager-keystore.p12, but then most of what we need from the restore isn’t available (connections, reflections and views).

I’m wondering if the backup is somehow corrupted. Are you able to retry taking the backup? Also, attaching the (sanitized) restore log here (or DMing it to me) would help.

I can certainly retry taking a backup. Earlier you pointed out that since I’m running my cluster in EKS, I needed to switch to admin mode before running a restore. When I created my backup from the old cluster, I simply shelled into my dremio-master-0 pod and issued the backup command like this:

bin/dremio-admin backup -u <user> -p <pwd> -d /tmp/dremio_bk

I wanted to make sure: do I need to have Dremio services shut down, or the cluster in admin mode, before making the backup?

Also, I used my own username and password to do the backup. There’s no admin or dremio account on the database, but I don’t think that matters since it’s Community Edition. Are there any account/permissions issues to be aware of? I didn’t see any message indicating a permissions problem.

Thanks so much for your help!

No, in fact Dremio has to be online for the backup to happen. Make sure the PV has sufficient disk space.

No, it would have thrown an error if there were a permissions issue.

Cool, I’ll rerun the backup today, then try again with a rebuilt target cluster.

I redid my backup and tried to restore to a rebuilt cluster, but I’m seeing the same issues. I’m working on sanitizing the logs; in the meantime, is there anything else I might try?