Dremio EBS volume snapshot fails

On my Dremio Enterprise AWS edition I had to increase the EBS volume size because the disk filled up from time to time; I increased it from 50 GB (the default) to 100 GB.
Since then, the automatic backup fails every time with a timeout error while snapshotting that disk.

c.dremio.dac.server.AwsConfigurator - Creating a new EBS snapshot
2021-05-13 11:23:54,160 [scheduler-19] INFO c.dremio.dac.server.AwsConfigurator - Snapshot created CreateSnapshotResponse(Description=EBS volume snapshot for project 704b4030-c493-438a-b4c5-ffa816f05ae1, Encrypted=false, OwnerId=729301916023, Progress=, SnapshotId=snap-0be9a2b5a6603e95e, StartTime=2021-05-13T11:23:54Z, State=pending, VolumeId=vol-09a2665d3130fcadc, VolumeSize=100, Tags=[Tag(Key=dremio_project_id, Value=704b4030-c493-438a-b4c5-ffa816f05ae1), Tag(Key=dremio_managed, Value=true), Tag(Key=dremio_version, Value=14.0.0-202103011714040666-9a0c2e10), Tag(Key=dremio_aws_edition, Value=ENTERPRISE)])
2021-05-13 11:23:54,160 [scheduler-19] INFO c.dremio.dac.server.AwsConfigurator - Waiting for the snapshot to be completed.
2021-05-13 11:23:54,270 [scheduler-19] INFO c.dremio.dac.server.AwsConfigurator - Progress 0%
2021-05-13 11:23:55,311 [scheduler-19] INFO c.dremio.dac.server.AwsConfigurator - Progress 0%
2021-05-13 11:23:56,506 [scheduler-19] INFO c.dremio.dac.server.AwsConfigurator - Progress 0%
2021-05-13 11:23:57,567 [scheduler-19] INFO c.dremio.dac.server.AwsConfigurator - Progress 0%
2021-05-13 11:23:58,627 [scheduler-19] INFO c.dremio.dac.server.AwsConfigurator - Progress 0%

2021-05-13 11:28:53,110 [scheduler-19] INFO c.dremio.dac.server.AwsConfigurator - Progress 0%
2021-05-13 11:28:54,165 [scheduler-19] INFO c.dremio.dac.server.AwsConfigurator - Progress 0%
2021-05-13 11:28:55,370 [scheduler-19] INFO c.d.dac.resource.AwsBackupService - Error creating snapshot Snapshot completion timed out. null [com.dremio.dac.server.AwsConfigurator.createSnapshot(AwsConfigurator.java:1846), com.dremio.dac.resource.AwsBackupService.backupProject(AwsBackupService.java:186), com.dremio.dac.resource.AwsBackupService$AutoBackupPolicy.run(AwsBackupService.java:339), com.dremio.service.scheduler.LocalSchedulerService$CancellableTask.run(LocalSchedulerService.java:191), java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511), java.util.concurrent.FutureTask.run(FutureTask.java:266), java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180), java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293), java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149), java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624), java.lang.Thread.run(Thread.java:748)]
2021-05-13 11:28:55,387 [scheduler-19] INFO c.d.dac.resource.AwsBackupService - User Error Occurred [ErrorId: 932b31e0-0ff4-4e07-9bfa-f48a75139737]

I was able to create a snapshot from the Dremio instance using the AWS CLI.
Can you help me understand what the issue is and how to fix it?

Thanks

Any help on this?

Thanks

@VahagnBleyan We are checking on this internally and will get back to you.

@VahagnBleyan The AWS backup service applies a fixed timeout once we send a create-snapshot request. After we send the request, the snapshot appears in the list of backups, but we can’t restore from it unless it is 100% complete. Hence we wait until the snapshot is complete, and that is what you are seeing. The delay in creating the snapshot can come from the AWS side. It depends on the number of blocks modified on the volume since the last snapshot, so if you are writing a lot of data to the volume, this can take some time.
For now, there is no way to configure the timeout. Using the AWS CLI would be the only option. Let me know if this helps.
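
The wait-until-complete behaviour described here can be pictured with a small polling loop. This is only a sketch, not Dremio's actual implementation: the function name `wait_for_snapshot`, the exception `SnapshotTimeout`, and the `get_progress` callback are hypothetical, and the 300-second default is an assumption inferred from the roughly five minutes between the first and last log lines in the original post.

```python
import time
from typing import Callable


class SnapshotTimeout(Exception):
    """Raised when the snapshot does not reach 100% within the allowed time."""


def wait_for_snapshot(
    get_progress: Callable[[], int],
    timeout_s: float = 300.0,       # assumed value, inferred from the logs
    poll_interval_s: float = 1.0,   # the logs show roughly one poll per second
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll get_progress() until it reports 100 or the timeout expires.

    Returns the number of polls that were needed.
    """
    deadline = clock() + timeout_s
    polls = 0
    while True:
        polls += 1
        if get_progress() >= 100:
            return polls
        if clock() >= deadline:
            raise SnapshotTimeout(f"snapshot still incomplete after {timeout_s} s")
        sleep(poll_interval_s)
```

With a loop like this, a snapshot that stays at 0% for the whole window fails the backup even if AWS would have finished it a little later, which matches the symptom in the logs.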

Hi guys, thanks for the reply.
@Rafay, the issue started after the volume resize on AWS, and we don’t have much disk I/O on the Dremio instance. Also, as I said, I was able to create a snapshot with the AWS CLI from the Dremio instance. So what issue could there be on the AWS side that makes your autobackup fail every time?
Will Dremio be able to use snapshots created manually with the AWS CLI, with all the required tags?
I mean, if I try to upgrade to a new Dremio version, will everything go smoothly or will I have problems with the snapshot?

Thanks

Hi @VahagnBleyan. If you look at the logs that you shared:

2021-05-13 11:23:54,270 [scheduler-19] INFO c.dremio.dac.server.AwsConfigurator - Progress 0%
2021-05-13 11:23:55,311 [scheduler-19] INFO c.dremio.dac.server.AwsConfigurator - Progress 0%
…
2021-05-13 11:28:53,110 [scheduler-19] INFO c.dremio.dac.server.AwsConfigurator - Progress 0%
2021-05-13 11:28:54,165 [scheduler-19] INFO c.dremio.dac.server.AwsConfigurator - Progress 0%

The progress shown here is the completion progress of the snapshot. If you look at the timestamps of the log messages, the backup service waited 5 minutes before bailing out, during which there was no progress on the AWS side (still 0%).
When you send a manual command from the instance itself, there is no timeout, hence your snapshot is created successfully.
The manual snapshot can be used to restore, but the restore may have to be done manually. The restore from Dremio’s project UI page may not work for manual snapshots since they were not taken by the backup service.
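
The 5-minute figure can be checked directly from the log timestamps. The following is a minimal sketch; the helper names `parse_log_timestamp` and `elapsed_seconds` are made up for illustration, and it assumes every log line starts with a 23-character timestamp in the format seen in this thread.

```python
from datetime import datetime

# Format of the leading timestamp in the pasted logs, e.g. "2021-05-13 11:23:54,160"
LOG_TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"


def parse_log_timestamp(line: str) -> datetime:
    """Parse the timestamp that occupies the first 23 characters of a log line."""
    return datetime.strptime(line[:23], LOG_TS_FORMAT)


def elapsed_seconds(first_line: str, last_line: str) -> float:
    """Seconds elapsed between the timestamps of two log lines."""
    delta = parse_log_timestamp(last_line) - parse_log_timestamp(first_line)
    return delta.total_seconds()


waiting = ("2021-05-13 11:23:54,160 [scheduler-19] INFO "
           "c.dremio.dac.server.AwsConfigurator - Waiting for the snapshot to be completed.")
timed_out = ("2021-05-13 11:28:55,370 [scheduler-19] INFO "
             "c.d.dac.resource.AwsBackupService - Error creating snapshot "
             "Snapshot completion timed out.")

# Roughly 301 seconds, i.e. about five minutes of waiting
print(elapsed_seconds(waiting, timed_out))
```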

Hi @Rafay, so there is a timeout set in your backup service, and this is why it fails after 5 minutes? As far as I know, the default timeout for AWS is 10 minutes. Can I change that timeout interval somehow?
Also, it would be better to remove that timeout setting from the backup service.
And I noticed that the snapshot progress always shows 0% until the very end, when it jumps straight from 0% to 100% and completes successfully.

Hi @Rafay, so there is no way to change that timeout setting in the Dremio auto backup service?

@VahagnBleyan Unfortunately, at this point there is no way to configure the timeout. I will check internally with the team whether we plan to make it configurable in future releases.
In the meantime, you can try creating a manual snapshot and applying tags to it.
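
One way to script the manual snapshot with tags is sketched below using boto3. The tag keys and the description are copied from the CreateSnapshotResponse in the logs at the top of this thread; whether Dremio treats a snapshot tagged this way like one taken by its own backup service is not confirmed here. The function names are hypothetical, and the code assumes boto3 is installed and AWS credentials and region are already configured on the instance.

```python
from typing import Dict, List


def build_tag_specifications(project_id: str, dremio_version: str,
                             edition: str = "ENTERPRISE") -> List[Dict]:
    """Build the TagSpecifications argument, mirroring the tags seen in the logs."""
    tags = {
        "dremio_project_id": project_id,
        "dremio_managed": "true",
        "dremio_version": dremio_version,
        "dremio_aws_edition": edition,
    }
    return [{
        "ResourceType": "snapshot",
        "Tags": [{"Key": key, "Value": value} for key, value in tags.items()],
    }]


def create_tagged_snapshot(volume_id: str, project_id: str, dremio_version: str) -> str:
    """Create an EBS snapshot with Dremio-style tags and wait for it to complete."""
    import boto3  # third-party; imported here so the helper above works without it

    ec2 = boto3.client("ec2")
    response = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"EBS volume snapshot for project {project_id}",
        TagSpecifications=build_tag_specifications(project_id, dremio_version),
    )
    snapshot_id = response["SnapshotId"]
    # Poll every 15 s for up to 20 minutes; these values are an arbitrary choice
    # for illustration, not something recommended in this thread.
    ec2.get_waiter("snapshot_completed").wait(
        SnapshotIds=[snapshot_id],
        WaiterConfig={"Delay": 15, "MaxAttempts": 80},
    )
    return snapshot_id
```

For the volume in this thread the call would look like `create_tagged_snapshot("vol-09a2665d3130fcadc", "704b4030-c493-438a-b4c5-ffa816f05ae1", "14.0.0-202103011714040666-9a0c2e10")`.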

I already do so, and the curious thing is that if I do a manual backup from the instance, the Dremio autobackup also finishes successfully.

Thanks
Vahagn

I was expecting the Dremio backup service to work fine after the first manual backup. As I said in my first comment, the long delay in the snapshot is possibly due to a large number of modified blocks on the volume. AWS snapshots are incremental: each one captures only the blocks modified since the last snapshot. So your first snapshot might be very slow, but successive snapshots would be faster.