I am currently trying to set up Dremio in AWS using EKS, but I came across a question for which I cannot find an answer.
The Dremio guides recommend EC2 instances with local NVMe storage. But since v21, local distributed storage (DistStorage = Local) is no longer supported.
If I select S3 as Storage, does that mean that I can use EC2 instances without local storage?
Or is the local NVMe storage still used for caching?
Keep in mind that Distributed Storage and C3 (caching) are different.
You’re right in saying that the Distributed Storage needs to be “distributed” as of v21. So it should be on object stores like S3, Azure Storage, GCS or any storage with an S3-compatible API.
C3, on the other hand, needs to be local to the executor (for performance reasons). So it should be on storage attached to the instance, such as an EBS volume or ephemeral instance storage (NVMe/SSD) if your EC2 instance type supports it.
So Dremio does not strictly need instances with NVMe storage; EBS is “enough”.
I have a second question. If an EC2 instance dies or gets shut down, and the EBS cache of that instance is lost, will Dremio run into problems, or will performance just be a bit worse until the cache is rebuilt?
EBS volumes are (generally) persistent. They can survive an EC2 shutdown/restart, so EKS will bind the respective EBS volume again whenever the EC2 instance for a Dremio executor comes back up. If, for whatever reason, the executor’s EBS volume is also destroyed, then the latter applies (i.e. performance will be a little slower until the cache is rebuilt from the queries sent to Dremio).
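To illustrate why a lost cache volume only costs performance and not correctness, here is a minimal Python sketch of a generic read-through cache. It is not Dremio's C3 implementation; the class `ReadThroughCache`, its methods, and the keys are all hypothetical names made up for this example. The dictionary `remote` stands in for the object store (e.g. S3), which stays the source of truth.

```python
class ReadThroughCache:
    """Toy model of an executor-local cache in front of remote object storage.

    Hypothetical illustration only; this is not how Dremio's C3 is implemented.
    """

    def __init__(self, remote):
        self.remote = remote    # stands in for the object store (source of truth)
        self.local = {}         # stands in for the local cache volume (EBS/NVMe)
        self.remote_reads = 0   # counts the slow reads that go to the object store

    def read(self, key):
        # Fast path: the data is already on the local volume.
        if key in self.local:
            return self.local[key]
        # Slow path: fetch from the object store, then keep a local copy.
        self.remote_reads += 1
        value = self.remote[key]
        self.local[key] = value
        return value

    def lose_local_volume(self):
        # Simulates the cache volume being destroyed along with the instance.
        self.local.clear()


remote = {"part-0": b"rows-a", "part-1": b"rows-b"}
cache = ReadThroughCache(remote)

cache.read("part-0")       # first read goes to the object store
cache.read("part-0")       # second read is served locally
cache.lose_local_volume()  # the volume is lost
cache.read("part-0")       # still returns the right data, just via a slow read
```

After the volume is lost, reads keep returning the same data; only the number of remote reads goes up until the frequently queried data has been cached again.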