[Question] About Reflections, C3 and Engines

Hello, everyone! I’ve already read the Dremio documentation about the main features such as Reflections, Cloud Cache, Apache Arrow and the Dynamic engines that can auto-start and auto-stop.

I’m using the AWS Edition.

When I create a reflection, it gets stored in the distributed storage, such as S3. My questions are:

  • Do they get stored in the distributed storage whether they have Arrow caching or not?
  • When I have a computing engine turned on, does it download the reflection/cache to the local NVMe SSD to use? (I believe this is the Cloud Cache, C3.)
  • When this same computing engine goes offline to save money, does it lose the local cache because the NVMe SSD is ephemeral?
  • If yes, when it automatically turns on again, does it have to download the cache again in order to serve the query?

I’d like to know if you have any documentation, if I missed anything, or if you can give me more details about how Dremio makes use of these technologies in practice.

Thanks in advance for your attention.

@mendeel

  • Do they get stored in the distributed storage whether they have Arrow caching or not? - Yes.
  • When I have a computing engine turned on, does it download the reflection/cache to the local NVMe SSD to use? - As you fire queries, the cache will be warmed.

Will get back on the other questions soon

Thanks
Bali

  • When this same computing engine goes offline to save money, does it lose the local cache because the NVMe SSD is ephemeral?

    • Yes, when engines go offline the cache is lost forever, because the nodes terminate. I have asked Dremio to consider supporting stopping engine nodes instead; stopped instances have a much lower hourly cost, and better support for that feature could shift the tradeoff between keeping an instance running and rebuilding the cache. Smaller deployments with simple reflection strategies could benefit from a single-node engine with some cloud cache warmed from multiple data sources.
  • If yes, when it automatically turns on again, does it have to download the cache again in order to serve the query?

    • Yes. Because the engine is terminated, all of its AWS resources are gone; it does not automatically rebuild or re-establish the cache until queries are run. The fastest way to warm it may be to run reflection refreshes on the engine at startup. Remember that JDBC data sources are single-threaded over multiple connections, so you only need one engine node.
      • The fastest rebuild may be to send a set of REST API queries on engine startup. You could create a Lambda function scheduled at T-minus 5 minutes before users typically start querying. You could also write a Lambda function that reads the Dremio logs (assuming you ship them to CloudWatch) and proactively replays those queries on startup.
      • A better design principle overall (for Dremio) might be to warm the cloud cache from S3 onto a single node first, then scale up nodes on the same cluster so they can replicate the cache among themselves. That may be faster or cheaper than loading in parallel from S3; I’m not sure, as I don’t have that scale of data. Perhaps the Kubernetes OSS edition could be used to validate in-house whether that works. At which point parallelism or I/O throttling starts to dominate the time or the cost may matter.
      • Some AWS instance types also offer faster I/O during the first hour after startup. As a lower-tech approach, you could make the most of that hour to load from S3 onto the node, then leave it running for the rest of the day. I’m not sure whether that first-hour I/O boost applies to the limited set of instance types Dremio AWS Edition supports.
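To make the Lambda warm-up idea above concrete, here is a minimal sketch of a handler that logs in to Dremio and submits a list of warm-up queries over its REST API (POST /apiv2/login for a session token, then POST /api/v3/sql to submit SQL). The host, credentials, and query list are placeholders you would replace with your own; this is an illustration of the approach, not official Dremio tooling.

```python
import json
import urllib.request

# Placeholder values: host, credentials, and queries are assumptions, not real defaults.
DREMIO_URL = "http://dremio-coordinator:9047"
WARMUP_QUERIES = [
    'SELECT COUNT(*) FROM source."sales"',  # pick queries that hit your reflections
]

def auth_header(token):
    """Dremio expects 'Authorization: _dremio<token>' on REST calls."""
    return "_dremio" + token

def login(base_url, user, password):
    """POST /apiv2/login and return the session token."""
    body = json.dumps({"userName": user, "password": password}).encode()
    req = urllib.request.Request(
        base_url + "/apiv2/login",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["token"]

def submit_sql(base_url, token, sql):
    """POST /api/v3/sql to submit a query; returns the job id (fire-and-forget)."""
    body = json.dumps({"sql": sql}).encode()
    req = urllib.request.Request(
        base_url + "/api/v3/sql",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": auth_header(token),
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]

def lambda_handler(event, context):
    """Lambda entry point: submit each warm-up query so the engine starts and C3 warms."""
    token = login(DREMIO_URL, "warmup_user", "warmup_password")  # placeholder creds
    return {"jobs": [submit_sql(DREMIO_URL, token, q) for q in WARMUP_QUERIES]}
```

You could schedule this with an EventBridge rule a few minutes before users normally start querying, per the T-minus-5-minutes suggestion above; submitting the queries is enough to auto-start the engine and begin warming the cache.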