[Question] About Reflections, C3 and Engines

Hello, everyone! I’ve already read the Dremio documentation about the main features such as Reflections, Cloud Cache, Apache Arrow and the Dynamic engines that can auto-start and auto-stop.

I’m using the AWS Edition.

When I create a reflection, it gets stored in the distributed storage, such as S3. My questions are:

  • Do they get stored in the distributed storage whether they have Arrow caching or not?
  • When I have a computing engine turned on, does it download the reflection/cache to the local NVMe SSD to use? (I believe this is the Cloud Cache, C3.)
  • When this same computing engine goes offline to save money, does it lose the local cache because the NVMe SSD is ephemeral?
  • If yes, when it automatically turns on again, does it have to download the cache again in order to serve the query?

I’d like to know if you have any documentation, if I missed anything, or if you can give me more details about how Dremio makes use of these technologies in practice.

Thanks in advance for your attention.

@mendeel

  • Do they get stored in the distributed storage whether they have Arrow caching or not? - Yes.
  • When I have a computing engine turned on, does it download the reflection/cache to the local NVMe SSD to use? - As you fire queries, the cache will be warmed.

Will get back on the other questions soon

Thanks
Bali

  • When this same computing engine goes offline to save money, does it lose the local cache because the NVMe SSD is ephemeral?

    • Yes, when engines go offline the cache is lost forever, because the nodes terminate. I have asked Dremio to consider supporting stopping engine nodes instead; stopped instances have a much lower hourly cost, and better support for that feature could shift the tradeoff between keeping an instance running and rebuilding the cache. Smaller deployments with simple reflection strategies could benefit from a single-node engine with some cloud cache warmed from multiple data sources.
  • If yes, when it automatically turns on again, does it have to download the cache again in order to serve the query?

    • Yes. Because the engine is terminated, all of its AWS resources are gone; it does not automatically rebuild or re-establish the cache until queries are run. The fastest way to warm it may be to run reflection refreshes on the engine at startup. Remember that JDBC data sources are single-threaded over multiple connections, so you only need one engine node.
      • The fastest rebuild may be to send a set of REST API queries on engine startup. You could create a Lambda function scheduled at T-minus 5 minutes before users typically start querying. You could also write a Lambda function that reads the Dremio logs (assuming you ship them to CloudWatch) and proactively replays those queries on startup.
      • A better design principle overall (for Dremio) might be to warm the cloud cache from S3 onto a single node first, then scale up nodes on the same cluster so they can replicate the cache among themselves. That may be faster or cheaper than loading in parallel from S3; I’m not sure, as I don’t have that scale of data. Perhaps the Kubernetes OSS edition could be used to validate in-house whether that works. At which point parallelism or I/O throttling starts to dominate the time or the cost may matter.
      • Some AWS instance types also offer faster I/O during the first hour after startup. As a lower-tech approach, you could make the most of that hour to load from S3 onto the node, then leave it running for the rest of the day. I’m not sure whether that first-hour I/O boost applies to the limited set of instance types Dremio AWS Edition supports.
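To make the Lambda warm-up idea above concrete, here is a minimal sketch of a handler that logs in to Dremio and submits a list of warm-up queries over its REST API (POST /apiv2/login for a session token, then POST /api/v3/sql to submit SQL). The host, credentials, and query list are placeholders you would replace with your own; this is an illustration of the approach, not official Dremio tooling.

```python
import json
import urllib.request

# Placeholder values: host, credentials, and queries are assumptions, not real defaults.
DREMIO_URL = "http://dremio-coordinator:9047"
WARMUP_QUERIES = [
    'SELECT COUNT(*) FROM source."sales"',  # pick queries that hit your reflections
]

def auth_header(token):
    """Dremio expects 'Authorization: _dremio<token>' on REST calls."""
    return "_dremio" + token

def login(base_url, user, password):
    """POST /apiv2/login and return the session token."""
    body = json.dumps({"userName": user, "password": password}).encode()
    req = urllib.request.Request(
        base_url + "/apiv2/login",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["token"]

def submit_sql(base_url, token, sql):
    """POST /api/v3/sql to submit a query; returns the job id (fire-and-forget)."""
    body = json.dumps({"sql": sql}).encode()
    req = urllib.request.Request(
        base_url + "/api/v3/sql",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": auth_header(token),
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["id"]

def lambda_handler(event, context):
    """Lambda entry point: submit each warm-up query so the engine starts and C3 warms."""
    token = login(DREMIO_URL, "warmup_user", "warmup_password")  # placeholder creds
    return {"jobs": [submit_sql(DREMIO_URL, token, q) for q in WARMUP_QUERIES]}
```

You could schedule this with an EventBridge rule a few minutes before users normally start querying, per the T-minus-5-minutes suggestion above; submitting the queries is enough to auto-start the engine and begin warming the cache.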