Dremio storage HA

Hi, I would like to understand how Dremio interacts with storage to retrieve reflections when one of the executor or coordinator nodes has that data on its local disk.

Let's say I have configured distributed storage to PDFS on local disk only. I enable a reflection on a dataset, and the reflection job runs on node1, which is an executor node; I have two more executor nodes, node2 and node3, all configured for HA through external ZooKeeper. The node1 reflection job materializes the data on node1's disk, and everything works fine. Now when a query request lands on machine 2, i.e. the node2 executor, it will try to pull the reflection data from node1's local disk and execute the query in node2's memory. When node1 is down, node1's local disk is also unavailable. In this case, when a query lands on node2, from where will it try to retrieve the reflection data? Will it start pulling from the target source again and re-materialize the reflection on node2? Does that happen while the query is being fired (which would slow down the query thread), or does it happen on a different thread, and will the query thread wait or error out in such a situation?
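For reference, a minimal dremio.conf sketch of the setup described above; the paths and ZooKeeper hosts are hypothetical, and paths.dist uses the default PDFS-over-local-disk layout:

```
# dremio.conf on node1/node2/node3 (hypothetical paths and hosts)
paths: {
  local: "/var/lib/dremio"              # node-local data directory
  dist: "pdfs://"${paths.local}"/pdfs"  # distributed store = PDFS over each node's local disk
}

zookeeper: "zk1:2181,zk2:2181,zk3:2181" # external ZooKeeper quorum for HA

services: {
  coordinator.enabled: true             # set false on executor-only nodes
  executor.enabled: true
}
```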

For better query performance, i.e. sub-second response times, storing reflections on local disk looks optimal, but how does that scale when the data size grows and nodes holding the disks go down?

If you are using PDFS and a reflection is located on node1 and node1 goes down, you will not be able to access the reflection. You can still run the query as normal, but it will not be accelerated. If you recreate the same reflection while node1 is down, it will be stored on node2 and/or node3. Unfortunately there is no failover for reflections stored on local disk; that is just a disadvantage of it. However, an advantage is that read times are usually a lot faster compared to cloud storage (ADLS/S3).

There are advantages and disadvantages to any type of storage; it really just depends on your needs.

Hi Anthony, I understand that part, but don't you think Dremio should provide a feature to handle this, given that reflections and performance are the primary goals of building a product like Dremio?

  1. Query affinity to the reflection store. I.e. if a query is fired against a virtual dataset that has a reflection enabled, it will try to go to the executor node where the reflection exists; if it is not found, it should try to recreate the reflection in the background and let the main query go to the actual backend for the first time.

  2. Does Dremio analyze whether reading from a reflection stored in cloud storage like S3 is more time consuming than getting the data from the direct source, and go to the direct source in that case? When does this cost calculation happen?

  3. What is the recommended pattern for the reflection store? There will be cases where we lose a node and the reflection is gone. Is there an API where we can force the reflection job to execute on a specific node? How do we enforce that the reflection job is executed on one specific node, i.e. node1 or node2, etc.?

  4. When a query is executed on an executor node where the reflection does not exist but the reflection exists on another node, will the executor where the query is being executed fetch data from the other node's reflection store, or will it go to the direct source of the data?

Can you please answer 1, 2, 3, and 4?

  1. Not available today. A reflection job is different from a regular query job. If we kicked off a reflection job every time a reflection exists but is not found (for any number of reasons), that could be a very costly feature. You could always try to automate this checking/update with our API (see the sketch after this list).

  2. Yes. It happens the moment a query is run; that is when the query plan is calculated.

  3. The majority of Dremio's customers run the reflection store on HDFS, S3, or ADLS.

  4. It will go straight to the source, since Dremio doesn't know which part of the reflection (which part of the data) is actually available on which node. We keep/store/understand the metadata behind every reflection, but not all the data.
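As a rough sketch of automating that check with the REST API: the snippet below lists reflections via the v3 reflection endpoint and flags any that are not currently serving queries. The coordinator URL, credentials, and the exact status field it keys off are assumptions; adjust them to your deployment.

```python
import requests

DREMIO = "http://dremio-coordinator:9047"  # hypothetical coordinator URL

# Authenticate and grab a token (Dremio's login endpoint)
login = requests.post(f"{DREMIO}/apiv2/login",
                      json={"userName": "admin", "password": "changeme"})
login.raise_for_status()
headers = {"Authorization": "_dremio" + login.json()["token"]}

# List all reflections via the v3 API
resp = requests.get(f"{DREMIO}/api/v3/reflection", headers=headers)
resp.raise_for_status()

for refl in resp.json().get("data", []):
    status = refl.get("status", {})
    # Assumes 'availability' reports whether the materialization can serve queries
    if status.get("availability") != "AVAILABLE":
        print(f"Reflection {refl.get('name')} ({refl.get('id')}) is "
              f"{status.get('availability')}; it may need to be refreshed or recreated.")
```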

We recommend using highly available storage for the reflection store, such as HDFS, S3, ADLS, etc.

  1. This is effectively what happens. If a reflection is unavailable, Dremio will run the query on the physical dataset. Dremio's reflection maintenance engine will refresh/recreate the reflection per the data freshness SLA (it will not initiate this task automatically). When using HDFS/S3/ADLS, etc., reflections should not become unavailable.

  2. This cost determination happens at query time.

  3. See the initial statement: Dremio decouples the compute and storage layers. When using HDFS/S3/ADLS, etc., the location of reflections should not be a concern.

  4. Same answer as 3.

In general, I would discourage the use of PDFS for production deployments.
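For example, moving the distributed store off PDFS is essentially a one-line change in dremio.conf. A sketch, with hypothetical hostnames, bucket, and paths (S3 credentials would go in a core-site.xml under Dremio's conf directory):

```
# dremio.conf - reflection/distributed store on HA storage instead of PDFS
paths: {
  local: "/var/lib/dremio"

  # HDFS-backed distributed store (hypothetical namenode/path)
  dist: "hdfs://namenode:8020/dremio/dist"

  # or an S3 bucket, e.g.:
  # dist: "dremioS3:///my-dremio-bucket/dist"
}
```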

Thanks Anthony and Kelly. I like the separation of compute and storage by design. That gives a lot of flexibility to scale and overcome operational challenges.

Does that mean that if I run Dremio in a YARN setup with HDFS as storage, I still will not get the benefit of data locality with respect to compute?

Which will be more efficient: Dremio + YARN on HDFS, or a standalone Dremio cluster with ZooKeeper for HA on well-provisioned machines plus S3?

It would be good to have tiered metadata and reflection storage, i.e. first look up the partitions in a local cache plus disk and, if not found, fetch them from S3.

Something similar to how Presto Raptor works with disk and flash.

Yes, Dremio will take advantage of data locality in HDFS. Dremio via YARN is efficient, and this is how many of our customers deploy.

Dremio + S3 managed via Kubernetes is also efficient and a popular deployment model.

Tiered storage of reflections is something on our roadmap. Lots of interesting options in this area.