I need information on Dremio's architecture, how it works, and where reflections are stored.
I am integrating Dremio with Hadoop/Hive and want to understand how the communication between the coordinator and executors takes place.
What happens when we submit a query in Dremio, and how does it fetch the data on the back end?
I need a detailed workflow so that I can present it to my clients.
Hi Zubair, what specific questions do you have about how Dremio works?
Here are some things that may be helpful:
Dremio is built on Apache Arrow, Apache Parquet, and Apache Calcite.
Dremio reads data from any source into Arrow buffers for in-memory processing.
Dremio has its own SQL engine called Sabot for executing queries, or for augmenting the abilities of the underlying data source (e.g., Elasticsearch doesn't support JOIN, so Dremio pushes down what it can and processes the remaining query fragments in Sabot).
Data Reflections persist as Parquet files on a file system - HDFS, S3, ADLS, NAS.
Dremio has a highly optimized, vectorized Parquet reader, and Parquet is our preferred format at this time.
Dremio is a distributed process. You can run it in containers, on bare metal, or as a YARN native app in your Hadoop cluster. You can scale up/down as needed.
Dremio connects to sources in different ways depending on the source: RDBMS via JDBC; MongoDB via the Java driver; Elasticsearch via their driver; files like Parquet and JSON using our own readers; for Hive, we pick up the schema from the Hive metastore and use our own file readers; etc.
Dremio's UI is intended for data curation, searching the catalog, administration, etc. We expect users to issue queries using their favorite tools over ODBC/JDBC/REST.
Hope that helps and let us know if you have any other specific questions.
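To make the REST path above concrete, here is a minimal sketch of the request shapes involved in submitting SQL to a Dremio coordinator. This is illustrative, not official client code: the endpoint paths (`/api/v3/sql`, `/api/v3/job/{id}`), the `_dremio{token}` authorization header format, and the host/token values are assumptions based on Dremio's v3 REST API, so check the docs for your version.

```python
import json

def build_sql_request(base_url, token, sql):
    """Build the URL, headers, and JSON body for submitting a SQL query.

    Submitting SQL over REST is asynchronous: the response contains a job id
    that you then poll for status and results.
    """
    return {
        "url": f"{base_url}/api/v3/sql",
        "headers": {
            # Dremio-style auth header (assumed format).
            "Authorization": f"_dremio{token}",
            "Content-Type": "application/json",
        },
        "body": json.dumps({"sql": sql}),
    }

def build_job_status_url(base_url, job_id):
    """URL to poll until the submitted job reaches a terminal state."""
    return f"{base_url}/api/v3/job/{job_id}"

# Example: submit a query against a virtual dataset, then poll the job.
req = build_sql_request("http://localhost:9047", "abc123",
                        'SELECT * FROM "space"."sales"')
print(req["url"])
print(build_job_status_url("http://localhost:9047", "job-42"))
```

In practice most BI tools would go over ODBC/JDBC instead; the REST route is handy for scripting and automation.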
Could you share some details about how Dremio keeps data reflections updated? Is it on a time-based schedule, can users call a process to refresh them, or does Dremio poll the underlying datasets to catch changes to the data and trigger a refresh automatically?
What if the job to create a reflection is relatively complex and very expensive to run? How does the Dremio engine handle this?
What about when reflections have a long dependency chain? Would the engine handle updates on the right parts of the chain?
To answer your questions, you specify an SLA for data freshness on each source. Dremio learns how long the update takes and starts in advance to keep the SLA. Depending on the source the update can be a full update or incremental. Currently Dremio does not provide any CDC abilities.
Alternately, you can disable automatic refresh and explicitly call the process via REST.
If there are multiple reflections from a single source, Dremio will daisy chain the updates to minimize load on the source.
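To picture the daisy-chaining, here is a toy sketch (not Dremio's actual scheduler) that orders reflection refreshes so each reflection is refreshed only after the things it is derived from, meaning the source is read once and downstream reflections reuse freshly refreshed parents. The dataset and reflection names are made up for illustration.

```python
from graphlib import TopologicalSorter

# Toy dependency map: reflection -> the upstream dataset(s)/reflection(s)
# it is derived from. (Hypothetical names, not Dremio internals.)
dependencies = {
    "raw_sales_reflection": {"hive.sales"},            # built from the source table
    "daily_agg_reflection": {"raw_sales_reflection"},  # built on the raw reflection
    "monthly_agg_reflection": {"daily_agg_reflection"},
}

def refresh_order(deps):
    """Return a refresh order where every node comes after its parents."""
    return list(TopologicalSorter(deps).static_order())

print(refresh_order(dependencies))
```

Refreshing in this order minimizes load on the source system: only the first reflection in the chain touches `hive.sales`; the rest are refreshed from their parent reflections.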
In the case where analysts have built up reflections that produce similar but not identical values, how does Dremio decide which one to use? Is the sub-tree matching of query plans enough to identify the correct one? Does Dremio allow the user to choose which one to use?
For example: if I'm looking for a GMV value but there are multiple versions of GMV in different reflections, how do I as a user know which is the correct one to choose?
The substitution of a reflection for a physical source is only possible if the two are equivalent in terms of the data and the query. For example, if a query requests an aggregated representation of a dataset, then both raw and aggregation reflections of that physical dataset can be substituted for the physical dataset. If either raw or aggregation reflections are missing columns that are requested by the query, they cannot be substituted.
If a query is a join or union of multiple datasets, then Dremio may use a combination of physical datasets and reflections to execute the query. Another good example of this is star and snowflake schemas where the fact table may have a reflection and the dimension tables do not.
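A rough way to picture the column-coverage rule above is as a set-containment check. This is purely illustrative, not Dremio's planner: real matching works at the query-plan level (sub-tree equivalence plus cost), so column coverage is only the simplest necessary condition.

```python
def can_substitute(reflection_columns, query_columns):
    """A reflection is only eligible for substitution if it covers every
    column the query requests; a missing column disqualifies it."""
    return set(query_columns) <= set(reflection_columns)

# A reflection containing all requested columns is eligible...
print(can_substitute({"region", "amount", "order_date"}, {"region", "amount"}))
# ...but a reflection missing a requested column cannot be substituted.
print(can_substitute({"region", "amount"}, {"region", "amount", "order_date"}))
```

When several reflections pass the equivalence test, the optimizer picks among them by cost rather than asking the user.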
Alternately, you can join a star or snowflake schema in a virtual dataset and create a single reflection that can be used to optimize all the permutations of joins in the schema. You can read more about starflake data reflections here:
Personally I think of Reflections like indexes in a database. You as a user don’t really choose one reflection over another. Rather, the system chooses based on substitution equivalency and cost.
Dremio maintains a dependency graph of datasets. It is very likely a reflection on a superset of data can be used to accelerate many different descendent datasets without the user needing to know about the parent reflection.
But maybe that isn’t quite answering your question? Let me know.
Perhaps my question was off base, but I was trying to better understand how involved the management of a Dremio environment would be for administrators.
In terms of building reflections, is it advisable that users of Dremio be allowed to build any reflections they like to help speed up their query executions? Would this lead us to a place where we have too many reflections and need to run an ever-larger set of executors just to keep up with what our users are doing?
Would we also end up with many reflections that do the same or similar things, leading to wasted resources or, even worse, competing versions of the same metric (e.g., one revenue metric that includes taxes while another excludes them)?
How much are administrators expected to keep things managed and under control?
We’ve seen it a few different ways, and it depends on the sophistication of your data consumers.
One model is that reflection administration is purely a data engineering job. In this model, people who log into Dremio to search the catalog, find datasets, and build new datasets have the opportunity to "upvote" the need for reflections on a dataset.
Data engineers can then decide which reflections would provide the best value in terms of resource utilization. This is an area of the product that we are developing to be more automated, with smarter recommendations across datasets and workloads; today the data engineer makes these decisions. In practice, a relatively small number of reflections can typically accelerate a wide range of workloads.
In another model the data consumer decides for themselves what to accelerate. For this to work the data consumer needs to be sophisticated enough to decide, for example, whether an aggregation reflection or a raw reflection is more appropriate for their queries, whether sorting the data would be beneficial, and so on. In this scenario more reflections tend to be created; however, this can be fine, as the cost is primarily storage and reflection maintenance. Dremio is pretty smart about optimizing the daisy chaining of reflection updates to minimize the load on the source system.
In both scenarios Dremio's workload management features (Enterprise Edition) can help you manage how resources are allocated to reflection maintenance jobs vs. user jobs.
@elhh82 Yup, they are written using the open source Parquet spec, so any other Parquet reader will be able to read them. That said, if you want other tools to use the reflection files, we recommend using CTAS instead. This is because Dremio manages refresh, storage, and expiration/deletion itself, so the storage location (directory) of the most current reflection changes with each reflection refresh's job ID.
I have the same question about reflections as elhh82. Yes, users don't care which reflection is used, but the reflections must be created by somebody, just as in an RDBMS an index must be created by the DBA. For a reflection to be useful, and ideally useful to more than one query or user, it needs to be created efficiently. So the creator needs to understand how Dremio chooses which reflection to use for a given query, just as an RDBMS DBA knows how an index is chosen for a SQL query.
How reflections work and how Dremio chooses a reflection for a given query is the information I need as the "DBA" for Dremio.