What is Dremio, exactly?

Hi All

I am a newbie evaluating Dremio to see what it does well and for which use cases.
My understanding:

  • Dremio is a self-service data platform for BI/data science use cases, where data is prepared and tagged as virtual data assets so that it can be shared across the enterprise, much like Tableau prepares data assets to be consumed by business users.
  • The virtual data asset is represented in the Arrow data format, which supports Reflections (materialized views) and caching as part of Calcite SQL queries.
  • Several data sources can be joined/queried, and some data wrangling (cleaning, transformation) can be applied.
  • The final prepared data set (as a virtual Arrow representation) can be persisted in Parquet format to any storage, as in the sketch below.
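For example, I imagine something like the following CTAS (all source and table names below are placeholders; my understanding is that Dremio writes CTAS output as Parquet to its configured store):

```sql
-- Hypothetical sketch: persist a prepared result as Parquet via CTAS.
-- Source names and the $scratch target are placeholders; Dremio writes
-- CTAS output in Parquet to its configured distributed store.
CREATE TABLE "$scratch"."merged_orders" AS
SELECT o.order_id,
       o.amount,
       c.region
FROM   "s3"."orders"      AS o
JOIN   "hive"."customers" AS c
  ON   o.customer_id = c.customer_id;
```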

I have the following questions, please:

  • If I need to join two data sets, each 1 TB in size, how much memory would I need to perform the join, apply transformations, and save the merged data as a virtual data asset in Parquet? If the whole data set is not brought into memory, then a distributed processing engine (e.g., Spark) is required. So how does Dremio support this use case?
  • Can Dremio still be positioned as a data wrangling tool with extra features that make it stand out from the crowd? Or how is it positioned in the market? May I use it for interactive BI use cases with OLAP queries at sub-second response times? In this regard, can it be positioned as a competitor to AtScale (which offers virtualized data warehouses)?
  • What are the performance gains if the same work is done with Presto, Hive LLAP, Impala, or OLAP-on-Hadoop with Jethro or AtScale (data acceleration)?
  • Can it read ORC tables as a data source?
  • Can we integrate it with any database, such as Teradata?
  • Cloud deployment options?
  • If my total data is 100 TB, what Dremio cluster size do I need (memory-wise)?

Thanks a lot…
Cengiz

Hi Cengiz, thanks for the questions.

First and foremost, Dremio is a scale-out SQL engine based on Apache Arrow. You can think of it as an alternative to Presto, Hive LLAP, Impala, etc. Note that Dremio is focused on fast, interactive queries, so there's usually still a role for Hive or Spark for long-running ETL workloads (e.g., jobs that take hours to complete).
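To make that concrete, here is a minimal sketch of the kind of federated query Dremio runs across sources; the source and table names ("s3"."sales", "teradata"."customers") are hypothetical placeholders:

```sql
-- A minimal sketch of a federated query across two sources;
-- the source and table names are hypothetical placeholders.
SELECT c.region,
       SUM(s.amount) AS total_sales
FROM   "s3"."sales"           AS s
JOIN   "teradata"."customers" AS c
  ON   s.customer_id = c.customer_id
GROUP BY c.region;
```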

Unlike Presto, Hive LLAP, Impala, and other SQL engines, Dremio includes extensive support for Data Reflections (similar to materialized views) and query rewriting. So, like Teradata and Oracle, Dremio will automatically rewrite your queries to use Data Reflections whenever the cost is determined to be lower than running the query against the raw source data. This can easily be 100x-1000x faster than querying the raw data. Also similar to Teradata, Dremio provides extensive workload management capabilities, so operators can prioritize certain workloads.
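For illustration, a reflection can be defined in SQL along these lines (the dataset and column names are hypothetical, and the exact syntax varies by Dremio version; reflections can also be created and managed in the UI):

```sql
-- A sketch of defining an aggregation reflection in SQL; dataset and
-- column names are hypothetical, and exact syntax varies by Dremio
-- version (reflections can also be created and managed in the UI).
ALTER DATASET "sales"."daily_orders"
CREATE AGGREGATE REFLECTION per_region
USING DIMENSIONS (region, order_date)
MEASURES (amount);
```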

As you have noted, in addition to providing the world's fastest way to query data on the data lake, Dremio also provides capabilities around 1) data curation, 2) data virtualization, 3) row- and column-level access controls, 4) an integrated data catalog, and 5) data lineage. Yes, there are products for each of these, but they are mostly proprietary, stand-alone tools. We think the key to making data consumers self-sufficient is to provide an integrated platform that removes friction and simplifies the daily analytics workflow.
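As a quick illustration of the curation side, a virtual dataset is just a saved query that consumers can address by name (the space, table, and column names below are hypothetical):

```sql
-- A sketch of curating a virtual dataset (VDS) that downstream users
-- can query by name; space, table, and column names are hypothetical.
CREATE VDS "analytics"."clean_orders" AS
SELECT order_id,
       TRIM(customer_name)    AS customer_name,
       CAST(order_ts AS DATE) AS order_date,
       amount
FROM   "s3"."raw_orders"
WHERE  amount IS NOT NULL;
```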

I hope this helps!

More answers to your questions below.

Thanks a lot, Kelly. I will dig into this more. I assume Dremio can create the abstraction, which can be shared without loading the data into memory; it does not require data marts, cubes, etc. AtScale, by contrast, requires you to design abstract data cubes and uses some level of caching for BI analytics. Dremio seems superior in this regard (supporting all access patterns: data scientists, ETL, BI analytics, etc.), but I will need to run a benchmark comparison to prove its value on performance as well.

One other thing is that the data source can be in the cloud (such as Snowflake) while the business users are on-premises. In that case, performance will be impacted, whereas AtScale, I assume, indexes the data on-premises for faster access.

Many thanks.

On your last question, Dremio's Data Reflections live wherever you have Dremio deployed. So, if Dremio and the data sources are in the cloud, and the users are on-prem, then the setup will be similar to Snowflake: most network latency will come from the connection between the data consumer and the Dremio cluster. However, this is what I would do anyway, since the heavy lifting is usually done near the data, and the number of records that moves between the data consumer and Dremio is a small fraction of the raw data.

Regarding benchmarking Dremio against other proprietary products, you'll need access to their software and documentation, which are not freely available the way Dremio is.

Be sure to take the Dremio University courses to help you make the most of the product!

Kelly
