What is Dremio, exactly?

Hi All

I am a newbie evaluating Dremio to see what it does well and for which use cases.
My understanding:

  • Dremio is a self-service data platform for BI/data science use cases, where data is prepared and tagged as virtual data assets so that it can be shared across the enterprise, much like Tableau prepares data assets to be consumed by business users.
  • The virtual data asset is represented in the Arrow data format, which supports Reflections (materialized views) and caching as part of Calcite SQL queries.
  • Several data sources can be joined/queried, and some data wrangling (cleaning, transformation) can be applied.
  • The final prepared data set (as a virtual Arrow representation) can be persisted in Parquet format to any storage, as in the sketch below.
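For example, I imagine something like the following CTAS (all source and table names below are placeholders; my understanding is that Dremio writes CTAS output as Parquet to its configured store):

```sql
-- Hypothetical sketch: persist a prepared result as Parquet via CTAS.
-- Source names and the $scratch target are placeholders; Dremio writes
-- CTAS output in Parquet to its configured distributed store.
CREATE TABLE "$scratch"."merged_orders" AS
SELECT o.order_id,
       o.amount,
       c.region
FROM   "s3"."orders"      AS o
JOIN   "hive"."customers" AS c
  ON   o.customer_id = c.customer_id;
```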

I have the following questions, please:

  • If I need to join two data sets, each 1 TB in size, how much memory would I need to perform the join, apply transformations, and save the merged data as a virtual data asset in Parquet? If the whole data set is not brought into memory, then a distributed processing engine (e.g., Spark) is required. So how does Dremio support this use case?
  • Can Dremio still be positioned as a data wrangling tool with extra features that make it stand out from the crowd? Or how is it positioned in the market? May I use it for interactive BI use cases with OLAP queries at sub-second response times? In this regard, can it be positioned as a competitor to AtScale (which offers virtualized data warehouses)?
  • What are the performance gains if the same work is done with Presto, Hive LLAP, Impala, or OLAP-on-Hadoop with Jethro or AtScale (data acceleration)?
  • Can it read ORC tables as a data source?
  • Can we integrate it with any database, such as Teradata?
  • Cloud deployment options?
  • If my total data is 100 TB, what Dremio cluster size do I need (memory-wise)?

Thanks a lot…
Cengiz

Hi Cengiz, thanks for the questions.

First and foremost, Dremio is a scale-out SQL engine based on Apache Arrow. You can think of it as an alternative to Presto, Hive LLAP, Impala, etc. Note that Dremio is focused on fast, interactive queries, so there's usually still a role for Hive or Spark for long-running ETL workloads (e.g., jobs that take hours to complete).
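To make that concrete, here is a minimal sketch of the kind of federated query Dremio runs across sources; the source and table names ("s3"."sales", "teradata"."customers") are hypothetical placeholders:

```sql
-- A minimal sketch of a federated query across two sources;
-- the source and table names are hypothetical placeholders.
SELECT c.region,
       SUM(s.amount) AS total_sales
FROM   "s3"."sales"           AS s
JOIN   "teradata"."customers" AS c
  ON   s.customer_id = c.customer_id
GROUP BY c.region;
```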

Unlike Presto, Hive LLAP, Impala, and other SQL engines, Dremio includes extensive support for Data Reflections (similar to materialized views) and query rewriting. So, like Teradata and Oracle, Dremio will automatically rewrite your queries to use Data Reflections whenever the cost is determined to be lower than running the query against the raw source data. This can easily be 100x-1000x faster than querying the raw data. Also similar to Teradata, Dremio provides extensive workload management capabilities, so operators can prioritize certain workloads.
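For illustration, a reflection can be defined in SQL along these lines (the dataset and column names are hypothetical, and the exact syntax varies by Dremio version; reflections can also be created and managed in the UI):

```sql
-- A sketch of defining an aggregation reflection in SQL; dataset and
-- column names are hypothetical, and exact syntax varies by Dremio
-- version (reflections can also be created and managed in the UI).
ALTER DATASET "sales"."daily_orders"
CREATE AGGREGATE REFLECTION per_region
USING DIMENSIONS (region, order_date)
MEASURES (amount);
```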

As you have noted, in addition to providing the world's fastest way to query data on the data lake, Dremio also provides capabilities around 1) data curation, 2) data virtualization, 3) row- and column-level access controls, 4) an integrated data catalog, and 5) data lineage. Yes, there are products for each of these, but they are mostly proprietary, stand-alone tools. We think the key to making data consumers self-sufficient is to provide an integrated platform that removes friction and simplifies the daily analytics workflow.
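As a quick illustration of the curation side, a virtual dataset is just a saved query that consumers can address by name (the space, table, and column names below are hypothetical):

```sql
-- A sketch of curating a virtual dataset (VDS) that downstream users
-- can query by name; space, table, and column names are hypothetical.
CREATE VDS "analytics"."clean_orders" AS
SELECT order_id,
       TRIM(customer_name)    AS customer_name,
       CAST(order_ts AS DATE) AS order_date,
       amount
FROM   "s3"."raw_orders"
WHERE  amount IS NOT NULL;
```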

I hope this helps!

More answers to your questions below.

Thanks a lot, Kelly. I will dig into this more. I assume Dremio can create the abstraction, which can be shared without loading the data into memory; it does not require data marts, cubes, etc. AtScale, by contrast, requires you to design abstract data cubes and uses some level of caching for BI analytics. Dremio seems superior in this regard (supporting all access patterns: data scientists, ETL, BI analytics, etc.), but I will need to run a benchmark comparison to prove its value on performance as well.

One other thing is that the data source can be in the cloud (such as Snowflake) while the business users are on-premises. In that case, performance will be impacted, whereas AtScale, I assume, indexes the data on-premises for faster access.

Many thanks.

On your last question, Dremio's Data Reflections live wherever you have Dremio deployed. So, if Dremio and the data sources are in the cloud, and the users are on-prem, then the setup will be similar to Snowflake: most network latency will come from the connection between the data consumer and the Dremio cluster. However, this is what I would do anyway, since the heavy lifting is usually done near the data, and the number of records that moves between the data consumer and Dremio is a small fraction of the raw data.

Regarding benchmarking Dremio against other proprietary products, you'll need access to their software and documentation, which are not freely available the way Dremio is.

Be sure to take the Dremio University courses to help you make the most of the product!

Kelly
