At this time we do not have any public benchmarks. We encourage users to perform their own performance tests that are representative of their workloads and infrastructure. Performance testing is certainly part of most Dremio evaluations, and we regularly hear very favorable impressions from users who compare Dremio to a variety of technologies, including some of those you have listed.
We are always happy to help with timing questions, just ask.
I've personally found that Spark, Dremio, and Hive are not mutually exclusive. The simplicity of Dremio is a bonus for the people who "just wanna do SQL". To build a metaphor: when people come to "fish" data in the lake, some come with DIY rods, others with off-the-shelf rods, while the data-fishing gurus come with all sorts of rods, each with its own advantages.
We're testing it on par with Spark SQL and the numbers are roughly equal. Most of the time is spent in reading rather than in the Dremio or Spark layers.
Have you tried data reflections yet? Our goal is that raw performance is roughly on par with the best SQL engines out there (we have heard 2x-5x better from some users), and data reflections can then provide a very significant enhancement on top of raw query execution. As the community continues to improve Arrow, those gains carry through to query execution, with and without data reflections.
Dremio includes a vectorized Parquet reader that reads directly into Arrow, so you should see some nice benefits when Parquet is the source if reads are your bottleneck (assuming I/O isn't the limiting factor, of course).
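This is not Dremio's internal reader, just an analogy for the same Parquet-to-Arrow idea: pyarrow's Parquet reader also decodes column chunks straight into Arrow memory. The file path and column names below are placeholders.

```python
# Analogy only: read Parquet columns directly into an Arrow table,
# skipping a row-by-row deserialization step.
import pyarrow.parquet as pq

table = pq.read_table(
    "events/day=2018-01-01/part-00000.parquet",  # placeholder path
    columns=["user_id", "asset", "category"],     # placeholder columns
)
print(table.num_rows, table.schema)
```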
And I agree that Hive and Spark still have a role in the modern big data stack. Hive and Spark can be especially good for "long haul" ETL, for jobs that run for extended periods of time. Dremio is especially good for low-latency query processing and for "last mile" ETL, where transformations are applied at query time without making copies of the data. Of course Dremio does other things too, like a searchable data catalog, data lineage, and curation abilities. But it can make a lot of sense to combine Hive, Spark, and Dremio.
Thanks also, Kelly, for clarifying the scope of each technology; it is quite important to define the perimeter of each, and then to define the use cases.
In my opinion, it would be quite important to improve the interaction between Dremio and Spark by developing a specific Spark connector for reading from and writing to Dremio. For reading, Spark could read directly from Arrow, as it does with Python UDFs; for writing, instead of using the REST API to create physical/virtual datasets or launch a reflection task, the connector would enable this directly. In short, a tight integration with Spark would be interesting (for instance, also for Spark's streaming capabilities).
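For comparison, here is a minimal sketch of what the Spark-to-Dremio path looks like today over JDBC, without an Arrow-native connector. The JDBC URL, the presence of the Dremio JDBC driver jar on the classpath, the credentials, and the dataset name are all assumptions for illustration.

```python
# Sketch: Spark reading a Dremio virtual dataset over JDBC.
# Assumes the Dremio JDBC driver jar was supplied, e.g. via --jars.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dremio-jdbc-read").getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:dremio:direct=dremio-coordinator:31010")  # assumed host:port
      .option("driver", "com.dremio.jdbc.Driver")
      .option("dbtable", '"myspace"."daily_active_users"')           # hypothetical dataset
      .option("user", "spark_user")
      .option("password", "...")
      .load())

df.groupBy("asset").count().show()
```

An Arrow-native connector, as proposed above, would avoid the row-based JDBC serialization step entirely.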
If so, I would suggest two missing parts in Dremio:
a workflow editor, something we find for instance in Dataiku, Domino Data Lab, or other products;
an action-sequence view alongside the SQL editor, as may be found for instance in Dataiku and OpenDataSoft.
The workflow editor is useful to describe actions on different datasets (joins, splits, multiplexing, external actions like a Spark notebook, and so on); it also gives a visual lineage of what has been done. What I see is using the SQL editor for cleansing and handling data on a dataset to create virtual datasets on a single source, and the workflow editor for combining and processing these virtual datasets to produce other ones.
The action-sequence view is a panel on the left of the SQL editor describing all the transformations applied to the current dataset (it is called a scenario in Dataiku); we then see visually the whole sequence of what has been done on a given dataset, and we may want to change the parameters of a transformation directly, or the order of transformations. It is very useful.
Thanks for the feedback! @can is following this thread as well.
Requests for workflow-style editors come up frequently, and this is something we are looking at. Sometimes it comes up because people think of Dremio as a kind of ETL tool, so I thought it would make sense to comment on this a bit. Currently we are not targeting large, bulk transformations with Dremio, or what you might call "long haul" ETL. We think ETL tools or long-running engines like Hive or Spark are the right way to do this. Instead, we are targeting "last mile" transformations, filtering, aggregations, and security controls (e.g., masking) that are performed at runtime.
In our Enterprise Edition we include a data provenance and lineage capability. Here's what that looks like:
This feature helps visualize the relationships between datasets. We track relationships in a dependency graph, which makes it easy to understand data use patterns and "what if" scenarios, among other things.
Another feature relevant to your comments is the transformation history available on every dataset. You can find this on the right hand side of the dataset viewer:
You can hover over each dot to see the changes applied to the dataset, as well as who performed them. This is effectively a versioning mechanism for the dataset: if you click on any of these dots, you quickly toggle to that version of the dataset.
If it is of any help to somebody: Presto (1.42m) vs. Dremio (30s) on the biggest query we have (daily active users across all assets) for a given day, over 1 TB of Parquet data per day, organized by day, asset, and category of events. Presto goes through Hive for the metadata and onwards to HDFS, while Dremio's query went directly against HDFS.
The only thing I see is that Presto was able to read the "map" type we have in our data, while for Dremio some flattening (and complex interpretation) was needed when working with Hive.
There are trade-offs. For Presto to be of any use, it needs Hive and tables. Dremio can go directly to HDFS and that works quite nicely, including reading from "maps" (if you do some simple flattening), but the Hive integration lacks support for the "map" and "list" types (because Apache Drill also lacks it, I believe).
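For readers unfamiliar with the "simple flattening" mentioned here, a hedged sketch follows, issued over ODBC (Dremio ships an ODBC driver; the DSN name, the "hdfs" source name, the dataset path, and the column names are all hypothetical).

```python
# Sketch: flattening a nested column in Dremio SQL via ODBC.
import pyodbc

conn = pyodbc.connect("DSN=Dremio Connector", autocommit=True)  # hypothetical DSN
cursor = conn.cursor()

# FLATTEN expands a list-typed column into one row per element,
# which makes nested Parquet structures usable in ordinary SQL.
cursor.execute("""
    SELECT user_id, FLATTEN(events) AS event
    FROM hdfs.logs."day=2018-01-01"
    LIMIT 10
""")
for row in cursor.fetchall():
    print(row)
```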
Both runs used the exact same query, adapted to the SQL dialect of each engine.
No reflections. 12 nodes, same machines. We're running them in containers, so it's easy to provision and remove clusters on the exact same machines.
It would be interesting to see whether reflections would further enhance performance for you. You could implement a few different schemes for sorting/partitioning/filtering in raw reflections, as well as a few aggregation reflections.
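As a sketch only: reflection DDL syntax may differ by Dremio version, and reflections can also be defined in the UI. The dataset path, reflection names, and columns below are assumptions based on the schema described above.

```python
# Sketch: defining a raw and an aggregation reflection over ODBC.
import pyodbc

conn = pyodbc.connect("DSN=Dremio Connector", autocommit=True)  # hypothetical DSN
cursor = conn.cursor()

# Raw reflection: keep the scanned columns, partitioned by day and sorted
# by asset, so the big scan can prune partitions and read less data.
cursor.execute("""
    ALTER DATASET hdfs.logs.events
    CREATE RAW REFLECTION raw_by_day
    USING DISPLAY ("day", asset, category, user_id)
    PARTITION BY ("day")
    LOCALSORT BY (asset)
""")

# Aggregation reflection: pre-aggregate by the grouping keys of the
# daily-active-users query so it can be answered from the reflection.
cursor.execute("""
    ALTER DATASET hdfs.logs.events
    CREATE AGGREGATE REFLECTION agg_daily_users
    USING DIMENSIONS ("day", asset, category)
    MEASURES (user_id)
""")
```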
If that's something you'd like to brainstorm about, maybe you could share some details about your queries and schema and we could make some suggestions.
When you say Spark-Dremio integration, are you looking for Dremio to write Parquet files so Spark can consume them, or for Dremio to read the Parquet files generated by Spark? Both are possible now. The former can be done using CTAS, while the latter can be achieved by adding S3 or HDFS as a source that stores the Parquet files.
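A minimal sketch of both directions as they work today follows; the HDFS paths, the Dremio source names, and the dataset names are assumptions for illustration.

```python
# Sketch: exchanging Parquet between Spark and Dremio via shared storage.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dremio-spark-handoff").getOrCreate()

# Direction 1: Spark writes Parquet to HDFS; adding that location as an
# HDFS source in Dremio makes it queryable as a physical dataset.
spark.range(1000).withColumnRenamed("id", "user_id") \
     .write.mode("overwrite").parquet("hdfs:///data/spark_output/users")

# Direction 2: Dremio writes Parquet with CTAS (run inside Dremio SQL):
#   CREATE TABLE hdfs_out.users_enriched AS SELECT * FROM myspace.users_vds
# Spark then reads the files Dremio produced:
df = spark.read.parquet("hdfs:///data/dremio_output/users_enriched")
df.show(5)
```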
@kelly mentioned there is "Spark+Dremio integration"; I don't know what that would be. But I would expect Dremio and Spark to work together and look like a single engine from the outside.