Difference between Dremio vs Presto

Hi Folks,

Can anyone describe to me how Dremio is different from Presto? I have just started following Dremio, and after going through some videos I found some similarities with Presto. Any thoughts on this would be appreciated.


Presto is a distributed SQL engine. Dremio is a lot more than that. You could think of it as a “Data-as-a-Service Platform” that sits between all your data and the tools that people want to use to analyze it (Tableau, Qlik Sense, Power BI, R, Jupyter, etc.) Traditionally, companies have had to use a combination of 5-10 different tools, and a lot of custom development, to make data available for analytics. That includes data warehouses, ETL, OLAP cubes, aggregation tables, data extracts, etc. In addition to the obvious cost and complexity, this made self-service impossible. Dremio basically collapses/simplifies the entire analytics stack.

Here are a few examples of features that Dremio has that SQL engines like Presto do not:

  • Acceleration. Dremio’s data reflections enable interactive speed queries (both BI and ad-hoc queries) on any data volume. For example, you can achieve sub-second response time when running BI queries on a PB-scale dataset. Dremio includes a cost-based optimizer that rewrites query plans internally to utilize one or more reflections, often reducing query execution time by 1000x or more.
  • Data curation. Dremio includes a visual interface that’s similar to Google Docs (with virtual datasets instead of documents). Non-technical users can create new virtual datasets through the visual interface, and share them with their colleagues.
  • Advanced query push-downs. Dremio has deep integration with non-relational databases like Elasticsearch and MongoDB. It can translate a broad range of queries (i.e., relational algebra) into the query language of the underlying source. This includes aggregations, projections, search and more.
  • Data catalog and lineage. Dremio indexes the metadata of all physical and virtual datasets so that users can find what they need. It can also outline what data users are accessing, and how different datasets are related to each other.
  • Raw execution speed. Dremio is built on Apache Arrow, so execution is columnar in memory. That translates into high performance even without query acceleration (i.e., data reflections). It will also enable Dremio to leverage GPUs, and integrate more tightly with data science tools like Python and R.

I hope that helps. Those are just a few high-level differences.


Regarding the last point, raw execution speed: Presto also does columnar processing, using an in-memory representation that is similar to Arrow.

Without a shared standard like Arrow, data must be serialized before being handed off to another process. With Arrow, that step is obviated. That's a big benefit: in many cases serialization accounts for 60-80% of CPU time, plus unnecessary copies.

There’s a lot more to Arrow than the columnar format. It now also includes Arrow kernels, which are highly optimized low-level operators that can be hardware-accelerated (e.g., via SIMD).

Also, see this important announcement about the Gandiva Initiative, which brings LLVM JIT compilation and other benefits to Arrow: https://www.dremio.com/announcing-gandiva-initiative-for-apache-arrow/

Very well explained.