Does anyone have benchmark numbers comparing Dremio and Athena? We are evaluating both for our business use cases.
For a few of our queries, with data volumes ranging from 10 GB to 500+ GB, we didn't find much difference in performance.
Both Dremio and Athena performed equally well for our queries and workloads.
After reading various online blogs and posts, I was under the impression that Dremio outperforms Athena/Presto in both performance and cost. But we didn't see a difference that large.
Does anyone have benchmarks that back up these claims?
Or do we need to tune our Dremio cluster to get those optimised results?
I think you might wanna create your own benchmarks and contribute to the community by sharing the results you find. Would be awesome to have this kind of information from a real use case.
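If it helps as a starting point, here is a minimal sketch of a timing harness you could point at both engines. It assumes you supply your own `run_query` callable (a hypothetical name) that submits the SQL to Dremio or Athena and blocks until the result is back; nothing here is specific to either product's client library.

```python
import statistics
import time
from typing import Callable, Dict


def time_query(
    run_query: Callable[[str], object],  # hypothetical: your own client wrapper
    sql: str,
    warmups: int = 1,
    runs: int = 5,
) -> Dict[str, float]:
    """Run `sql` a few times and summarise wall-clock latency in seconds."""
    # Warm-up runs are executed but not timed, so one-off startup costs
    # do not skew the measured samples.
    for _ in range(warmups):
        run_query(sql)

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query(sql)
        samples.append(time.perf_counter() - start)

    return {
        "min_s": min(samples),
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }
```

Running the same query list through `time_query` once per engine, on the same data, gives you numbers you can put side by side. Reporting the median rather than a single run makes the comparison less sensitive to one slow outlier.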
You should not compare a 2-node Dremio engine to AWS Athena if you want to see real performance results. AWS Athena is serverless, and no one actually knows how many nodes it is running under the hood.
We did run TPC-DS benchmarks against Athena, and as far as I can tell the performance is similar to PrestoDB with 8 worker nodes. The fact that you see equal performance with a 2-node Dremio cluster tells a lot. I would recommend using at least 4 nodes to see the real difference in performance. At the same 8-node count you may see up to 6x the performance of Athena with Dremio.
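Once you have per-query timings from both engines at a comparable node count, one way to summarise them is a per-query speedup plus a geometric mean. This is only a sketch: the function names and the shape of the input (a dict of query name to seconds) are my own assumptions, not anything Dremio or Athena produce for you.

```python
import math
from typing import Dict, Iterable


def speedups(
    baseline_s: Dict[str, float], candidate_s: Dict[str, float]
) -> Dict[str, float]:
    """Per-query speedup of candidate over baseline.

    A value above 1 means the candidate engine was faster on that query.
    Only queries present in both inputs are compared.
    """
    shared = baseline_s.keys() & candidate_s.keys()
    return {q: baseline_s[q] / candidate_s[q] for q in sorted(shared)}


def geometric_mean(values: Iterable[float]) -> float:
    """Geometric mean, which treats a 2x win and a 2x loss symmetrically."""
    vals = list(values)
    return math.exp(sum(math.log(v) for v in vals) / len(vals))
```

With the Athena timings as `baseline_s` and the Dremio timings as `candidate_s`, `geometric_mean(speedups(...).values())` gives a single ratio for the whole query set, while the per-query dict shows which queries drive it.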
While a little off-topic, I can highly recommend the paper from Tan (and Michael Stonebraker!) et al., VLDB 2019: http://www.vldb.org/pvldb/vol12/p2170-tan.pdf
They benchmarked Vertica, Presto, Athena, Redshift (+Spectrum), and Hive using TPC-H (not TPC-DS, which would have made your comparison easier).
Thanks for the detailed info. Yeah, you are right, it's hard to know how many nodes Athena runs in the backend, and it's a good idea to increase the number of nodes in Dremio.
Will try these approaches and see what the numbers look like.