Difference in time between connecting through Dremio and connecting directly


As a bit of background, I have an app that needs to pull massive amounts of data from different databases, and I’m planning to use Dremio to streamline connecting to all of them. The pulls are large (1.5 million rows at a time or more), so timing is of utmost importance.

Currently, when I connect to my Postgres db directly through sparksql (the connector that must be used since I’m planning to do computations on the data sets), it is about 30% faster than when I connect to the same Postgres db through sparksql and Dremio using jdbc (9 sec user time vs 12 sec user time). I’ve left the configuration options at their defaults; I have looked at them and they look to be very good for my data. Is a 30% slowdown something that should be expected when connecting through Dremio? Or should the configuration options be changed to support faster queries?

[All numbers above are for querying 1.5 million rows in a PostgreSQL database]
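For reference, the two timings quoted above give different percentages depending on which one is used as the baseline, and "30%" sits between the two. A minimal sketch of the arithmetic, assuming the 9 sec figure is the direct connection and the 12 sec figure is the path through Dremio (the function names are illustrative only):

```python
def slowdown_pct(direct_s: float, via_s: float) -> float:
    """Extra time taken by the indirect path, relative to the direct one."""
    return (via_s - direct_s) / direct_s * 100.0


def speedup_pct(direct_s: float, via_s: float) -> float:
    """Time saved by the direct path, relative to the indirect one."""
    return (via_s - direct_s) / via_s * 100.0


direct_s = 9.0   # assumed: sparksql -> Postgres
dremio_s = 12.0  # assumed: sparksql -> Dremio -> Postgres

print(f"Through Dremio is {slowdown_pct(direct_s, dremio_s):.1f}% slower")  # 33.3% slower
print(f"Direct is {speedup_pct(direct_s, dremio_s):.1f}% faster")           # 25.0% faster
```

So the same measurements read as roughly a 33% slowdown through Dremio, or a 25% speedup when connecting directly.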


Putting Dremio in between sparksql and Postgres does introduce an extra hop, and this takes some time. It’s hard to say exactly how much slowdown to expect. You can get some idea of what is taking time by looking at the job profile, which you can find on the jobs page of the Dremio UI. If you want, you can download it and share it here.
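The job profile covers the time spent inside Dremio. To compare both paths end to end from the Spark side, something like the following could be used. This is only a sketch under stated assumptions: the host names, table names, and helper functions are placeholders, the Dremio JDBC port is assumed to be the default 31010, both JDBC driver jars are assumed to be on the Spark classpath, and the `noop` sink requires Spark 3.0 or later.

```python
import time
from typing import Callable, Dict, Tuple


def jdbc_options(url: str, driver: str, table: str,
                 user: str, password: str) -> Dict[str, str]:
    """Build the option map for Spark's built-in JDBC data source."""
    return {"url": url, "driver": driver, "dbtable": table,
            "user": user, "password": password}


def timed(action: Callable[[], object]) -> Tuple[object, float]:
    """Run `action` once; return (its result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = action()
    return result, time.perf_counter() - start


def full_read(spark, options: Dict[str, str]) -> None:
    """Read the whole table over JDBC and discard the rows.

    Writing to the "noop" sink forces every row to be fetched,
    without the cost of collecting them on the driver.
    """
    df = spark.read.format("jdbc").options(**options).load()
    df.write.format("noop").mode("overwrite").save()


def compare(spark, user: str, password: str) -> Dict[str, float]:
    """Time the same pull directly and through Dremio (placeholder endpoints)."""
    direct = jdbc_options("jdbc:postgresql://pg-host:5432/mydb",   # hypothetical host/db
                          "org.postgresql.Driver",
                          "public.my_table", user, password)       # hypothetical table
    via_dremio = jdbc_options("jdbc:dremio:direct=dremio-host:31010",  # hypothetical host
                              "com.dremio.jdbc.Driver",
                              '"pg_source"."public"."my_table"',   # hypothetical source name
                              user, password)
    _, direct_s = timed(lambda: full_read(spark, direct))
    _, dremio_s = timed(lambda: full_read(spark, via_dremio))
    return {"direct_s": direct_s, "dremio_s": dremio_s}
```

Calling `compare(spark, user, password)` with an existing `SparkSession` returns the two wall-clock timings. Running it a few times and discarding the first run helps separate connection warm-up from the steady-state difference.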

You can speed up any subsequent queries on the same dataset by enabling reflections. Reflections effectively cache the dataset (raw or aggregated) and serve queries from the cache.
More info here: https://docs.dremio.com/acceleration/

It makes sense this will be slower as you are effectively moving the same data twice over the network (from Postgres to Dremio, then from Dremio to Spark).