I only recently discovered Dremio. I’m impressed with it so far. For some use cases, the tool will work great. However, I am having poor performance on large queries.
When I connect to a large table in MS SQL or when I do a single action on the same table (e.g. drop a column, or filter a column), it takes a very long time. By way of example, I connected to one table that has 396,474 rows and 25 fields. Just loading the table, SQL Management Studio took 58 seconds and Dremio took about 2.5 minutes. Whether I load, drop a column, filter, subset or what have you, Dremio takes twice as long at best.
It’s unlikely the problem is the database server, network connection, or my computer. (Although I am running Windows…enough said).
Does anyone have any tips to make the tool perform faster?
Hi - how are you measuring time for these different operations?
When you say “drop a column” what does that mean in this context? Dremio cannot alter your source data, so I’m confused.
In general queries shouldn’t be any slower through Dremio (at least not significantly so). The caveat is if you’re comparing elapsed time from Dremio’s browser interface the time reported in the query profile includes the time to write the results out to Dremio’s query history (you can navigate to a job and see the results that were returned for any query. these are cleaned up on a schedule of ~30 days), whereas when running the query over ODBC >> Dremio >> SQL Server this time will not be reported.
Also, you can create a Data Reflection on your source table: https://docs.dremio.com/acceleration/reflections.html
This will significantly speed up queries. You can create a reflection on the whole physical table, or on virtual datasets. If most of your queries are BI-style queries (eg, lots of aggregations), then Aggregation Reflections can further improve your performance for these workloads.
@jprins sounds like previews (and potentially Runs) are taking longer than expected in your case causing the overall experience to be slow. Dremio does a preview after each action (drop column, filter, group by, etc.) to validate the query and also to just show you a “preview” the action you just did.
Could you share a query profile for one of the slow preview jobs? Also it would be great if you could “Run” the same query on SQL Management Studio, and then in Dremio, share the times, as well as the query profile for the Dremio run. Here are instructions on how to share query profiles.
When I run the query in either software, it shows “elapsed” time.
In Dremio, you can select a field and perform some operations on it. One of those operations is “Drop”. So it just removes that field from the query. Every time I drop, it re-runs the 2+ minute query.
I can see the time reported in the query history. It is reporting a time of 2+ minutes.
I played with reflections. It dropped the query time by a fair amount. Now it runs only a little longer than SSMS, which is totally reasonable.
Thanks for the suggestion!
Hi @can. What you described is exactly what is happening. The times I did share were the Run times.
I looked at the query profile, and the json output shares information I might not be comfortable sharing. For now, I’ll continue playing with Dremio and the reflections feature.
I might also just change my workflow too. For bigger tables, building out the SQL with Top 1000 at first and changing it later will help at times.
Operations on datasets that have a reflection should be very fast, typically a few seconds or less. Are you saying those queries are taking 58 seconds?
Of course if you’re on Win7 with spinning disks and a busy machine, it might not be ideal. We designed Dremio to run on clusters of servers for production - the windows builds are for non-production workloads.