I just started using Dremio and am trying to build a testing module on top of it to measure its performance. I have two questions:
- Does Dremio log its performance data for each query anywhere that can be queried?
- Is Dremio known to work with any automation tool that can mimic user querying actions? (I suspect a normal web UI testing tool might work in this scenario, since Dremio is spun up as a web service, but please suggest a better option if you have one.)
Thank you and hope I can learn a lot from this community!
Regarding performance data, you can follow this guide - once you've identified the job, the right-hand section has a Profile tab that has metrics.
For automation you could use JDBC/ODBC. Our next release will include a REST API for running SQL queries.
Thank you for the reply. While reading through the performance guide, I realized it's probably describing the same data that's stored in queries.json, based on this documentation on Dremio logging:
My thinking is to query this log and pipe the data to Power BI for visualization. Do you think this is doable?
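If the log is newline-delimited JSON (one record per query), a minimal parsing sketch could look like the following. The field names `queryId`, `queryText`, `start`, and `finish` (epoch milliseconds) are assumptions about the queries.json schema, so adjust them to whatever the documentation actually lists:

```python
import json

def load_query_metrics(path):
    """Parse a newline-delimited JSON query log into duration records.

    Assumes each line holds one query record with epoch-millisecond
    'start'/'finish' timestamps -- field names are assumptions and
    should be checked against the real queries.json schema.
    """
    rows = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            rec = json.loads(line)
            rows.append({
                "queryId": rec.get("queryId"),
                "queryText": rec.get("queryText"),
                "duration_ms": rec.get("finish", 0) - rec.get("start", 0),
            })
    return rows
```

The resulting rows could then be dumped to CSV (e.g. with the `csv` module) and imported into Power BI as a flat table.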
For the automation piece, based on the JDBC/ODBC documentation, it looks like it covers integration with Power BI and Tableau. However, I'm looking for an automated solution that mimics exact user activity in Dremio, such as navigating, entering a query, and hitting execute. Has anyone done this before?
Yes, I would imagine using the REST API to kick off query execution would be ideal for this scenario. Thank you for the suggestion!
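Since the SQL REST API hadn't shipped yet at this point in the thread, any call shape is speculative. A sketch of how a test harness might assemble such a request, with the endpoint path and auth-header format both labeled as assumptions to be verified against the eventual API docs:

```python
def build_sql_request(host, token, sql):
    """Assemble the pieces of a hypothetical 'run SQL over REST' call.

    The '/api/v3/sql' path and '_dremio<token>' auth scheme are
    assumptions, not confirmed API details -- verify against the
    Dremio REST API documentation once the release is out.
    """
    url = f"{host.rstrip('/')}/api/v3/sql"       # assumed endpoint
    headers = {
        "Authorization": f"_dremio{token}",       # assumed auth scheme
        "Content-Type": "application/json",
    }
    body = {"sql": sql}
    return url, headers, body

# The tuple can then be handed to any HTTP client, e.g.:
# requests.post(url, headers=headers, json=body)
```

Keeping request construction separate from the HTTP client makes the harness easy to unit-test without a live Dremio cluster.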
We expect most users will issue queries over ODBC/JDBC/REST to run their analytics. We do not consider Dremio a visualization/reporting tool, so you could test the performance of the user actions you describe, but Dremio's performance benefits are best experienced from visualization and reporting tools or other external processes.
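For benchmarking over ODBC/JDBC, one approach is a small timing harness that is agnostic about the connection mechanism. Below, `run_fn` is any callable that executes SQL end to end; the commented pyodbc wiring (including the `DSN=Dremio` name) is an illustrative assumption about the deployment, not a confirmed setup:

```python
import statistics
import time

def time_query(run_fn, sql, repeats=3):
    """Measure wall-clock latency of a query runner.

    run_fn is any callable taking a SQL string and executing it end
    to end, e.g. a thin pyodbc or JDBC wrapper.
    """
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        run_fn(sql)
        samples.append(time.perf_counter() - t0)
    return {
        "min_s": min(samples),
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }

# Example wiring with pyodbc (the DSN name is an assumption):
# import pyodbc
# conn = pyodbc.connect("DSN=Dremio", autocommit=True)
# def run(sql):
#     conn.cursor().execute(sql).fetchall()
# print(time_query(run, "SELECT 1"))
```

Because the harness only sees a callable, the same code can later time the REST path as well, which helps compare the two access methods like-for-like.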
Keep in mind the difference between Preview and Run - by default, a query entered through the UI first runs in preview mode. This keeps the query fast, but it does not return all results.
Regarding sampling - is there any way to change the sample set so that it will show up in preview mode?
For example - when joining two data sets with an inner join, I'll often get no data back in preview mode. When I run the query, however, data comes back. Can we change how it samples so that the sample includes data that will be returned by the joins?
In addition, it always defaults to 'preview' mode on any change to the data set ('Keep Only', 'Exclude', etc.). Is there any way to keep it in 'Run' mode so I wouldn't have to run the data set between each transformation?
Appreciate the insight!
Hey @mathew.lee, thanks for the questions/details! One thing we’re actively looking into right now is making Previews/Runs be non-blocking so that users can keep working on the SQL or continue with transformations without having to wait on a result set.
Some things we’re considering going forward (without specific timelines):
- Consider whether/how “modes” make sense (no preview, preview, run). Having a persistent Run mode is a bit concerning, as this could, in some cases, take over cluster resources very easily.
- Consider approaches to reliably show previews even when users have sparse filters, joins, etc. that don’t have “matches” in our current sampling approach.
- Smarter preview logic via better caching and by considering a larger set of variables in combination: source latency, average record size, record counts, current load, and others.