For the SQL below, it will query table1 (type=‘xxx’) and table2 (all data) in parallel, and has join the data with id. If the table2 has huge data, the performance is much worse.
If we could give some hint to let it run in order (query table1 firstly, and then query table2 with id directly), it would have better performance.
select * from table1, table2
where table1.id = table2.id
Dremio should automatically select the probe (bigger table) and the build side of the join. Are you able to share the query profile?
I did not have profile now. I am building a new plugin to query REST API, which include huge data and accept id to return the related records. Before the join, it has to get all the data.
- How does dremio estimiate the query size? For the new plguin, does it mean it need get all the data, then it could know the size?
- For the custimized plugin (table 2), how to let dremio hold the query and wait for the id from table1?
When you say new plugin, what is the plugin type? are both plugins written by you or they standard connectors provided by Dremio?
Yes, we wrote both of the plugins which are based on REST API.
Not sure I can help on this as I do not know what plugin this is
As the general question, how does dremio estimiate the query size?
Dremio has different rules based on the file format, estimates would be better on a Parquet format for example