Is there any way to control the query order for a data join?

popejune March 16, 2021, 9:34am 1

For the SQL below, it queries table1 (type='xxx') and table2 (all data) in parallel, and then joins the data on id. If table2 holds a huge amount of data, performance is much worse.

If we could give a hint to make it run in order (query table1 first, and then query table2 by id directly), it would perform better.

select * from table1, table2
where table1.id = table2.id
and table1.type='xxx'
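
The ordering described above (filter table1 first, then ask table2 only for the matching ids) can be sketched outside Dremio as a keyed lookup join. This is only an illustration of the idea, not Dremio's API or planner behaviour: `lookup_join`, `fetch_table2_by_id`, and the in-memory `TABLE1` / `TABLE2` data are hypothetical stand-ins for the two REST-backed tables.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


def lookup_join(
    table1_rows: Iterable[Row],
    fetch_table2_by_id: Callable[[object], List[Row]],
    type_value: str,
) -> List[Row]:
    """Filter table1 first, then fetch only the matching table2 rows by id."""
    joined: List[Row] = []
    for left in table1_rows:
        if left["type"] != type_value:
            continue  # table2 is never asked about rows that fail the filter
        for right in fetch_table2_by_id(left["id"]):
            merged = dict(left)
            # Prefix table2 columns so they do not overwrite table1 columns.
            merged.update({f"t2_{k}": v for k, v in right.items()})
            joined.append(merged)
    return joined


# Hypothetical stand-ins for the two REST-backed tables.
TABLE1 = [
    {"id": 1, "type": "xxx"},
    {"id": 2, "type": "yyy"},
    {"id": 3, "type": "xxx"},
]
TABLE2 = {
    1: [{"id": 1, "value": "a"}],
    2: [{"id": 2, "value": "b"}],
    3: [{"id": 3, "value": "c"}],
}

calls: List[object] = []  # records which ids were actually requested


def fetch_table2_by_id(row_id: object) -> List[Row]:
    """Pretend REST call: return the table2 records for one id."""
    calls.append(row_id)
    return TABLE2.get(row_id, [])


result = lookup_join(TABLE1, fetch_table2_by_id, "xxx")
```

In this sketch table2 is only asked for ids 1 and 3, so its cost scales with the number of filtered table1 rows rather than with the full size of table2.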

balaji.ramaswamy March 17, 2021, 5:38am 2

Dremio should automatically select the probe (bigger table) and the build side of the join. Are you able to share the query profile?
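
For readers unfamiliar with the build/probe terminology, here is a minimal sketch of a generic in-memory hash join that builds its hash table on the smaller input and probes it with the larger one. It is not Dremio's implementation, and the `hash_join` function and sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

Row = Dict[str, object]


def hash_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Equi-join two row lists on `key`, building on the smaller input."""
    # Choose the smaller input as the build side, the larger as the probe side.
    build, probe = (left, right) if len(left) <= len(right) else (right, left)

    # Build phase: index the smaller input by join key.
    table = defaultdict(list)
    for row in build:
        table[row[key]].append(row)

    # Probe phase: stream the larger input and look up matches.
    joined: List[Row] = []
    for row in probe:
        for match in table.get(row[key], []):
            joined.append({**match, **row})
    return joined


small = [{"id": 1, "type": "xxx"}]
big = [
    {"id": 1, "value": "a"},
    {"id": 2, "value": "b"},
    {"id": 1, "value": "c"},
]
rows = hash_join(small, big, "id")
```

Note that the probe side is still read in full in this scheme, which is why it does not by itself avoid fetching all of a large table.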

popejune March 18, 2021, 3:15am 3

Hi @balaji.ramaswamy,

I do not have a profile right now. I am building a new plugin to query a REST API, which holds a huge amount of data and accepts an id to return the related records. Before the join, it has to get all the data.

How does Dremio estimate the query size? For the new plugin, does that mean it needs to get all the data before it can know the size?
For the customized plugin (table2), how can we make Dremio hold the query and wait for the ids from table1?

balaji.ramaswamy March 18, 2021, 5:47am 4

When you say new plugin, what is the plugin type? Are both plugins written by you, or are they standard connectors provided by Dremio?

popejune March 19, 2021, 1:35am 5

Hi @balaji.ramaswamy

Yes, we wrote both of the plugins, which are based on REST APIs.

balaji.ramaswamy March 19, 2021, 6:55am 6

I am not sure I can help with this, as I do not know what plugin this is.

popejune March 19, 2021, 7:05am 7

As a general question, how does Dremio estimate the query size?

balaji.ramaswamy March 19, 2021, 7:56am 8

Dremio has different rules based on the file format; estimates would be better on Parquet, for example.
