Hello Dremio crowd,
we have encountered some seemingly strange behavior from the query planner in regards to queries with joins.
Query 1:
SELECT
COUNT(*)
FROM
“folder1”.“folder2”.“k”
INNER JOIN “folder1”.“folder3”.“ud” ON
k.m = ud.m AND
k.bk = ud.bk AND
k.g = ud.g
– INNER JOIN “folder1”.“folder2”.“mgj” ON
– mgj.m = k.m AND
– mgj.bk = k.bk AND
– mgj.g = k.g AND mgj.m IS NOT NULL AND mgj.bk IS NOT NULL and mgj.g IS NOT NULL
INNER JOIN “folder1”.“folder2”.“b” ON
ud.m = b.m AND
ud.bk = b.bk AND
ud.g = b.g AND
k.id = b.k_id AND
b.k_id IS NOT NULL
Query 2:
SELECT
COUNT(*)
FROM
“folder1”.“folder2”.“k”
INNER JOIN “folder1”.“folder3”.“ud” ON
k.m = ud.m AND
k.bk = ud.bk AND
k.g = ud.g
INNER JOIN “folder1”.“folder2”.“mgj” ON
mgj.m = k.m AND
mgj.bk = k.bk AND
mgj.g = k.g AND mgj.m IS NOT NULL AND mgj.bk IS NOT NULL and mgj.g IS NOT NULL
INNER JOIN “folder1”.“folder2”.“b” ON
ud.m = b.m AND
ud.bk = b.bk AND
ud.g = b.g AND
k.id = b.k_id AND
b.k_id IS NOT NULL
b has around 1bn rows, k has more than 100m rows.
ud only has 14 rows mit 14 different combinations of m/bk/g and mgj has around 1500 such combinations and rows.
The curious thing is that the query without the mgj JOIN scans more than 700m rows and takes a very long time, whereas the one with it only scans 20m and is very quick. This also only happens including the NOT NULL conditions in mgj.
Do we really have to consider joining this table, even if we do not need any information from it?
I did notice that in the planning tab of the raw profile, it says that a cached plan is used for the long query. Could that have something to do with it? If so, how do I clear that cache? I added a random where condition, which removed the cache notification, but also did not make the query faster, so I guess that’s not it.
Unfortunately I cannot share the entire profile.
Best regards