Optimized subselect for Mongo

pazaan · November 16, 2020, 4:04am

I’m wanting to use Dremio to be able to make and use reflections that help me optimize the underlying queries to Mongo (by using the indexes I have on our operational database).
For example, I’d like to have a table with a reflection for all of our user IDs. I have this set up, and no problem so far.
I would like to use one of the indexes on the Mongo collection that is based on userID. So, I thought that I’d be clever and do something like: `SELECT field1, field2 FROM source.db.collection WHERE userId IN (select userId from Test.“users”) AND type = ‘data_type’", hoping that Dremio would evaluate the subselect first and use Mongo’s “$in” operator, or even better - parallelize the requests to Mongo.
What happens is that Dremio does a full collection scan on the Mongo collection, because it’s only requesting the “type” query to Mongo (it must be doing the subselect matching on the Dremio side).

Is there a different way that I can achieve this in Dremio?

balaji.ramaswamy · November 17, 2020, 4:04am

@pazaan

Great question.

Can you please share the profile so we can suggest?

pazaan · November 18, 2020, 2:55am

Sure! 9513b997-e7f0-4ef8-add2-037cf88234cc.zip (51.7 KB)

balaji.ramaswamy · November 18, 2020, 5:26am

@pazaan

It does look like the FILTER is getting push down to Mongo

MongoScan(table=[[QA2, data, deviceData]], MongoQuery=[use data; db.deviceData.find({“$and”: [{“$and”: [{“type”: {“$ne”: null}}, {“type”: {“$eq”: “upload”}}]}, {“_active”: true}]}, {“type”: 1, “_active”: 1, “uploadId”: 1, “_userId”: 1})])

Query ran for 2m49s out of which 2m30s was spent waiting for Mongo to return records. I see that the plan is in a single phase and hence single threaded (Phase 0 is single threaded). Is 2m49s for the background refresh not acceptable?

Thanks
Bali

pazaan · November 24, 2020, 5:32am

That’s passing down the projection for the _userId, but what I had hoped to see is something like:

use data; db.deviceData.find({"$and": [{"$and": [{“type”: {"$ne": null}}, {“type”: {"$eq": “upload”}}]}, {"_active": true}, { "_userId": {"$in": ["userId1", "userId2"]}}]}, {“type”: 1, “_active”: 1, “uploadId”: 1, “_userId”: 1})])

balaji.ramaswamy · November 24, 2020, 8:00am

@pazaan

I do not find a filter for userid but only find it for type and active, am I missing something?

pazaan · November 26, 2020, 12:32am

Right. Hence my initial comment asking whether there was a way for that filter to come from the subselect statement:
SELECT field1, field2 FROM source.db.collection WHERE userId IN (select userId from Test.“users”) AND type = ‘data_type’

balaji.ramaswamy · November 26, 2020, 6:17am

@pazaan

If you run the push down in the way you are suggesting, does the query run faster ? < 3 minutes. If the performance is much better then that would require a code change so wondering what is the performance impact of the current push down?

Thanks
Bali

Topic		Replies	Views
Incorrect Query optimization for subselect	3	689	June 6, 2023
Dremio Reflection Incremental Refresh	12	3209	July 29, 2019
Reflection used only when sql splitted in subqueries	12	2139	November 9, 2018
Is there a guide for creating Reflections for a specific query?	1	1039	November 27, 2018
Dremio SELECT * on joins	2	5228	November 6, 2018

Optimized subselect for Mongo

Related topics