Optimized subselect for Mongo

I’m wanting to use Dremio to be able to make and use reflections that help me optimize the underlying queries to Mongo (by using the indexes I have on our operational database).
For example, I’d like to have a table with a reflection for all of our user IDs. I have this set up, and no problem so far.
I would like to use one of the indexes on the Mongo collection that is based on userID. So, I thought that I’d be clever and do something like: `SELECT field1, field2 FROM source.db.collection WHERE userId IN (select userId from Test.“users”) AND type = ‘data_type’", hoping that Dremio would evaluate the subselect first and use Mongo’s “$in” operator, or even better - parallelize the requests to Mongo.
What happens is that Dremio does a full collection scan on the Mongo collection, because it’s only requesting the “type” query to Mongo (it must be doing the subselect matching on the Dremio side).

Is there a different way that I can achieve this in Dremio?

@pazaan

Great question.

Can you please share the profile so we can suggest?

Sure! 9513b997-e7f0-4ef8-add2-037cf88234cc.zip (51.7 KB)

@pazaan

It does look like the FILTER is getting push down to Mongo

MongoScan(table=[[QA2, data, deviceData]], MongoQuery=[use data; db.deviceData.find({“$and”: [{“$and”: [{“type”: {“$ne”: null}}, {“type”: {“$eq”: “upload”}}]}, {“_active”: true}]}, {“type”: 1, “_active”: 1, “uploadId”: 1, “_userId”: 1})])

Query ran for 2m49s out of which 2m30s was spent waiting for Mongo to return records. I see that the plan is in a single phase and hence single threaded (Phase 0 is single threaded). Is 2m49s for the background refresh not acceptable?

Thanks
Bali

That’s passing down the projection for the _userId, but what I had hoped to see is something like:

use data; db.deviceData.find({"$and": [{"$and": [{“type”: {"$ne": null}}, {“type”: {"$eq": “upload”}}]}, {"_active": true}, { "_userId": {"$in": ["userId1", "userId2"]}}]}, {“type”: 1, “_active”: 1, “uploadId”: 1, “_userId”: 1})])

@pazaan

I do not find a filter for userid but only find it for type and active, am I missing something?

Right. Hence my initial comment asking whether there was a way for that filter to come from the subselect statement:
SELECT field1, field2 FROM source.db.collection WHERE userId IN (select userId from Test.“users”) AND type = ‘data_type’

@pazaan

If you run the push down in the way you are suggesting, does the query run faster ? < 3 minutes. If the performance is much better then that would require a code change so wondering what is the performance impact of the current push down?

Thanks
Bali