I’m wanting to use Dremio to be able to make and use reflections that help me optimize the underlying queries to Mongo (by using the indexes I have on our operational database).
For example, I’d like to have a table with a reflection for all of our user IDs. I have this set up, and no problem so far.
I would like to use one of the indexes on the Mongo collection that is based on userID. So, I thought that I’d be clever and do something like: `SELECT field1, field2 FROM source.db.collection WHERE userId IN (select userId from Test.“users”) AND type = ‘data_type’", hoping that Dremio would evaluate the subselect first and use Mongo’s “$in” operator, or even better - parallelize the requests to Mongo.
What happens is that Dremio does a full collection scan on the Mongo collection, because it’s only requesting the “type” query to Mongo (it must be doing the subselect matching on the Dremio side).
Is there a different way that I can achieve this in Dremio?
Query ran for 2m49s out of which 2m30s was spent waiting for Mongo to return records. I see that the plan is in a single phase and hence single threaded (Phase 0 is single threaded). Is 2m49s for the background refresh not acceptable?
Right. Hence my initial comment asking whether there was a way for that filter to come from the subselect statement: SELECT field1, field2 FROM source.db.collection WHERE userId IN (select userId from Test.“users”) AND type = ‘data_type’
If you run the push down in the way you are suggesting, does the query run faster ? < 3 minutes. If the performance is much better then that would require a code change so wondering what is the performance impact of the current push down?