MongoDB Filtering Bug? (using Dremio 3.0.0, MongoDB 3.6.8)

Hi Dremio Team and wider Community!

I’m running into a weird problem with filtering: I have a collection in MongoDB that contains documents with varying formats. Some look like this:

{
    "_id": "100_20181010",
    "time": {
      "date": "2018-10-10T00:00:00.000Z",
      "offset": 0
    },
  // ....
     "subscribedApplications": [
      "someData", "someOtherData"
    ]
  }

SubscribedApplications might also be empty, i.e. []. Note that “date” has the time part set to 00:00:00.

Most of the documents, however, don’t have the “subscribedApplications” element at all, but look like this:

 {
    "_id": "362",
    "time": {
      "date": "2013-06-22T15:03:14.000Z",
      "offset": 120
    },
    "source": "361",
    "type": "c8y_TemperatureMeasurement",
   //....
  }

So, here we have Date+Timestamp in “date”, while the above version had time set to 00:00:00…
Dremio loads the raw data as follows:


For some reason, the “date” part is always incorrectly interpreted as a Date instead of DateTime (or String), removing the time part completely.

I’m only interested in the elements without “subscribedApplications”, but I need the timestamp, so I tried to filter and cast, like that:

 SELECT "_id", cast(pmdata."time"."date" AS TIMESTAMP) AS "time", source, type, flatten("_seriesValueFragments") AS "_seriesValueFragments"
  FROM C8Y_Edge.management.pmdata
  WHERE subscribedApplications = NULL

I saved that as a virtual dataset (“measurements”) – timestamp is casted properly. Then, I tried to execute a few queries. Things like select *, count(), …, work fine, but filters are completely broken.
A query like

SELECT *
FROM c8y.measurements
WHERE "_id"='362'

returns an empty result set, but it should return exactly one record (same for source=‘361’ and so on).

It gets even worse when I further unnest the JSON. Then, filtering for things like “value” > 10 just don’t work, full result set (including records with value < 10) is returned…

The problem goes away when I remove “WHERE subscribedApplications = NULL” from my virtual dataset definition, but then I cannot cast to TIMESTAMP anymore. Manually excluding the first record type (fortunately, they all have source=100) also fixes the filtering problem…

Any idea what might be wrong here? I attach a small JSON export from the Mongo collection.
raw_data.zip (1.2 KB)
The full virtual dataset DDL:
virtualDataSet.zip (425 Bytes)

Query profile for a “_id”=X query:
id_eq_411.zip (6.8 KB)

Query profile for a “value > X” query:
val_gt_10.zip (7.4 KB)

Thanks, Tim

Try clicking Run instead of Preview. When you click Preview it uses the $sample operator, and my guess is that nothing matches in the sample.

FYI, here’s what is pushed down into mongo:

`db.pmdata.aggregate([{ “$project” : { “_id” : 1, “time” : 1, “source” : 1, “type” : 1, “_seriesValueFragments” : 1, “subscribedApplications” : 1 } }, { “$sample” : { “size” : 1000 } }, { “$match” : { “subscribedApplications” : { “$eq” : null } } }, { “project" : { "expr000" : "_seriesValueFragments”, “expr001” : “$_id”, “expr002” : “$time.date”, “expr003” : “$source”, “expr004” : “$type” } }, { “$unwind” : “$expr000” }, { “$project” : { “_id” : “$expr001”, “time” : “$expr002”, “source” : “$expr003”, “type” : “$expr004”, “_seriesValueFragments” : “$expr000” } }]);

Also, as you get further along, you may want to experiment with Data Reflections to improve performance.`

Hi Kelly,

thanks for the quick reply. The problem is not related to using Preview mode. I usually connect via JDBC to do testing.

For the VirtualDataset with the filter " WHERE subscribedApplications = NULL", reflections don’t work. The behavior is quite strange actually: the Job UI shows that the reflection has been successfully created (in < 1 second), but in the Dataset’s Reflection View, the “in progress” icon keeps animating. Also, the queries (again, via JDBC) don’t use the reflection afterwards.

The screenshot below demonstrates the problem via JDBC connection. The only difference between virtual dataset “measurements” and “measurements_bug” is that the first filters via " WHERE source <> '100" (WORKAROUND) and the latter filters via “WHERE subscribedApplications = NULL”

image

Best, Tim

Or to stick with the very simple example of filtering for a specific “source” (all records in the virtual dataset have source=‘361’), here’s another screenshot:

Which version of Dremio is this?

Hi, @kelly!

I just upgraded from 2.1.X to the latest and greatest 3.0.0 for the PoC I’m doing. If it helps, I can do the same test with 2.1.6 or something.

Best, Tim

Hi @tid

Thanks a lot for uploading the json file. When I added this to a collection in MongoDB and tried to query through Dremio I am getting expected results. See attached screenshot. Would you be able to send me the verbose profile for the below query ?

To turn on verbose profile click on admin-support(on the left side)-On the right side in the support key box enter “planner.verbose.profile” and click show and turn it on

Here is the query

SELECT * FROM localmongo.test.rawdata WHERE subscribedApplications = NULL

Screenshot attached from my UI

Hi, @balaji.ramaswamy!

Thanks for your feedback.

The problem I’m reporting occurs when I build a virtual dataset on top of the imported data/raw data from Mongo. So, if you save
"SELECT * FROM localmongo.test.rawdata WHERE subscribedApplications = NULL"
as a virtual dataset, let’s say, subscribed_null and you then query the new virtual dataset, you’ll see that filtering is broken.

Example:
select * from subscribed_null where source='361'"
will return an empty result (but should return ALL records).

“planner.verbose.profile” seems to be an unknown support key, I cannot enable it.

Thanks, Tim

Hi @tid

Sorry that was a typo, should be “planner.verbose_profile”. Let me create the VDS and steps you have outlined and get back to you

Thanks
@balaji.ramaswamy