Unable to Join by ObjectId in MongoDB

I have consistently been getting errors when trying to join two collections from MongoDB by ObjectId. Is this a known issue?

Below is the stack trace/log. Is there an issue tracker where one can see what other known issues (and/or workarounds) there are?

SYSTEM ERROR: IllegalArgumentException: can not truncate buffer to a larger size 4: 0

SqlOperatorImpl UNORDERED_RECEIVER
Location 3:0:1
Fragment 3:0

[Error Id: a51dbba3-27be-4021-8999-91bfef6d2365 on dremio.lxd:31010]

  (java.lang.IllegalArgumentException) Could not load buffers for field $data$: List<$data$: List<$data$: FloatingPoint(DOUBLE)>>. error message: can not truncate buffer to a larger size 4: 0
    com.dremio.exec.record.ArrowRecordBatchLoader.loadBuffers():177
    com.dremio.exec.record.ArrowRecordBatchLoader.loadBuffers():187
    com.dremio.exec.record.ArrowRecordBatchLoader.loadBuffers():187
    com.dremio.exec.record.ArrowRecordBatchLoader.loadBuffers():187
    com.dremio.exec.record.ArrowRecordBatchLoader.loadBuffers():187
    com.dremio.exec.record.ArrowRecordBatchLoader.loadBuffers():187
    com.dremio.exec.record.ArrowRecordBatchLoader.loadBuffers():187
    com.dremio.exec.record.ArrowRecordBatchLoader.loadBuffers():187
    com.dremio.exec.record.ArrowRecordBatchLoader.loadBuffers():187
    com.dremio.exec.record.ArrowRecordBatchLoader.load():155
    com.dremio.exec.record.ArrowRecordBatchLoader.load():97
    com.dremio.sabot.op.receiver.unordered.UnorderedReceiverOperator.outputData():101
    com.dremio.sabot.driver.SmartOp$SmartProducer.outputData():493
    com.dremio.sabot.driver.StraightPipe.pump():56
    com.dremio.sabot.driver.Pipeline.doPump():82
    com.dremio.sabot.driver.Pipeline.pumpOnce():72
    com.dremio.sabot.exec.fragment.FragmentExecutor$DoAsPumper.run():279
    com.dremio.sabot.exec.fragment.FragmentExecutor$DoAsPumper.run():275
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1807
    com.dremio.sabot.exec.fragment.FragmentExecutor.run():238
    com.dremio.sabot.exec.fragment.FragmentExecutor.access$800():81
    com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl.run():567
    com.dremio.sabot.task.AsyncTaskWrapper.run():91
    com.dremio.sabot.task.slicing.SlicingThread.run():71
  Caused By (java.lang.IllegalArgumentException) can not truncate buffer to a larger size 4: 0
    org.apache.arrow.vector.BaseDataValueVector.truncateBufferBasedOnSize():51
    org.apache.arrow.vector.complex.ListVector.loadFieldBuffers():117
    com.dremio.exec.record.ArrowRecordBatchLoader.loadBuffers():175
    com.dremio.exec.record.ArrowRecordBatchLoader.loadBuffers():187
    com.dremio.exec.record.ArrowRecordBatchLoader.loadBuffers():187
    com.dremio.exec.record.ArrowRecordBatchLoader.loadBuffers():187
    com.dremio.exec.record.ArrowRecordBatchLoader.loadBuffers():187
    com.dremio.exec.record.ArrowRecordBatchLoader.loadBuffers():187
    com.dremio.exec.record.ArrowRecordBatchLoader.loadBuffers():187
    com.dremio.exec.record.ArrowRecordBatchLoader.loadBuffers():187
    com.dremio.exec.record.ArrowRecordBatchLoader.loadBuffers():187
    com.dremio.exec.record.ArrowRecordBatchLoader.load():155
    com.dremio.exec.record.ArrowRecordBatchLoader.load():97
    com.dremio.sabot.op.receiver.unordered.UnorderedReceiverOperator.outputData():101
    com.dremio.sabot.driver.SmartOp$SmartProducer.outputData():493
    com.dremio.sabot.driver.StraightPipe.pump():56
    com.dremio.sabot.driver.Pipeline.doPump():82
    com.dremio.sabot.driver.Pipeline.pumpOnce():72
    com.dremio.sabot.exec.fragment.FragmentExecutor$DoAsPumper.run():279
    com.dremio.sabot.exec.fragment.FragmentExecutor$DoAsPumper.run():275
    java.security.AccessController.doPrivileged():-2
    javax.security.auth.Subject.doAs():422
    org.apache.hadoop.security.UserGroupInformation.doAs():1807
    com.dremio.sabot.exec.fragment.FragmentExecutor.run():238
    com.dremio.sabot.exec.fragment.FragmentExecutor.access$800():81
    com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl.run():567
    com.dremio.sabot.task.AsyncTaskWrapper.run():91
    com.dremio.sabot.task.slicing.SlicingThread.run():71

Hi,

Could you share with us a profile (see this tutorial) for one of your failing queries so we can investigate the issue further?

Thanks,
Doron

I just tried this with the Yelp dataset by performing a dump/restore of the business collection to make a new business_copy collection. Then I wrote the following, with an inner join on _id, with filters on an array as well as a Boolean:

SELECT business_id, full_address, hours, nested_1."open" AS "open", categories, city, review_count, name, neighborhoods, longitude, state, stars, latitude, attributes, type, "_id", "_id0", business_id0, full_address0, hours0, open0, categories0, city0, review_count0, name0, neighborhoods0, longitude0, state0, stars0, latitude0, attributes0, type0
FROM (
  SELECT nested_0.business_id AS business_id, nested_0.full_address AS full_address, nested_0.hours AS hours, nested_0."open" AS "open", flatten(nested_0.categories) AS categories, nested_0.city AS city, nested_0.review_count AS review_count, nested_0.name AS name, nested_0.neighborhoods AS neighborhoods, nested_0.longitude AS longitude, nested_0.state AS state, nested_0.stars AS stars, nested_0.latitude AS latitude, nested_0.attributes AS attributes, nested_0.type AS type, nested_0."_id" AS "_id", join_business_copy."_id" AS "_id0", join_business_copy.business_id AS business_id0, join_business_copy.full_address AS full_address0, join_business_copy.hours AS hours0, join_business_copy."open" AS open0, join_business_copy.categories AS categories0, join_business_copy.city AS city0, join_business_copy.review_count AS review_count0, join_business_copy.name AS name0, join_business_copy.neighborhoods AS neighborhoods0, join_business_copy.longitude AS longitude0, join_business_copy.state AS state0, join_business_copy.stars AS stars0, join_business_copy.latitude AS latitude0, join_business_copy.attributes AS attributes0, join_business_copy.type AS type0
  FROM (
    SELECT "_id", business_id, full_address, hours, business."open" AS "open", categories, city, review_count, name, neighborhoods, longitude, state, stars, latitude, attributes, type
    FROM mongo2.yelp.business
  ) nested_0
   INNER JOIN mongo2.yelp.business_copy AS join_business_copy ON nested_0."_id" = join_business_copy."_id"
   WHERE nested_0."open" = true
) nested_1
 WHERE categories IN ('Shopping', 'Restaurants', 'Event Planning & Services')

Dremio pushed two queries down to mongo, one for each collection, then performed the join in memory. For the collection with predicates on the array and one of the Boolean columns, the query pushed down was:

use yelp; 
db.business.aggregate([{ "$match" : { "open" : { "$eq" : true } } }, { "$project" : { "expr000" : "$categories", "expr001" : "$_id", "expr002" : "$business_id", "expr003" : "$full_address", "expr004" : "$hours", "expr005" : "$city", "expr006" : "$review_count", "expr007" : "$name", "expr008" : "$neighborhoods", "expr009" : "$longitude", "expr010" : "$state", "expr011" : "$stars", "expr012" : "$latitude", "expr013" : "$attributes", "expr014" : "$type" } }, { "$unwind" : "$expr000" }, { "$match" : { "$or" : [{ "expr000" : { "$eq" : "Shopping" } }, { "expr000" : { "$eq" : "Restaurants" } }, { "expr000" : { "$eq" : "Event Planning & Services" } }] } }, { "$project" : { "_id" : "$expr001", "business_id" : "$expr002", "full_address" : "$expr003", "hours" : "$expr004", "city" : "$expr005", "review_count" : "$expr006", "name" : "$expr007", "neighborhoods" : "$expr008", "longitude" : "$expr009", "state" : "$expr010", "stars" : "$expr011", "latitude" : "$expr012", "attributes" : "$expr013", "type" : "$expr014", "FLATTEN" : "$expr000" } }])

The other side of the join was:

use yelp; 
db.business_copy.find({}, { "_id" : 1, "business_id" : 1, "full_address" : 1, "hours" : 1, "open" : 1, "categories" : 1, "city" : 1, "review_count" : 1, "name" : 1, "neighborhoods" : 1, "longitude" : 1, "state" : 1, "stars" : 1, "latitude" : 1, "attributes" : 1, "type" : 1 })

This worked fine on 3.2.16. What version are you on?

Looking at the stack trace you posted, it seems to me that your query is exposing some bug, and it may not actually be related to the mongo connector specifically. Posting the profile like Doron mentioned could help shed more light on the issue for us.

Hi @doron, my apologies for the late reply. The profile can be found at https://1drv.ms/u/s!Ai6VQKxJEcVN703R9V7py9FmSK7q.

@kelly I am on version 3.4.6, I don’t know how the Yelp dataset looks like (specifically if _id is an ObjectId), but I can’t join wherever a field is an ObjectId.

Thanks

@steven Any update on this? Is there a way to convert Mongo’s ids to strings in Dremio?