Bug Report: Schema Discovery Problem with nested documents (Dremio 3.0.0 and MongoDB 3.6.8)

Hi Dremio Team,

I’m facing an issue with schema discovery (Preview works, Run fails) when querying nested JSON documents in MongoDB. The case is easy to reproduce. I add three JSON documents to a MongoDB collection:

[
  {
    "o": {
      "text": "a",
      "num": 42
    }
  },
  {
    "o": {
      "text": "b",
      "num": 43
    }
  },
  {
    "o": {
      "text": "c",
      "num": 44
    }
  }
]
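
For anyone who wants to recreate the collection, here is a minimal Python sketch that writes the same three documents to a file as a single JSON array. The file name `concat.json` and the helper name `write_sample` are illustrative assumptions, not part of the original setup. A JSON array file of this shape can be loaded with `mongoimport` using its `--jsonArray` option.

```python
import json

# The three sample documents from the report above.
docs = [
    {"o": {"text": "a", "num": 42}},
    {"o": {"text": "b", "num": 43}},
    {"o": {"text": "c", "num": 44}},
]


def write_sample(path="concat.json"):
    """Write the documents as one JSON array and return the path.

    The default file name is an assumption made for this sketch.
    """
    with open(path, "w", encoding="utf-8") as f:
        json.dump(docs, f, indent=2)
    return path


if __name__ == "__main__":
    print(write_sample())
```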

I then execute the following query in PREVIEW mode:

SELECT CONCAT("coll"."o"."text", 'myString')
FROM mysource.mydb.coll

This runs fine – it returns all three “text” values, each with “myString” appended.
When I switch to “RUN” or execute via JDBC, I get:

Error: SCHEMA_CHANGE ERROR: New schema found and recorded. Please reattempt the query. Multiple attempts may be necessary to fully learn the schema.

Original Schema schema(o::struct<text::varchar>)
New Schema schema(o::struct<text::varchar, num::int32>)
SqlOperatorImpl MONGO_SUB_SCAN
Location 0:0:2
SqlOperatorImpl MONGO_SUB_SCAN
Location 0:0:2
Fragment 0:0

[Error Id: a37fca37-5532-462c-8f5e-d65d6e00a67c on MYMACHINE:31010]

  (org.apache.arrow.vector.util.SchemaChangeRuntimeException) Schema change error
    com.dremio.common.exceptions.UserException.schemaChangeError():88
    com.dremio.sabot.op.scan.ScanOperator.checkAndLearnSchema():261
    com.dremio.sabot.op.scan.ScanOperator.setupReader():178
    com.dremio.sabot.op.scan.ScanOperator.setup():163
    com.dremio.sabot.driver.SmartOp$SmartProducer.setup():560
    com.dremio.sabot.driver.Pipe$SetupVisitor.visitProducer():79
    com.dremio.sabot.driver.Pipe$SetupVisitor.visitProducer():63
    com.dremio.sabot.driver.SmartOp$SmartProducer.accept():530
    com.dremio.sabot.driver.StraightPipe.setup():102
    com.dremio.sabot.driver.StraightPipe.setup():102
    com.dremio.sabot.driver.Pipeline.setup():58
    com.dremio.sabot.exec.fragment.FragmentExecutor.setupExecution():347
    com.dremio.sabot.exec.fragment.FragmentExecutor.run():237
    com.dremio.sabot.exec.fragment.FragmentExecutor.access$800():88
    com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl.run():594
    com.dremio.sabot.task.AsyncTaskWrapper.run():103
    com.dremio.sabot.task.slicing.SlicingThread.run():110

SQLState:  null
ErrorCode: 0

When I try exactly the same with a JSON source (i.e., same data, but no MongoDB), I get no error.
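
The two schemas in the error message differ only in the `num` field. One possible reading (an assumption, not something confirmed in this thread) is that one schema was derived from the projected column `o.text` alone and the other from the full document. The sketch below is not Dremio code; it only shows how those two views of the same sample document produce exactly the two schema strings from the error. The helper names and the Python-to-type mapping are hypothetical.

```python
def type_of(value):
    """Render a Python value as a type string in the style of the error message.

    The mapping (int -> int32, str -> varchar) is an assumption for illustration.
    """
    if isinstance(value, dict):
        fields = ", ".join(f"{k}::{type_of(v)}" for k, v in value.items())
        return f"struct<{fields}>"
    if isinstance(value, bool):  # checked before int, since bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "int32"
    if isinstance(value, str):
        return "varchar"
    raise TypeError(f"unsupported value: {value!r}")


def schema(doc):
    """Render a whole document as schema(field::type, ...)."""
    fields = ", ".join(f"{k}::{type_of(v)}" for k, v in doc.items())
    return f"schema({fields})"


def project(doc, path):
    """Keep only the nested field named by path, e.g. ("o", "text")."""
    head, *rest = path
    return {head: project(doc[head], rest) if rest else doc[head]}


doc = {"o": {"text": "a", "num": 42}}

projected_schema = schema(project(doc, ("o", "text")))
full_schema = schema(doc)

if __name__ == "__main__":
    print("Original Schema", projected_schema)
    print("New Schema", full_schema)
    print("Schemas match:", projected_schema == full_schema)
```

If this reading is right, the mismatch would come from comparing a projected view against a full view of the same documents rather than from any real change in the data.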

Query Profile:
query_profile_concat_bug.zip (41.8 KB)

Sample JSON file to import into MongoDB:
concat.zip (262 Bytes)

Thanks, Tim

Has anyone experienced something similar? Are there any workarounds? My team is working on a PoC, and this issue is a showstopper for us.

Just for completeness: I simplified the query to make it easy to reproduce. We’re actually trying to do more complex operations, such as parsing timestamps.

Thanks, Tim

Hi @tid

Thanks for uploading the profiles. Let me look into this and get back to you.

Thanks
@balaji.ramaswamy

Hi Tim,

It looks like we are running into a minor bug here. We have started looking at this via an internal Engineering ticket and will keep you posted on the progress.

Thanks
@balaji.ramaswamy

Thanks, @balaji.ramaswamy!

Glad to hear that the issue is reproducible and under investigation!

Best, Tim

Hi @tid

Are you using the RPM or the TAR version?

Thanks
@balaji.ramaswamy

Hi,
We’re running on CentOS 7.5, so RPM.

Thanks, Tim