New schema found. Please reattempt the query. Multiple attempts may be necessary to fully learn the schema

Hi,

My team and I would sometimes encounter this error message when posting queries using the REST API:

We suspect the cause of this error is that the physical datasets we are trying to query have nested JSON fields (so multiple query attempts might be needed for schema learning).

We noticed that queries requiring schema learning are sometimes automatically re-attempted by Dremio (as seen in the screenshot below), while at other times no automatic re-attempts are performed and the error is thrown immediately (as seen in the previous screenshot). We find this inconsistency quite peculiar.

We are currently using Dremio version 4.7.2 (Community Edition). We have read in the release notes for Dremio version 4.5.0 that this issue was resolved by only pushing down projections that are simple column references.

Hence, we would like to check whether the issue we are facing is a bug, and whether there is any way to allow a limited number of automatic re-attempts for schema learning before Dremio throws back the error message.
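In the meantime, we work around it on the client side by wrapping our REST submissions in a bounded retry loop. Below is a minimal sketch; the helper name and retry parameters are our own, and the only assumption is that the failure surfaces as an exception whose message contains the "Please reattempt the query" marker quoted above:

```python
import time

# Substring of the schema-learning error message quoted at the top of this thread.
SCHEMA_LEARNING_MARKER = "Please reattempt the query"

def run_with_schema_retries(submit_query, max_attempts=10, delay_seconds=1.0):
    """Re-submit a query while Dremio reports it is still learning the schema.

    `submit_query` is any zero-argument callable (e.g. one that POSTs to the
    REST API and raises on a failed job). Unrelated errors are re-raised
    immediately; only schema-learning errors are retried.
    """
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return submit_query()
        except Exception as exc:
            if SCHEMA_LEARNING_MARKER not in str(exc):
                raise  # not a schema-learning failure: surface it right away
            last_error = exc
            time.sleep(delay_seconds)
    raise RuntimeError(
        f"schema still not learned after {max_attempts} attempts"
    ) from last_error
```

This does not replace server-side re-attempts, but it gives us the "limited auto re-attempts" behaviour we were asking about without manual intervention.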

Here are the job profiles of the two screenshots to aid in the debugging of this issue:

Job profile of the failed query:
389394d2-c16b-4219-b6da-afb3ac59ce26.zip (15.4 KB)

Job profile of the successful query after automatic schema learning re-attempt(s):
9aec65a0-c715-423f-8e04-7e214253537b.zip (31.8 KB)

Thank you.

Same problem.

I’m trying to build a raw reflection over a MongoDB collection with millions of documents. One field contains a variable JSON structure.

@fdellutri @edksk If the schema keeps changing, Dremio tries 10 times and then stops. However, it does not forget what it has learnt, so rerunning the query will pick up where it left off.

Is there a reason the schemas are so heterogeneous?

In my case, the MongoDB collection has a field that stores a nested, variable structure. In such a case, is there any solution I could follow (e.g. a schema transformation)?
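One stopgap I am considering is normalizing the documents to a fixed key set before Dremio reads them, so the inferred schema stops changing between batches. A minimal sketch (the function name is mine, and it assumes I can preprocess the documents during export):

```python
def normalize_keys(docs, default=None):
    """Give every document the same key set so the inferred schema is stable.

    Collects the union of keys across all documents (in order of first
    appearance) and fills missing keys with `default`.
    """
    all_keys = []
    for doc in docs:
        for key in doc:
            if key not in all_keys:
                all_keys.append(key)
    return [{key: doc.get(key, default) for key in all_keys} for doc in docs]
```

Documents like [{"city": "San Jose"}, {"state": "CA"}] would then all expose both city and state, with None where a key is absent, so Dremio sees one consistent struct instead of a new field on every record.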

@fdellutri If the struct’s schema is constantly changing, there is currently not much we can do. Toward the end of Q3 we are coming out with enhancements that will let you define your own schema, which should address this.

Is there any update on this? We keep running into this issue and were wondering if there is a way to define a schema.

@OmarSultan85 Internal schema is a feature we are working on; it will be available later this year.

Hi, I have a Nessie table and this error keeps popping up every time I try to access the JSON field. You can easily reproduce it by creating a table on a Nessie endpoint like this:

CREATE TABLE Nessie.my_catalog.my_schema.my_table AT BRANCH main
AS (
    SELECT 
        1 AS id, 
        'Example Name' AS name, 
        '[{"key": "value1"}, {"key": "value2"}]' AS json_data    
)

and then query like this:

SELECT CONVERT_FROM(json_data, 'JSON') FROM Nessie.my_catalog.my_schema.my_table

It ends up with this error, no matter whether I run it twice or more often:

New field in the schema found. Please reattempt the query. Multiple attempts may be necessary to fully learn the schema.

Is this a known behaviour? Is there any possibility to handle json fields using the nessie endpoint?

Thank you in advance!

@styx0r Not sure if it is related to Nessie. Does the repro work if you try it as a regular non-Nessie table, e.g. Parquet or JSON files on S3 or Hive?

@balaji.ramaswamy thanks for your answer. Yes, it works. I tried a regular MinIO S3 endpoint with both Parquet and Iceberg, and both behave as expected. That’s why I concluded it’s related to Nessie.

Hi @styx0r,

Did you find any resolution or workaround to proceed? I am facing the exact same issue in my use case with the Nessie catalog in Dremio.

I would appreciate it if you could share some details.

Thanks.

@anupam Do you know for a fact that the same column name has different data types across Parquet files?

@balaji.ramaswamy The issue also appears for a single record with differing keys. If the json_data column has a value like [{"city": "San Jose"}, {"state": "CA"}], the issue is reproducible.

It occurs when this data comes from a column in an Iceberg table with the Nessie catalog.

This works:
SELECT CONVERT_FROM(json_data, 'JSON') FROM "data.parquet";

Created an Iceberg table:
CREATE TABLE nessie.my_ns.test_table AS SELECT * FROM "data.parquet";
SELECT * FROM nessie.my_ns.test_table;

But this fails even after multiple attempts:
SELECT CONVERT_FROM(json_data, ‘JSON’) FROM nessie.my_ns.test_table;

Error:
New field in the schema found. Please reattempt the query. Multiple attempts may be necessary to fully learn the schema.

@anupam Let me check if there are any specific issues with Nessie catalog on schema learning