Is schema learning limited to 10 passes

I’m getting the following error testing an extremely complex file converted from XML to JSON to Parquet. The original XML file has 3,000 possible XML paths, which are not necessarily present in every record. Any ideas on how to learn the entire schema, even if I have to scan the entire file?

SCHEMA_CHANGE ERROR: New schema found and recorded. Please reattempt the query. Multiple attempts may be necessary to fully learn the schema.

After 10 tries it just quits.

Attempts
Attempt 1
Attempts 2–10 (schema learning)

On the last try it found the following difference:

GermanyTaxTransparence::boolean,
SwedenPMI::varchar,
UKIASector_Id::varchar,
UKIASector::varchar,

vs

GermanyTaxTransparence::boolean,
SwedenPMI::varchar,
UKAICSector::varchar,
UKAICSectorName::varchar,

UKIASector_Id::varchar,
UKIASector::varchar,

If you run the query multiple times, it should eventually learn the schema.

Schema learning is expensive, so after 10 tries it stops and gives you a warning (which we currently display as an error, which is misleading, sorry about that). Dremio continues to learn on each subsequent run and builds on top of what it learned previously; it does not discard the knowledge it has gained.


There’s definitely room for improvement here. I tried several times and couldn’t get a format to stick…

These Parquet files (100 to 200 MB each) were created using Apache Drill reading two JSON files. The JSON files are created from XML files, using an XSD schema definition file to figure out data types, arrays, nested key objects, etc.
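For reference, the conversion step is roughly the kind of Drill CTAS shown below (the workspace names, paths and session option are illustrative placeholders, not the exact commands used):

-- write Parquet output from a JSON source file (hypothetical paths)
ALTER SESSION SET `store.format` = 'parquet';
CREATE TABLE dfs.tmp.`xml_file_1` AS
SELECT * FROM dfs.`/staging/xml_file_1.json`;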

I can create formats on each individual parquet file.
I can create a format on the xml_file_1 folder
Trying to format the xml_file_2 folder fails after ten passes. Multiple tries didn’t help, and on top of everything the server crashes with “something went wrong”.

11150077 2017-10-10 13:33 /my_data_set/year=2017/month=09/xml_file_1/0_0_0.parquet
19113811 2017-11-03 15:40 /my_data_set/year=2017/month=09/xml_file_2/0_0_0.parquet
14098336 2017-11-03 15:40 /my_data_set/year=2017/month=09/xml_file_2/0_0_1.parquet

The goal here is to put a format on my_data_set and add data over time by year and month…

OK, this is odd: running ALTER PDS made the format show up in purple over green…

ALTER PDS "HDFS".my_data_set."year=2017"."month=09"."xml_file_2" REFRESH METADATA

Running this on the base folder appears to work as well.

ALTER PDS "HDFS".my_data_set REFRESH METADATA

Hi,

Is there anything in the logs from when it crashed? It would be interesting to understand why that is happening.

Attached is the downloaded profile. Is there anything specific from the logs I should include? This is a pretty gross profile and I’m definitely pushing limits here…

10 schema passes took over an hour, which is odd. If I just open a single Parquet file, or even a subdirectory (by year, month, original file name), it figures out the structure in a minute or less…

89c1bfb9-1bad-41e5-bbae-86f5fc6c066e.zip (969.1 KB)

From the profile, it seems you have 73 files in total, and indeed on each run Dremio is learning a new schema. The slowdown has to do with the fact that we are reading one Parquet file at a time and not parallelizing it. So we learn the schema of Parquet file 1, see a schema change, learn the new schema, and re-run the query. The re-run reads the first file to apply the filter, then reads Parquet file 2, sees a schema change, and the process repeats.

To help speed things up, you can set the advanced option planner.slice_target to 1.

You can set it by going to Admin in the top right of Dremio and choosing Advanced Settings. Scroll down to the Dremio Support section, enter planner.slice_target into the Support Key field, and press Show. Change the value in the new field to 1 and press Save.
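If you prefer SQL, the same support key can usually be set from a query window instead of the UI (assuming your Dremio version accepts ALTER SYSTEM for support keys):

-- lowering slice_target to 1 makes even small scans eligible for parallelization
ALTER SYSTEM SET "planner.slice_target" = 1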


This should speed things up and also allow multiple schemas to be learned on each run.

Schema learning is nice, but I think with really complex schemas it would be better to have the option to somehow create the Dremio metadata definitions by passing a JSON string to a Dremio SQL command. In my situation I’ve got over 3,500 potential key values with multiple levels of nesting sitting in 300 MB Parquet files.

@david.lee thanks for the feedback. We’re considering various options to make it easier to “bootstrap” dataset definitions.

It would be ideal if you could inject something like a JSON schema when querying JSON instead of relying on schema learning…

http://json-schema.org/


“Schema on read” +1. We are dealing with highly dynamic data (originally nested JSON) with thousands of possible key paths extracted from that nested JSON. The JSON structure is not controlled by us; the vendor can start sending new key-values on a whim. Data analysts must be able to select new dynamic paths.

Compare this to how Snowflake handles JSON data in VARIANT column types, where the contents are indexed in the background opportunistically, but you don’t have to pre-define a column for each possible key.

__json_data_column::key1.key2.key3 as value_of_key_3

You can group by / filter on the expression above.

I realize some of this can be achieved by packing the “dynamic” data into a JSON string and parsing it with convert_from(v, 'JSON').
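For example, something along these lines, with a hypothetical raw_events table whose payload VARCHAR column holds the vendor JSON (table, column and key names are placeholders):

-- parse the packed JSON once, then reach nested keys with dot notation
SELECT t.parsed.key1.key2.key3 AS value_of_key_3
FROM (
  SELECT CONVERT_FROM(payload, 'JSON') AS parsed
  FROM raw_events
) AS t
WHERE t.parsed.key1.key2.key3 = 'some_value'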

I would love to have both worlds joined, where we can tip off Dremio about which key paths in the dynamic data to pre-index / expose in dot notation.