The only clean workaround I can think of would be to have Dremio bypass schema learning and allow the end user to pass in a JSON Schema (http://json-schema.org/) or Apache Arrow schema (https://arrow.apache.org/docs/metadata.html).
This can get really complex when a null value's true type isn't a simple scalar but a complex type such as an array or a nested object.
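Purely as an illustration of what a user-supplied schema might look like for the sample files further down (this is standard JSON Schema nullable-union syntax, not an existing Dremio option):
{
"type": "object",
"properties": {
"name": { "type": "string" },
"age": { "type": "integer" },
"maritalStatus": { "type": ["string", "null"] }
}
}
With something like this available, Dremio would never have to guess a type for a column that is null in every sampled record.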
A temporary workaround would be to have Dremio ignore null values. That way, when it learns schema on the fly, it would correctly extend the schema as real values show up instead of doing something silly like turning a null-only column into a boolean by default.
Here’s a modified example that has consistently crashed my server in the past:
File 1: maritalStatus is a column of type string
[{
"name": "Smith, Ted"
,"age" : 20
,"maritalStatus" : "civil partnership"
},
{
"name": "Jones, J."
,"age" : 20
,"maritalStatus" : null
},
{
"name": "Doe, Jane"
,"age" : 20
}]
File 2: maritalStatus becomes a column of type boolean, since there are no non-null values to infer the data type from
[{
"name": "Smith, Ted"
,"age" : 20
,"maritalStatus" : null
},
{
"name": "Jones, J."
,"age" : 20
,"maritalStatus" : null
},
{
"name": "Doe, Jane"
,"age" : 20
}]
Now try to have Dremio query both files in the same directory as a single data set… It turns maritalStatus into an alphanumeric column, or just crashes if you have really complex JSON. And what happens if maritalStatus isn't a string but a nested object like maritalStatus: {"spouse": ???, "yearMarried": ???} which happens to be null across an entire file?
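To make that last case concrete, here is a sketch of the nested variant (the spouse and yearMarried values are made up purely for illustration):
[{
"name": "Smith, Ted"
,"age" : 20
,"maritalStatus" : {"spouse": "Smith, Ann", "yearMarried": 2010}
}]
If a second file in the same directory only ever contains "maritalStatus" : null, schema learning has nothing to infer the struct type from, so the two files can't be reconciled into a single coherent schema.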