The only clean workaround I can think of would be to have Dremio bypass schema learning and allow the end user to pass in a JSON Schema (http://json-schema.org/) or Apache Arrow schema (https://arrow.apache.org/docs/metadata.html).
This can get really complex when a null value's true type isn't a simple scalar but a complex type such as an array or a nested object.
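Purely as an illustration of what a user-supplied schema might look like for the sample files further down (this is standard JSON Schema nullable-union syntax, not an existing Dremio option):
{
"type": "object",
"properties": {
"name": { "type": "string" },
"age": { "type": "integer" },
"maritalStatus": { "type": ["string", "null"] }
}
}
With something like this available, Dremio would never have to guess a type for a column that is null in every sampled record.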
A temporary workaround would be to have Dremio ignore null values. That way, when it learns schema on the fly, it would correctly extend the schema as real values show up instead of doing something silly like turning a null-only column into a boolean by default.
Here’s a modified example that has consistently crashed my server in the past:
File 1: maritalStatus is a column of type string
[{
"name": "Smith, Ted"
,"age" : 20
,"maritalStatus" : "civil partnership"
},
{
"name": "Jones, J."
,"age" : 20
,"maritalStatus" : null
},
{
"name": "Doe, Jane"
,"age" : 20
}]
File 2: maritalStatus becomes a column of type boolean, since there are no non-null values to infer the data type from
[{
"name": "Smith, Ted"
,"age" : 20
,"maritalStatus" : null
},
{
"name": "Jones, J."
,"age" : 20
,"maritalStatus" : null
},
{
"name": "Doe, Jane"
,"age" : 20
}]
Now try to have Dremio query both files in the same directory as a single data set… It turns maritalStatus into an alphanumeric column, or just crashes if you have really complex JSON. And what happens if maritalStatus isn't a string but a nested object like maritalStatus: {"spouse": ???, "yearMarried": ???} which happens to be null across an entire file?
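To make that last case concrete, here is a sketch of the nested variant (the spouse and yearMarried values are made up purely for illustration):
[{
"name": "Smith, Ted"
,"age" : 20
,"maritalStatus" : {"spouse": "Smith, Ann", "yearMarried": 2010}
}]
If a second file in the same directory only ever contains "maritalStatus" : null, schema learning has nothing to infer the struct type from, so the two files can't be reconciled into a single coherent schema.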