Json data : null and missing fields

For json data sources, is there any support for either null values or missing fields?

I’ve tried using store.json.all_text_mode and it does not work. The field that has a null value is still ignored.

I also have example documents where a null value is represented in the document as a missing field.

At the moment this is a blocker to Dremio’s use in our Enterprise.

An interesting comparison is with Couchbase N1QL, this very neatly solves the null/missing issue.

Regards.

Dave.

Is it possible to share some sample JSON?

In the three objects below. In the first the maritalStatus is known, in the second it is null and in the third it is missing.

[{
“name”: “Smith, Ted”
,“age” : 20
,“maritalStatus” : “civil partnership”
},
{
“name”: “Jones, J.”
,“age” : 20
,“maritalStatus” : null
},
{
“name”: “Doe, Jane”
,“age” : 20

}]

Let us say that when designing the data set object 1 is not available. For whatever reason.

The only objects 1 or 2 available for the Dremio designer to work with, and it will not be possible to map the maritalStatus into the dataset.

Hey there David,

Yes this is a known behaviour of Dremio. For semi-structured data Dremio tries to “learn” the schema, when it first encounters a data-set. Any properties not appearing in that data-set (and for large data-sets it will sample the data-set, so it may be the case that the property does appear in one of the files/rows but is not included in the sample) are simply excluded from the schema, and there is currently no way to later add that extra property/column, without triggering a schema refresh.

So the current workaround is to refresh the schema once a new property emerges.

The only clean work around I think would be to have Dremio bypass schema learning and allow the end user to pass in a JSON schema (http://json-schema.org/) or Apache Arrow schema (https://arrow.apache.org/docs/metadata.html)

This can get really complex if a null value isn’t a simple data type, but an object like an array or object.

A temporary work around would be to have Dremio ignore null values. This way when it does learn schema on the fly it will correctly modify the schema instead of doing something silly like turning a null value into an boolaen by default, etc…

Here’s a modified example that has consistently crashed my server in the past:

File 1: martialStatus is a column of type string
[{
“name”: “Smith, Ted”
,“age” : 20
,“maritalStatus” : “civil partnership”
},
{
“name”: “Jones, J.”
,“age” : 20
,“maritalStatus” : null
},
{
“name”: “Doe, Jane”
,“age” : 20

}]

File 2: martialStatus is a column of type boolean since there are no samples to infer the datatype.
[{
“name”: “Smith, Ted”
,“age” : 20
,“maritalStatus” : null
},
{
“name”: “Jones, J.”
,“age” : 20
,“maritalStatus” : null
},
{
“name”: “Doe, Jane”
,“age” : 20

}]

Now try to have Dremio query both files in the same directory as a single data set… It turns martialStatus into an alpha numeric column or just crashes if you have really complex json. What happens if martialStatus isn’t a string, but a key like martialStatus: {“spouse”: ???, “yearMarried”: ???} which happens to be null across an entire file.

Thanks Christy.

Refreshing the schema when a new property emerges is not possible.

The data sets that we use have sparsely populated fields, and on any given point in time you are not guaranteed to have all fields populated. In fact in some cases it is logically not possible to have all the fields populated in a single object.

As david.lee says the perfect implementation would be for Dremio to use a JSON schema as the way to create the data set.

As a work around, I will created fictitious JSON files, where all the fields are artificially populated.

Regards.

Dave.