Add store.json or store.parquet option for omit_nulls

When reading JSON, can we have an option to omit null values?

I think a lot of the schema change issues are the result of null values in JSON files.

Example: Two JSON files with addresses

a.jsonl
{"address": "1 Lombard Street", "city": "San Francisco", "phone": "415-111-1111"}
{"address": "2 Market Street", "city": "San Francisco", "phone": "415-222-2222"}

b.jsonl
{"address": "3 Kearny Street", "city": "San Francisco", "phone": null}
{"address": "4 Bush Street", "city": "San Francisco", "phone": null}

I believe that if a.jsonl is converted to a.parquet, phone will end up as a string column in the Parquet file.
I believe that if b.jsonl is converted to b.parquet, phone will end up as an int column in the Parquet file.

Then, if you try to read both files in the same directory at the same time, you get a schema change / inconsistency issue.

Having the option to exclude null values when reading JSON files should at least get rid of column data type inconsistencies.
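For illustration only, here is a minimal Python sketch of the kind of preprocessing this option would make unnecessary: dropping null-valued keys from each JSONL record before conversion (the output name b_clean.jsonl is just a placeholder):

```python
import json

def omit_nulls(record: dict) -> dict:
    """Return a copy of the record with null-valued keys dropped."""
    return {k: v for k, v in record.items() if v is not None}

# Rewrite b.jsonl without the null "phone" values so type inference
# never sees an all-null column.
with open("b.jsonl") as src, open("b_clean.jsonl", "w") as dst:
    for line in src:
        line = line.strip()
        if not line:
            continue
        dst.write(json.dumps(omit_nulls(json.loads(line))) + "\n")
```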


Yes please!

The alternative is storing all null values as empty strings, which then poses a problem if other values in the column are integers and also triggers a schema change / inconsistency error.

I have been chasing and fighting this ghost with the convert_from(data, 'JSON') AS data function.

Also note that this function expects the schema to be exactly the same for every record. Therefore, if you have two records like the following, the query and any acceleration will fail.

{"address": "3 Kearny Street", "city": "San Francisco", "phone": null}
{"address": "4 Bush Street", "city": "San Francisco"}
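As a rough workaround sketch (assuming you can preprocess the records yourself, and using a hypothetical normalize helper), you can project every record onto the union of all keys so the per-record schemas match:

```python
import json

def normalize(records):
    """Give every record the same key set so per-record schemas are
    identical (missing keys become explicit nulls)."""
    all_keys = set()
    for rec in records:
        all_keys.update(rec)
    return [{k: rec.get(k) for k in sorted(all_keys)} for rec in records]

records = [
    {"address": "3 Kearny Street", "city": "San Francisco", "phone": None},
    {"address": "4 Bush Street", "city": "San Francisco"},
]

for rec in normalize(records):
    print(json.dumps(rec))
```

Of course, the explicit nulls this introduces are exactly what the original request wants to be able to omit, which is why an engine-level omit_nulls option would be cleaner than preprocessing on either side.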