When reading JSON, can we have an option to omit null values?
I think a lot of the schema change issues are the result of null values in JSON files.
Example: Two JSON files with addresses
a.jsonl
{"address": "1 Lombard Street", "city": "San Francisco", "phone": "415-111-1111"}
{"address": "2 Market Street", "city": "San Francisco", "phone": "415-222-2222"}
b.jsonl
{"address": "3 Kearny Street", "city": "San Francisco", "phone": null}
{"address": "4 Bush Street", "city": "San Francisco", "phone": null}
I believe that if a.jsonl is converted to a.parquet, phone will end up as a string column in the Parquet file.
I believe that if b.jsonl is converted to b.parquet, phone will end up as an int column in the Parquet file.
Then, if you try to read both files together from the same directory, you get a schema change / inconsistency issue.
Having the option to exclude null values when reading JSON files should at least get rid of column data type inconsistencies.
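To make the request concrete, here is a rough sketch of the behavior I have in mind, written in plain Python with only the standard library. It is not the actual JSON reader: the omit_nulls flag and the helper names read_jsonl, infer_schema, and conflicting_columns are all made up for illustration, and the "schema" is just the set of Python type names seen per column.

```python
import json

# The two example files from above, inlined as strings.
A_JSONL = '''{"address": "1 Lombard Street", "city": "San Francisco", "phone": "415-111-1111"}
{"address": "2 Market Street", "city": "San Francisco", "phone": "415-222-2222"}'''

B_JSONL = '''{"address": "3 Kearny Street", "city": "San Francisco", "phone": null}
{"address": "4 Bush Street", "city": "San Francisco", "phone": null}'''


def read_jsonl(text, omit_nulls=False):
    """Parse JSON Lines text; omit_nulls is the hypothetical option being requested."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        record = json.loads(line)
        if omit_nulls:
            # Drop keys whose value is null, as if the field were absent.
            record = {k: v for k, v in record.items() if v is not None}
        records.append(record)
    return records


def infer_schema(records):
    """Map each column name to the set of Python type names observed for it."""
    schema = {}
    for record in records:
        for key, value in record.items():
            schema.setdefault(key, set()).add(type(value).__name__)
    return schema


def conflicting_columns(schema_a, schema_b):
    """Columns present in both schemas whose observed types differ."""
    shared = schema_a.keys() & schema_b.keys()
    return sorted(k for k in shared if schema_a[k] != schema_b[k])


if __name__ == "__main__":
    # Nulls kept: phone is str in a.jsonl but NoneType in b.jsonl.
    print(conflicting_columns(infer_schema(read_jsonl(A_JSONL)),
                              infer_schema(read_jsonl(B_JSONL))))
    # Nulls omitted: b.jsonl has no phone column at all, so nothing conflicts.
    print(conflicting_columns(infer_schema(read_jsonl(A_JSONL, omit_nulls=True)),
                              infer_schema(read_jsonl(B_JSONL, omit_nulls=True))))
```

With nulls kept, phone is the only column reported as conflicting. With nulls omitted, b.jsonl simply has no phone column, so the remaining problem is a missing column rather than a type mismatch.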