JSON schema infers only on a subset

Jrmyy · October 19, 2022, 4:00pm

Hello,

I have an S3 folder containing 3 years' worth of data in JSON format. I tried to create a Physical Dataset from this folder, but I only see some of the fields that the JSON files contain. Over time we added new fields to our data, so the files differ in the number of keys. We only ever appended new keys to our JSON files; we never removed any.

For instance we introduced a new field in 2022 called visitor_id, but this field is not being picked up when Dremio infers the schema.

How can we get Dremio to pick up all possible columns (that is, every key across all the JSON files)? I already tried refreshing the metadata of the physical dataset, but it did not work.

Is it because there is a split limitation in S3 for data formats other than Parquet, Iceberg and Delta Lake?
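To find out which files actually carry the newer keys, a small script over a local copy of the data can help. This is only a sketch under assumptions not stated in the post: the files are newline-delimited JSON, they have been copied to a local folder, and their sorted file names reflect chronological order. The function name `first_seen_keys` is hypothetical.

```python
import json
from pathlib import Path


def first_seen_keys(folder):
    """Map each top-level key to the first file (in sorted name order) containing it.

    Assumes newline-delimited JSON files with a .json extension.
    """
    first_seen = {}
    for path in sorted(Path(folder).glob("*.json")):
        with path.open() as fh:
            for line in fh:
                line = line.strip()
                if not line:
                    continue  # skip blank lines
                record = json.loads(line)
                for key in record:
                    # setdefault keeps the earliest file for each key
                    first_seen.setdefault(key, path.name)
    return first_seen
```

The resulting mapping shows, for a key such as visitor_id, the earliest file in which it appears, which is a file worth targeting when checking what the inferred schema misses.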

balaji.ramaswamy · October 30, 2022, 5:45pm

@Jrmyy Can you try querying the file that has the extra columns, either by supplying a WHERE clause that touches that file or by simply running a “select *”? The column should then be learned.

Yes, you are right: for data lake formats, schema learning covers all files, whereas for types like JSON only a sample is taken.
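As a toy illustration of why a sample can miss a late-added key, here is a minimal sketch. It is not Dremio's actual sampling logic; the function `infer_schema`, the record layout, and the "first N records" sampling rule are all assumptions made for the example.

```python
def infer_schema(records, sample_size=None):
    """Return the sorted union of keys in the scanned records.

    With sample_size=None every record is scanned; otherwise only the
    first sample_size records are (a stand-in for sampling).
    """
    scanned = records if sample_size is None else records[:sample_size]
    keys = set()
    for record in scanned:
        keys.update(record)  # adds the record's top-level keys
    return sorted(keys)


# Older records lack visitor_id; only the most recent one has it.
old = [{"event": "view", "ts": i} for i in range(5)]
new = [{"event": "view", "ts": 5, "visitor_id": "abc"}]
records = old + new

sampled = infer_schema(records, sample_size=3)  # scans old records only
full = infer_schema(records)                    # scans everything
```

Here `sampled` contains only event and ts, while `full` also contains visitor_id, which mirrors the symptom described in the question.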
