JSON schema is inferred from only a subset of files


I have an S3 folder that contains three years' worth of data in JSON format. I tried to create a Physical Dataset from this folder, but I can see only some of the fields that the JSON files contain. Over time we added new fields to our data, so the number of keys differs between files. We only ever appended new keys to our JSONs; we never removed any.

For instance, we introduced a new field in 2022 called visitor_id, but this field is not picked up when Dremio infers the schema.

How can we get Dremio to pick up all the possible columns (meaning all the keys in all the JSON files)? I already tried refreshing the metadata of the physical dataset, but it did not work.

Is it because there is a split limitation in S3 for data formats other than Parquet, Iceberg and Delta Lake?

@Jrmyy Can you try querying the file that has the extra columns, either by supplying a WHERE clause that touches that file or by simply running a “select *”? The column should then be learned.

Yes, you are right: for data lake formats, schema learning covers all files, whereas for types like JSON only a sample is taken.
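
To illustrate why a sample can miss a late-added key, here is a minimal Python sketch. It is not Dremio's actual sampling logic; the records, the `infer_keys` helper, and the choice of "the first two records" as the sample are all hypothetical and only meant to show the effect.

```python
import json


def infer_keys(records):
    """Return the union of top-level keys across JSON records, in first-seen order."""
    seen = {}
    for line in records:
        for key in json.loads(line):
            seen.setdefault(key, None)
    return list(seen)


# Hypothetical newline-delimited records; visitor_id only appears in later data.
records = [
    '{"event": "view", "ts": "2020-05-01"}',
    '{"event": "click", "ts": "2021-03-12"}',
    '{"event": "view", "ts": "2022-07-30", "visitor_id": "abc"}',
]

sampled = infer_keys(records[:2])  # schema learned from a sample of early records
full = infer_keys(records)         # schema learned from every record
missing = [k for k in full if k not in sampled]
```

Here `missing` holds the keys that only a full scan would discover, which matches the visitor_id situation described above: the sample never touched a record that carries the newer key.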