Extracting Json field from source gives us mixed type

Hello everyone. We are working with a Dremio installation using Hive and HDFS as a source. The data we stock on the HDFS tables has various fields, one of those is a complex JSON. When we try and extract a specific subset of this JSON, it appears Dremio identifies the extracted field as a mixed type . Now, weirdly enough, the very same dataset once exported and re-imported in Dremio (as JSON) works as supposed and extracts the specifc subset as a struct. The subset in itself is a pretty standard and well formed JSON, all of the fields are populated (since Dremio has no way of learning from a given Schema as far as we know) or have an empty string value, i.e. STRUCTURE:{"field1":"String", "field2":"String", "fieldn":"String","fieldint":Int}. Any help would be thoroughly appreciated.

@will23332

Could you run a select is_struct(foo.bar) from datasetand see which rows are returned as false? That should provide some information as to which values are causing Dremio to consider it as a mixed type.

@doron
Thanks for the answer! I’m afraid that the command didn’t give us much insight, the select tells us that all of the subsets are considered struct, there are no false values. We also tried to select just the ones that where supposed to be false with a WHERE is_struct(parent.set.subset) = false and there were none.

Following https://docs.dremio.com/sql-reference/sql-commands/datasets.html, can you run ALTER PDS <PHYSICAL-DATASET-PATH> REFRESH METADATA FORCE UPDATE and see if that helps? By default Dremio will add new types to its metadata as it discovers field values - it could be that there used to be a row that had a non-struct value and Dremio will continue to think that the column is a mixed type. This SQL command will cause Dremio to forget the metadata and recalculate it.

If this is a production system and you can’t afford to refresh the metadata, you could also add the source again as a new source and see if the new one behaves any different.

Thank you so very much! The refresh didn’t work (weirdly enough) even though it went smoothly, adding the source again worked like a charm. It definitely had to do with how Dremio discovered that very field when it wrote its metadata.
Thank you again and have a nice day!

@will23332 You can force Dremio to forget the metadata for the existing source using this command - ALTER PDS <PHYSICAL-DATASET-PATH> FORGET METADATA.

1 Like