I’ve been struggling to consume json market research data using Dremio, there are few fields in this data which are list of strings (for eg: favorite_brands), i’m trying to calculate average age by favorite_brands column.
select favorite_brands, avg(age) from my_table group by favorite_brands
What i’ve already tried:
- select FLATTEN(favorite_brands), avg(age) from my_table group by favorite_brands, which obviously didn’t work (Err: Dremio doesn’t support using FLATTEN in the GROUP BY clause.)
- UNNEST favorite_brands to create one new record per list value, below are few downsides of this approach.
a. Dremio removes the records where favorite_brands list is empty, resulting in data loss.
b. average/count/sum calculation on any of the measures are no more valid because of the duplicate records
c. data redundancy is exponential when there are 5-10 list fields in my data.
This scenario is valid when source is a nested json file, MongoDB or Elasticsearch, Is there an alternate way to calculate the same?