Hello
Using Dremio 22.0.0
I have an S3 location where Parquet files (generated from a Protobuf schema, if that matters) containing analytics data are added on a daily basis. Files land in a daily partition, some/prefix/YYYY/MM/dd.
The S3 PDS is set to refresh its metadata automatically every 3 hours so that queries become aware of newly added files.
Over time the schema keeps evolving, and the payload expands to contain more fields and nested structs (the schema itself is forward-compatible, no breaking changes, of course).
Queries are run on top of the root path (say, querying 90 days' worth of data using dir0 / dir1 dataset filtering).
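For illustration, a query of that shape looks something like the following (the filter values here are hypothetical; with the partition layout above, dir0 maps to YYYY and dir1 to MM):
SELECT A FROM "s3"."some"."prefix" WHERE dir0 = '2023' AND dir1 IN ('01', '02', '03')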
Immediately after the metadata refresh, it seems that the schema learned by Dremio uses only the Parquet files containing the older version 1 of the schema (I can confirm that by running Dremio in debug mode, where I see that the Protobuf schema used is the old one). Let's say it has only field A. Later files using the newer version 2 of the schema, containing field A as well as field B, are not used, and Dremio is not aware of them.
If I try to run:
SELECT A,B FROM "s3"."some"."prefix"
I get the error
Column 'B' not found in any table
However, if I change the query to
SELECT * FROM "s3"."some"."prefix"
Dremio's schema learning kicks in: it notices the schema change and creates the cumulative schema.
After that, the original query selecting columns A and B works again. This happens each time the metadata refreshes, and it can also easily be reproduced by issuing
ALTER PDS "s3"."some"."prefix" REFRESH METADATA
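The obvious workaround, given the behaviour above, would be to chase every refresh with a wildcard probe that forces schema learning; a sketch of what I mean (same path as above):
ALTER PDS "s3"."some"."prefix" REFRESH METADATA
SELECT * FROM "s3"."some"."prefix"
But running a full SELECT * over 90 days of data after every refresh is expensive, so I would rather avoid that.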
Any idea how to get Dremio's schema learning to kick in properly on a query over specific fields, or how to force full schema learning on metadata refresh?