PDS Parquet Schema incorrectly learned after METADATA Refresh

Hello

Using Dremio 22.0.0

I have an S3 location where Parquet files (generated from a Protobuf schema, if that matters) containing analytics data are added on a daily basis. Files are added to a daily partition: some/prefix/YYYY/MM/dd

The S3 PDS is set to refresh metadata automatically every 3 hours in order to discover newly added files, so that queries become aware of them.

Over time, the schema keeps evolving, and the payload expands to contain more fields and nested structs (the schema itself is forward compatible; no breaking changes, of course).

Queries are run on top of the root path (say, querying 90 days' worth of data using dir0 / dir1 dataset filtering), along the lines of the sketch below.
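To illustrate, here is a minimal sketch of such a query, assuming the YYYY/MM/dd layout above maps to Dremio's dir0 / dir1 / dir2 partition columns (the literal values below are placeholders):

SELECT *
FROM "s3"."some"."prefix"
WHERE dir0 = '2022' AND dir1 IN ('04', '05', '06') -- roughly 90 days of daily partitions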

Immediately after the metadata refresh, it seems that the schema learned by Dremio uses only the Parquet files containing the older version 1 of the schema (I can confirm this by running Dremio in debug mode, where I see the Protobuf schema used is the old one). Let's say it has only field A. Later files using the newer version 2 of the schema, containing field A as well as field B, are not used, and Dremio is not aware of them.

If I try and run:

SELECT A,B FROM "s3"."some"."prefix"

I get the error

Column 'B' not found in any table

However, if I change the query to

SELECT * FROM "s3"."some"."prefix"

Dremio schema learning kicks in; it notices the schema changes and creates the cumulative schema.
After that, the original query looking for columns A and B works again. This happens each time the metadata refreshes, and can also be easily recreated by issuing

ALTER PDS "s3"."some"."prefix" REFRESH METADATA
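For reference, REFRESH METADATA accepts additional modifiers in Dremio's SQL; a variant that might be worth trying, assuming the documented FORCE UPDATE option applies here (I have not verified that it changes this behavior):

ALTER PDS "s3"."some"."prefix" REFRESH METADATA FORCE UPDATE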

Any idea on how to get Dremio schema learning to properly kick in on a query for specific fields, or how to force proper schema learning on metadata refresh?
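One heavier workaround that comes to mind, sketched here as an untested assumption, is making Dremio forget the dataset metadata entirely and then re-promoting the folder so the schema is learned from scratch:

ALTER PDS "s3"."some"."prefix" FORGET METADATA
-- assumption: auto-promotion is enabled, so a follow-up query re-learns the schema
SELECT * FROM "s3"."some"."prefix" LIMIT 1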

Hi @sheinbergon

I guess you are either on an older version (< 18.x), or you are between 18.x and 21.x with unlimited splits turned off, or your file format is not PARQUET/ORC/AVRO. This behavior is expected if one of the above three applies.

Thanks
Bali

  • Version is 22.0.0
  • Format is Parquet files (generated from a Protobuf schema)
  • Unlimited splits is turned on (checked via the query below)
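For the record, a quick way to check the unlimited splits flag from SQL, assuming it is exposed as a support key in sys.options (the exact key name is my guess, hence the wildcard match):

SELECT name, bool_val FROM sys.options WHERE name LIKE '%unlimited_splits%'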

@balaji.ramaswamy would love to get your response

@sheinbergon Can you send me the profile for the REFRESH DATASET internal job that gets created when the ALTER PDS is executed?

@balaji.ramaswamy I sent you the profile privately

@balaji.ramaswamy I followed up on the issue privately

@sheinbergon See above, I need the REFRESH DATASET internal job profile (click on Jobs, select Internal, and search for job type "REFRESH DATASET", in upper case)

The SQL will be

REFRESH DATASET "centralized-logging"."production"."payloads"."four-map-analytics-payload"

@balaji.ramaswamy I followed up on the issue privately several times and sent you the correct profile.
Can you please respond there or here as to what might be the issue?

@sheinbergon Yes found the right profile. I see Dremio has learnt 987 files in total. Is this the right number of files under that folder?

Yes, that sounds just about right. I also checked the debug logs, and Dremio was loading files with the new, updated schema as well as old ones (which it drew the schema from).

@sheinbergon So Dremio is reading all 987 files but not recognizing column "B" if you explicitly call it? Could it be that you are querying the VDS and not the PDS? Can you send me (privately) the profile of the select that threw the error that column "B" does not exist?

Yes, that's correct.
I do explicitly call column "B" as part of the query, and it fails.

I’m running the same query the VDS runs on top of the PDS myself and get the same behavior exactly, so it’s not a VDS issue.

Again, if I just call SELECT * on top of the PDS, proper schema learning is triggered.

I'll send you the query profile privately, not a problem.