PDS Parquet Schema incorrectly learned after METADATA Refresh

Hello

Using Dremio 22.0.0

I have an S3 location where Parquet files (generated from a Protobuf schema, if that matters) containing analytics data are added on a daily basis. Files are added to a daily partition: some/prefix/YYYY/MM/dd

The S3 PDS is set to refresh metadata automatically every 3 hours in order to discover newly added files, so that queries become aware of them.

Over time, the schema keeps evolving, and the payload expands to contain more fields and nested structs (the schema itself is forward compatible; no breaking changes, of course).

Queries are run on top of the root path (say, querying 90 days' worth of data using dir0 / dir1 dataset filtering), along the lines of the sketch below.
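To illustrate, here is a minimal sketch of such a query, assuming the YYYY/MM/dd layout above maps to Dremio's dir0 / dir1 / dir2 partition columns (the literal values below are placeholders):

SELECT *
FROM "s3"."some"."prefix"
WHERE dir0 = '2022' AND dir1 IN ('04', '05', '06') -- roughly 90 days of daily partitions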

Immediately after the metadata refresh, it seems that the schema learned by Dremio uses only the Parquet files containing the older version 1 of the schema (I can confirm this by running Dremio in debug mode, where I see the Protobuf schema used is the old one). Let's say it has only field A. Later files using the newer version 2 of the schema, containing field A as well as field B, are not used, and Dremio is not aware of them.

If I try and run:

SELECT A,B FROM "s3"."some"."prefix"

I get the error

Column 'B' not found in any table

However, if I change the query to

SELECT * FROM "s3"."some"."prefix"

Dremio schema learning kicks in; it notices the schema changes and creates the cumulative schema.
After that, the original query looking for columns A and B works again. This happens each time the metadata refreshes, and can also be easily recreated by issuing

ALTER PDS "s3"."some"."prefix" REFRESH METADATA
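For reference, REFRESH METADATA accepts additional modifiers in Dremio's SQL; a variant that might be worth trying, assuming the documented FORCE UPDATE option applies here (I have not verified that it changes this behavior):

ALTER PDS "s3"."some"."prefix" REFRESH METADATA FORCE UPDATE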

Any idea on how to get Dremio schema learning to properly kick in on a query for specific fields, or how to force proper schema learning on metadata refresh?
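One heavier workaround that comes to mind, sketched here as an untested assumption, is making Dremio forget the dataset metadata entirely and then re-promoting the folder so the schema is learned from scratch:

ALTER PDS "s3"."some"."prefix" FORGET METADATA
-- assumption: auto-promotion is enabled, so a follow-up query re-learns the schema
SELECT * FROM "s3"."some"."prefix" LIMIT 1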

Hi @sheinbergon

I guess you are either on an older version (< 18.x), or you are between 18.x and 21.x with unlimited splits turned off, or your file format is not PARQUET/ORC/AVRO. This behavior is expected if one of the above three applies.

Thanks
Bali

  • Version is 22.0.0
  • Format is Parquet files (generated from a Protobuf schema)
  • Unlimited splits is turned on (checked via the query below)
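For the record, a quick way to check the unlimited splits flag from SQL, assuming it is exposed as a support key in sys.options (the exact key name is my guess, hence the wildcard match):

SELECT name, bool_val FROM sys.options WHERE name LIKE '%unlimited_splits%'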

@balaji.ramaswamy would love to get your response

@sheinbergon Can you send me the profile for the REFRESH DATASET internal job that gets created when the ALTER PDS is executed?

@balaji.ramaswamy I sent you the profile privately

@balaji.ramaswamy I followed up on the issue privately

@sheinbergon See above, I need the REFRESH DATASET internal job profile (click on Jobs, select Internal, and search for job type "REFRESH DATASET", in upper case)

The SQL will be

REFRESH DATASET "centralized-logging"."production"."payloads"."four-map-analytics-payload"

@balaji.ramaswamy I followed up on the issue privately several times and sent you the correct profile.
Can you please respond there or here as to what might be the issue?

@sheinbergon Yes found the right profile. I see Dremio has learnt 987 files in total. Is this the right number of files under that folder?

Yes, that sounds just about right. I also checked the debug logs, and Dremio was loading files with the new, updated schema as well as old ones (which it drew the schema from).

@sheinbergon So Dremio is reading all 987 files but not recognizing column "B" if you explicitly call it? Could it be that you are querying the VDS and not the PDS? Can you send me (privately) the profile of the select that threw the error that column "B" does not exist?

Yes, that's correct.
I do explicitly call column "B" as part of the query, and it fails.

I’m running the same query the VDS runs on top of the PDS myself and get the same behavior exactly, so it’s not a VDS issue.

Again, if I just call SELECT * on top of the PDS, proper schema learning is triggered.

I'll send you the query profile privately, not a problem.