We are using S3 as the source. The dataset contains a column (field_name) with values like ‘3349300447666044928’, which we also handle / process in Redshift, treating them as “varchar”.
We are using an on-premise Community Edition Dremio v4.1.8.
Now, when we try to query this dataset, we get the error in the title.
If I try to create a reflection on it, we just keep getting “New schema found. Please reattempt the query. Multiple attempts may be necessary to fully learn the schema.” (no “recorded” mention though). We reattempted more than 10 times, disabling / enabling the reflection, with the same result after the same job time / GBs - it seems to get stuck at the same spot.
Is there any way to query this dataset through Dremio?
(We attempted at some point to upgrade Dremio, but we hit some nasty errors, and OOM errors on queries / reflections that do not OOM on 4.1.8.)
If you really know that an upgrade would fix this issue - we will do it.
It seems like the files under the folder you are promoting have a very heterogeneous schema. Unfortunately Dremio has to learn the schema and gives up after 10 attempts. The good thing is that Dremio does not throw away what it has learnt, so you can continue to run the query a few more times. If it still does not learn the schema, it is probably a good idea to look at the files and see why the schema changes so much.
On the later versions we print what the mismatch is. Are you able to upgrade and send us the profile of the schema learning and the OOM errors you are facing?
Hey @balaji.ramaswamy, thank you for your answer.
The schema is not heterogeneous at all.
It looks like a Dremio issue, since it cannot “evolve” the found schema from “double” to “decimal(30,0)” - it’s DECIMAL(30,0), so no decimal places, just a bigger numeric type. Or at least to VARCHAR, if it cannot evolve between those two.
Installed the latest Dremio version - the preview works fine (just luck), but the raw reflection now simply fails with the “Mixed types double , decimal(30,0) for field hitid_high are not supported.” error - no more “new schema” retries.
What it looks like is that if we have big strings composed only of digits, Dremio attempts to type them as numeric in the schema and then fails with this error - making Dremio unable to use any dataset with long digit-only values (sample value: “6391735060709509184”)?
I attached the profile (not sure what’s up with the “Table ‘json’ not found” error in log_attempt_0.json - it’s a fresh Dremio install): dremio_archive_profile.zip (39.2 KB).
Is your ask, that Dremio does an implicit conversion for the below?
UNSUPPORTED_OPERATION ERROR: Mixed types double , decimal(30,0) for field hitid_high are not supported.
Well, Dremio is attempting to infer the schema, I guess, for the parquet source.
It starts somehow with a “double” and then for some reason (maybe the larger digit “strings”) decides to go with decimal(30,0) - it’s just a raw reflection query.
Is there any way to actually create a reflection on or query this dataset with Dremio? I know there is no way to force a certain data type for physical dataset columns…
From my understanding for some files it’s inferring as “double” and for other files as “decimal(30,0)” - just bigger ints.
So without a way to hint Dremio on the data type for that column, I guess there’s no way to use this datasource with Dremio.
What is the datasource? You can cast the column to the type you wish and save that as a VDS. Then use the VDS for querying or if it needs a reflection, create it on the VDS
The datasource is a set of parquet files in S3.
We tried that. We cannot query the PDS at all, so we cannot make a VDS on top of it - it still errors out with the same error. It attempts to learn the schema and fails.
Any query we attempt on it, even trying to cast that column to VARCHAR in a VDS - it still tries to learn the schema of the PDS.
Just trying to make sure - in order to “reattempt the query” for reflections, in the community edition - is it correct to disable / enable the reflection?
Sorry, my bad, I missed the fact that the error is on the SCAN operator, so you are right, a VDS will not help. I wanted to suggest upgrading and going via AWS Glue, but this double to decimal mix will not be supported.
Would you know for a fact that the ETL drops some Parquet files as DECIMAL and some as DOUBLE? Is it possible to change the ETL so you get only one type?
We are able to use that dataset in Redshift (we use those columns as VARCHARs).
I’m sure there are no decimals - even Dremio attempts to read them as DECIMAL(30,0), i.e. with no decimal places - maybe because they just exceed what a double can hold.
We have shorter values like “404764343499088250” (18 digits) and values like “5562457373202737227” (19 digits). I think the first one might fit into a “double”, and for the second one it infers the type to DECIMAL(30,0).
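For what it's worth, the exact-integer range of a double is easy to check; this only shows that both sample values above already exceed it, so the 18- vs 19-digit split may not be the precise double boundary:

```python
# A double (IEEE 754 binary64) represents integers exactly only up to 2**53
# (about 9.0e15, i.e. 16 digits). Both sample values are well past that.
v18 = 404764343499088250      # 18 digits
v19 = 5562457373202737227     # 19 digits
exact_limit = 2 ** 53
print(exact_limit)            # 9007199254740992
print(v18 > exact_limit)      # True
print(v19 > exact_limit)      # True
```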
I can understand your problem; later this year we are coming out with an internal schema solution where you can define the columns needed and their data types.
Can you send us the parquet-tools meta output from a parquet file that has 18-digit values and one that has 19-digit values?
Thanks for this - we figured it out - not Dremio related.
Just for the context - we are UNLOADing these parquet files from Redshift.
The schema was changed at some point on the table, and the new files were written with DECIMAL(30,0)s, whereas initially they were indeed DOUBLE.
Sorry for the time lost on this non-issue @balaji.ramaswamy .