I am working with the latest Dremio Docker image (dremio/dremio-oss:latest), which appears to be version 14. I am using Dremio to connect to an Azure Blob Storage v2 account that has a folder of Parquet files. I can create the connection to ADLS and set up a dataset over the Parquet files without any issue. However, I am not able to use any window functions in my queries. The two window functions I tried were ROW_NUMBER() and COUNT(), and both resulted in an UnsupportedOperationException.
The exact error was as follows:
SYSTEM ERROR: UnsupportedOperationException
[Error Id: d8f0553c-8b83-45dd-a2be-dc7176c4485a on 15da02e03768:0]
I also tried version 13.1 and encountered the same error.
Is it possible to issue window functions to Parquet files on ADLS?
This is a known issue that we have an internal ticket for. I will add your report to the ticket to help prioritize a fix.
I just heard back from engineering that this is due to the data having mixed types. Dremio has deprecated support for mixed types, so you have a couple of choices:
- Clean the lake so there is a single type for that column
- Cast the column to a specific type, save the result as a VDS, and query the VDS that contains the cast
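A sketch of the second option, using Dremio's CREATE VDS syntax (the space and dataset paths here are illustrative, not from the original thread):

```sql
-- Create a virtual dataset (VDS) that pins the ambiguous column to one type,
-- then run all queries against the VDS instead of the raw dataset.
CREATE VDS myspace.people_clean AS
SELECT
  id,
  first_name,
  last_name,
  CAST(modified_date AS TIMESTAMP) AS modified_date
FROM adls.people;
```

Window functions issued against `myspace.people_clean` would then see a single, unambiguous type for the column.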
Thank you for responding so fast. I do not agree that this is a problem of mixed types. When I attempted to use window functions, I was querying a single Parquet file in an ADLS blob location. The schema of the file is:
id: int64 PARQUET:field_id: ‘1’
first_name: string PARQUET:field_id: ‘2’
last_name: string PARQUET:field_id: ‘3’
birth_date: date32[day] PARQUET:field_id: ‘4’
gender: string PARQUET:field_id: ‘5’
modified_date: timestamp[us] PARQUET:field_id: ‘6’
execution_id: string PARQUET:field_id: ‘7’
filename: string PARQUET:field_id: ‘8’
I was attempting to run ROW_NUMBER() OVER (PARTITION BY id ORDER BY modified_date). I can share this Parquet file if you want; it is only 2.3 MB and is entirely test data.
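For context, the full statement was along these lines (the dataset path and selected columns are illustrative, since the thread only shows the window-function fragment):

```sql
-- Assign a revision number per id, ordered by when the row was modified.
SELECT
  id,
  first_name,
  last_name,
  modified_date,
  ROW_NUMBER() OVER (PARTITION BY id ORDER BY modified_date) AS revision
FROM adls.people;
```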
I attempted the same window function on a Parquet file with the same schema but only 24 rows, and the function worked. It seems the nature of this issue lies in the size of the file, not its schema.
That would be very useful for reproducing the issue and determining whether we are hitting a bug. Are you able to provide us with the Parquet file?
No problem. The Parquet file is downloadable from this URL: Upload files for free - people_2021-03-19_205719804817.parquet - ufile.io
Please confirm that you were able to download and use the file with success.
Yes, I was able to download the file. For some reason Dremio marks "modified_date" as a mixed type. Let us investigate and get back to you.
Would you be able to share the query that you are running on this Parquet dataset?
As a new user of Dremio, I completely missed the fact that modified_date comes up as a mixed type. I was able to get the query I wanted by coercing the value to an integer. The query I was trying to run was:
row_number() over (partition by id order by modified_date) as revision,
The metadata in the Parquet file shows this is a timestamp[us]. Why would this value be interpreted as a mixed type? What other type could it be, other than an integer?
I have run queries against this exact dataset using the PolyBase/Hadoop bridge connector that Microsoft provides in SQL Server. I am not saying that mechanism is better; it is nowhere near as sophisticated as Dremio. However, it did not require a cast to enable querying on this data type.
Agreed. For now, please use this workaround; we will investigate why Dremio treats it as a mixed type.
Thank you for the response. I have a related question if you have a chance to answer it. As I mentioned, the data type of modified_date is timestamp[us]. What would be the best way to convert this into a datetime value in Dremio?
Can you please try cast( as TIMESTAMP)?
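Applied to the column discussed in this thread, that suggestion would look something like the following (the dataset path is illustrative):

```sql
-- Coerce the microsecond-precision Parquet timestamp to Dremio's TIMESTAMP type.
SELECT
  id,
  CAST(modified_date AS TIMESTAMP) AS modified_ts
FROM adls.people;
```

The cast can also live inside a VDS so downstream queries never see the raw, ambiguously typed column.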