Hello,
I am working with the latest Dremio Docker image, dremio/dremio-oss:latest, which appears to be version 14. I am using Dremio to connect to an Azure Blob Storage v2 account that has a folder of Parquet files. I can build a connection to ADLS and set up a dataset with the Parquet files, and this works without any issue. However, I am not able to use any window functions in my queries. The two window functions I tried were row_number() and count(), and both resulted in an UnsupportedOperationException error.
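Both failing queries were of this general shape (the count() variant is approximate from memory; the dataset and column names are from my test data):

SELECT
row_number() over (partition by id order by modified_date) as rn,
*
FROM people

SELECT
count(*) over (partition by id) as cnt,
*
FROM people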
The exact error was as follows:
SYSTEM ERROR: UnsupportedOperationException
SqlOperatorImpl HASH_PARTITION_SENDER
Location 2:0:0
SqlOperatorImpl HASH_PARTITION_SENDER
Location 2:0:0
Fragment 2:0
[Error Id: d8f0553c-8b83-45dd-a2be-dc7176c4485a on 15da02e03768:0]
(java.lang.UnsupportedOperationException) null
org.apache.arrow.vector.complex.UnionVector.getValidityBufferAddress():241
com.dremio.sabot.op.sender.partition.vectorized.MultiDestCopier.<init>():63
com.dremio.sabot.op.sender.partition.vectorized.MultiDestCopier.<init>():41
com.dremio.sabot.op.sender.partition.vectorized.MultiDestCopier$GenericCopier.<init>():372
com.dremio.sabot.op.sender.partition.vectorized.MultiDestCopier.addValueCopier():449
com.dremio.sabot.op.sender.partition.vectorized.MultiDestCopier.getCopiers():477
com.dremio.sabot.op.sender.partition.vectorized.VectorizedPartitionSenderOperator.setup():185
com.dremio.sabot.driver.SmartOp$SmartTerminal.setup():346
com.dremio.sabot.driver.Pipe$SetupVisitor.visitTerminalOperator():85
com.dremio.sabot.driver.Pipe$SetupVisitor.visitTerminalOperator():63
com.dremio.sabot.driver.SmartOp$SmartTerminal.accept():290
com.dremio.sabot.driver.StraightPipe.setup():103
com.dremio.sabot.driver.Pipeline.setup():68
com.dremio.sabot.exec.fragment.FragmentExecutor.setupExecution():392
com.dremio.sabot.exec.fragment.FragmentExecutor.run():273
com.dremio.sabot.exec.fragment.FragmentExecutor.access$1400():94
com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl.run():712
com.dremio.sabot.task.AsyncTaskWrapper.run():112
com.dremio.sabot.task.slicing.SlicingThread.mainExecutionLoop():228
com.dremio.sabot.task.slicing.SlicingThread.run():159
I also tried version 13.1 and encountered the same error.
Is it possible to use window functions against Parquet files on ADLS?
@kentmaxwell
This is a known issue that we have an internal ticket for. I will add this report to the internal ticket to help prioritize fixing the issue.
Thanks
Bali
@kentmaxwell
I just heard back from engineering that this is due to the data having mixed types. Dremio has deprecated support for mixed types, so there are a couple of choices:
- Clean the lake so there is a single type for that column
- Cast the column to a specific type, save it as a VDS, and query the VDS that contains the cast (see the sketch below)
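A minimal sketch of the second option, assuming a space named myspace, a dataset named my_dataset, and a mixed-type column my_column (all names here are illustrative, not from your environment):

-- Create a virtual dataset that normalizes the mixed-type column to a single type
CREATE VDS myspace.my_dataset_clean AS
SELECT
CAST(my_column AS VARCHAR) AS my_column_clean
FROM my_dataset

-- Then run the window function against the VDS instead of the physical dataset
SELECT * FROM myspace.my_dataset_clean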
@balaji.ramaswamy
Thank you for responding so quickly. I do not agree that this is a problem of mixed types. In my attempt to use window functions, I was querying a single Parquet file in an ADLS blob location. The schema of the file is:
Schema:
id: int64 PARQUET:field_id: '1'
first_name: string PARQUET:field_id: '2'
last_name: string PARQUET:field_id: '3'
birth_date: date32[day] PARQUET:field_id: '4'
gender: string PARQUET:field_id: '5'
modified_date: timestamp[us] PARQUET:field_id: '6'
execution_id: string PARQUET:field_id: '7'
filename: string PARQUET:field_id: '8'
I was attempting to do a ROW_NUMBER() OVER (PARTITION BY id ORDER BY modified_date). I can share this Parquet file if you want; it is only 2.3 MB and is entirely test data.
I attempted the same window function on a Parquet file with the same schema but only 24 rows, and the function worked. It seems the nature of this issue lies in the size of the file, not its schema.
@kentmaxwell
That would be super useful for reproducing the issue and seeing if we are hitting a bug. Are you able to provide us with the Parquet file?
@balaji.ramaswamy
No problem. The Parquet file is downloadable from this URL: Upload files for free - people_2021-03-19_205719804817.parquet - ufile.io
Please confirm that you were able to download and use the file with success.
Thanks!
@kentmaxwell
Yes, I was able to download the file. For some reason Dremio marks "modified_date" as a mixed type. Let us investigate and get back to you.
Could you share the query that you are running on this Parquet dataset?
Thanks
Bali
@balaji.ramaswamy
As a new user of Dremio, I completely missed the fact that modified_date comes up as a mixed type. I was able to get the query I wanted working by coercing the value to an integer. The query I was trying to run was:
SELECT
row_number() over (partition by id order by modified_date) as revision,
*
FROM people
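For completeness, the version that worked coerced the mixed-type column to an integer before ordering; it looked something like this (the exact cast is from memory):

SELECT
row_number() over (partition by id order by cast(modified_date as bigint)) as revision,
*
FROM people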
The metadata in the Parquet file shows this is a timestamp[us]. Why would this value be interpreted as a mixed type? What other type could it be besides integer?
I have run queries against this exact dataset using the PolyBase/Hadoop bridge connector Microsoft provides in SQL Server. I am not saying that mechanism is better; it is not nearly as sophisticated as Dremio. However, it did not require a cast to enable querying this data type.
@kentmaxwell
Agreed. For now, please use this workaround; we will investigate why Dremio reads this column as a mixed type.
Hi @balaji.ramaswamy
Thank you for the response. I have a related question if you have a chance to answer it. As I indicated, the data type of modified_date is timestamp[us]. What would be the best way to convert this into a DateTime value in Dremio?
Thanks!
@kentmaxwell
Can you please try CAST(modified_date AS TIMESTAMP)?
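For example, something along these lines (dataset and column names taken from your earlier posts):

SELECT
id,
CAST(modified_date AS TIMESTAMP) AS modified_ts
FROM people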