Window Functions throw UnsupportedOperationException with ADLS + Parquet Files

Hello,

I am working with the latest Dremio Docker image, dremio/dremio-oss:latest, which looks to be version 14. I am using Dremio to connect to an Azure Blob Storage v2 account that has a folder of Parquet files. I can build out a connection to ADLS and set up a dataset on the Parquet files, and this works without any issue. However, I am not able to use any window functions in my queries. The two window functions I tried were row_number() and count(), and both resulted in an UnsupportedOperationException error.
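Both queries had this general shape (a simplified sketch; the dataset name "people" and the column names here are stand-ins for my actual data):

SELECT row_number() OVER (PARTITION BY id ORDER BY modified_date) AS rn
FROM people

SELECT count(*) OVER (PARTITION BY id) AS cnt
FROM people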

The exact error was as follows:
SYSTEM ERROR: UnsupportedOperationException

SqlOperatorImpl HASH_PARTITION_SENDER
Location 2:0:0
SqlOperatorImpl HASH_PARTITION_SENDER
Location 2:0:0
Fragment 2:0

[Error Id: d8f0553c-8b83-45dd-a2be-dc7176c4485a on 15da02e03768:0]

(java.lang.UnsupportedOperationException) null
org.apache.arrow.vector.complex.UnionVector.getValidityBufferAddress():241
com.dremio.sabot.op.sender.partition.vectorized.MultiDestCopier.&lt;init&gt;():63
com.dremio.sabot.op.sender.partition.vectorized.MultiDestCopier.&lt;init&gt;():41
com.dremio.sabot.op.sender.partition.vectorized.MultiDestCopier$GenericCopier.&lt;init&gt;():372
com.dremio.sabot.op.sender.partition.vectorized.MultiDestCopier.addValueCopier():449
com.dremio.sabot.op.sender.partition.vectorized.MultiDestCopier.getCopiers():477
com.dremio.sabot.op.sender.partition.vectorized.VectorizedPartitionSenderOperator.setup():185
com.dremio.sabot.driver.SmartOp$SmartTerminal.setup():346
com.dremio.sabot.driver.Pipe$SetupVisitor.visitTerminalOperator():85
com.dremio.sabot.driver.Pipe$SetupVisitor.visitTerminalOperator():63
com.dremio.sabot.driver.SmartOp$SmartTerminal.accept():290
com.dremio.sabot.driver.StraightPipe.setup():103
com.dremio.sabot.driver.Pipeline.setup():68
com.dremio.sabot.exec.fragment.FragmentExecutor.setupExecution():392
com.dremio.sabot.exec.fragment.FragmentExecutor.run():273
com.dremio.sabot.exec.fragment.FragmentExecutor.access$1400():94
com.dremio.sabot.exec.fragment.FragmentExecutor$AsyncTaskImpl.run():712
com.dremio.sabot.task.AsyncTaskWrapper.run():112
com.dremio.sabot.task.slicing.SlicingThread.mainExecutionLoop():228
com.dremio.sabot.task.slicing.SlicingThread.run():159

I also tried version 13.1 and encountered the same error.

Is it possible to run window functions against Parquet files on ADLS?

@kentmaxwell

This is a known issue that we have an internal ticket for. I will add your report to the internal ticket to help prioritize fixing the issue.

Thanks
Bali

@kentmaxwell

I just heard back from engineering that this is due to the data having mixed types. Dremio has deprecated support for mixed types, so there are a couple of choices:

  • Clean the lake so there is a single type for that column
  • Cast the column to a single type, save it as a VDS, and query the VDS that has the cast (see the sketch below)
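A minimal sketch of the second option, with hypothetical names (lake.people for the promoted dataset, myspace.people_clean for the VDS, mixed_col for the mixed-type column, and VARCHAR as an example target type):

CREATE VDS myspace.people_clean AS
SELECT
-- cast the mixed-type column to a single type; keep the other columns as-is
CAST(mixed_col AS VARCHAR) AS mixed_col
FROM lake.people

Queries that use window functions would then target myspace.people_clean instead of the raw dataset.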

@balaji.ramaswamy

Thank you for responding so fast. I do not agree that this is a mixed-type problem. I was running the window functions against a single Parquet file in an ADLS blob location. The schema of the file is:
Schema:
id: int64 PARQUET:field_id: '1'
first_name: string PARQUET:field_id: '2'
last_name: string PARQUET:field_id: '3'
birth_date: date32[day] PARQUET:field_id: '4'
gender: string PARQUET:field_id: '5'
modified_date: timestamp[us] PARQUET:field_id: '6'
execution_id: string PARQUET:field_id: '7'
filename: string PARQUET:field_id: '8'

I was attempting to do a ROW_NUMBER() OVER (PARTITION BY id ORDER BY modified_date). I can share this Parquet file if you want; it is only 2.3 MB and contains only test data.

I attempted the same window function on a Parquet file with the same schema but only 24 rows, and the function worked. It seems this issue comes from the size of the file, not its schema.

@kentmaxwell

That would be super useful for reproducing the issue and seeing if we are hitting a bug. Are you able to provide us with the Parquet file?

@balaji.ramaswamy

No problem. The Parquet file is downloadable from this URL: Upload files for free - people_2021-03-19_205719804817.parquet - ufile.io

Please confirm that you were able to download and use the file with success.

Thanks!

@kentmaxwell

Yes, I was able to download the file. For some reason Dremio marks "modified_date" as a mixed type. Let us investigate and get back to you.

Would you have the query that you are running on this Parquet dataset?

Thanks
Bali

@balaji.ramaswamy

As a new user of Dremio, I completely missed the fact that modified_date comes up as a mixed type. I was able to get the query I wanted to work by coercing the value to an integer. The query I was trying to run was:

SELECT
row_number() over (partition by id order by modified_date) as revision,
*
FROM people
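
Concretely, the version that worked looked roughly like this (BIGINT is the integer type I coerced to; the exact cast may vary):

SELECT
row_number() over (partition by id order by CAST(modified_date AS BIGINT)) as revision,
*
FROM people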

The metadata in the Parquet file shows this is a timestamp[us]. Why would this value be interpreted as a mixed type? What other type could it be, other than integer?

I have run queries against this exact dataset using the Polybase/Hadoop bridge connector Microsoft provides in SQL Server. I am not saying that mechanism is better; it is nowhere near as sophisticated as Dremio. However, it did not require a cast to enable querying this data type.

@kentmaxwell

Agreed. For now, please use this workaround; we will investigate why Dremio sees it as a mixed type.

Hi @balaji.ramaswamy

Thank you for the response. I have a related question, if you have a chance to answer it. As I indicated, the data type of modified_date is timestamp[us]. What would be the best way to convert this into a DateTime value in Dremio?

Thanks!

@kentmaxwell

Can you please try CAST(modified_date AS TIMESTAMP)?
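
For example, a quick sketch against the dataset from this thread:

SELECT
CAST(modified_date AS TIMESTAMP) AS modified_date_ts
FROM people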