Dremio fails to read Parquet files that have zero rows

waynekoepcke · April 25, 2020, 12:39am

Hi,

We are moving to Dremio for reading Parquet files, as it is a wonderful tool. That said,
when running the latest version (4.18), it errors out when reading Parquet files with zero rows.

I saw in the 3.2 release notes that this was fixed. I tried running version 3.2 and the problem went away.

Has anyone else seen this?

Thanks
Wayne

balaji.ramaswamy · April 27, 2020, 6:08am

@waynekoepcke

Do all of the Parquet files have zero rows? The fix in 3.2 was for reading a folder of Parquet files that contains a mix of zero-row and non-zero-row files. We will still error out if all of the Parquet files have zero rows. Can you please confirm? Also, kindly send us the profile of the failed job.

Thanks
Bali

waynekoepcke · April 27, 2020, 4:40pm

Hi Bali,

This happens when reading a folder that contains a mix of zero row and non-zero row parquet files.
It only takes 1 zero row parquet file to cause the problem.

Attached below is the profile you requested. Thanks for the help!
Wayne

50ac81bf-d1e9-405b-969a-37e0e6a0dddc.zip (12.3 KB)

Ye_Li · April 27, 2020, 5:33pm

@waynekoepcke,

Is it OK to provide sample parquet files so that we can check it internally?

waynekoepcke · April 27, 2020, 6:48pm

Hi Ye,

Sure let me create some data you can use.

Thanks
Wayne

waynekoepcke · April 28, 2020, 12:17am

Hi Ye,

The attached zip file contains 2 Parquet files: one empty and one with data.
I created the Parquet files using Apache Drill with these CTAS statements:
create table dfs.parquet.`sample/20200426` as select 'SampleUserName' as name, 100 as memberid
create table dfs.parquet.`sample/20200427` as select name, memberid from oracle.PRODUCTION.MEMBER where name = 'does not exist'

In a clustered environment this will produce the error if I restart the coordinator. If I also restart the executors the problem will go away.

Procedure

1. Create the 1st file (the one containing data)
2. Import the dataset
3. Create the 2nd file
4. Restart the coordinator (clustered environment)
5. Run a query on the sample folder (you will see the problem)

sample.zip (1.7 KB)

dacopan · April 30, 2020, 1:53pm

Hello @balaji.ramaswamy, I have a similar problem, but with reflections. I have a raw reflection applied to a VDS, but when this VDS has 0 rows, the reflection shows as unavailable and queries are pushed down to the data source.

balaji.ramaswamy · April 30, 2020, 6:18pm

@dacopan

Are you seeing that the reflection data has zero rows? In that case we mark the reflection as invalid, but if you know the source has data now, you should be able to refresh the reflection and then run the query.

Thanks
Bali
