Dremio fails to read Parquet files that have zero rows

Hi,

We are moving to Dremio for reading Parquet files, as it is a wonderful tool. That said,
when running the latest version, 4.18, it errors out when reading Parquet files with zero rows.

The 3.2 release notes say this was fixed. I tried running version 3.2 and the problem went away.

Has anyone else seen this?

Thanks
Wayne

@waynekoepcke

Do all of the Parquet files have zero rows? The fix in 3.2 was for reading a folder that contains a mix of zero-row and non-zero-row Parquet files. We will still error out if all of the Parquet files have zero rows. Can you please confirm? Also, kindly send us the profile of the failed job.

Thanks
Bali

Hi Bali,

This happens when reading a folder that contains a mix of zero-row and non-zero-row Parquet files.
It only takes one zero-row Parquet file to cause the problem.

Attached below is the profile you requested. Thanks for the help!
Wayne

50ac81bf-d1e9-405b-969a-37e0e6a0dddc.zip (12.3 KB)
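
Since a single zero-row file is enough to trigger the error, it can help to know exactly which files in the folder are empty before querying it. Below is a minimal sketch, not anything from Dremio itself. It assumes the files end in `.parquet` and that the `pyarrow` package is installed; the helper names `parquet_row_count` and `find_zero_row_files` are made up for illustration.

```python
from pathlib import Path
from typing import Callable, List


def parquet_row_count(path: Path) -> int:
    """Return the row count recorded in a Parquet file's footer metadata."""
    # Assumption: pyarrow is available. It is imported lazily so that the
    # folder scan below can also be used with a different row counter.
    import pyarrow.parquet as pq

    return pq.read_metadata(str(path)).num_rows


def find_zero_row_files(
    folder: str,
    count_rows: Callable[[Path], int] = parquet_row_count,
) -> List[Path]:
    """List every *.parquet file under `folder` (recursively) with zero rows."""
    files = sorted(Path(folder).rglob("*.parquet"))
    return [f for f in files if count_rows(f) == 0]
```

Calling `find_zero_row_files("sample")` on a folder laid out like the one in this thread should return only the paths of the empty files, which can then be checked against the failing query.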

@waynekoepcke,

Would it be OK to provide sample Parquet files so that we can check this internally?

Hi Ye,

Sure, let me create some data you can use.

Thanks
Wayne

Hi Ye,

The attached zip file contains two Parquet files: one empty and one with data.
I created the Parquet files using Apache Drill with these CTAS statements:
create table dfs.parquet.`sample/20200426` as select 'SampleUserName' as name, 100 as memberid
create table dfs.parquet.`sample/20200427` as select name, memberid from oracle.PRODUCTION.MEMBER where name = 'does not exist'

In a clustered environment, this produces the error if I restart only the coordinator. If I also restart the executors, the problem goes away.

Procedure

  1. Create the 1st file (The one containing data)
  2. Import the dataset
  3. Create the 2nd file
  4. Restart the coordinator (Clustered environment)
  5. Run a query on the sample folder (You will see the problem)

sample.zip (1.7 KB)

Hello @balaji.ramaswamy, I have a similar problem, but with reflections. I have a raw reflection applied to a VDS, but when this VDS has 0 rows the reflection shows as unavailable and queries are pushed down to the data source.

@dacopan

Are you seeing that the reflection data has zero rows? In that case we mark the reflection as invalid, but if you know the source has data now, you should be able to refresh the reflection and then run the query.

Thanks
Bali