Cannot read parquet file from Airbyte

I was doing a test run to read an Airbyte S3 parquet file using Dremio. For that, I put a sample .parquet file in my S3 source folder and ran a sync to another folder. Then I tried to read both the source file and the copied file from S3 using Dremio.

The source file is readable, but the Airbyte-copied file is not. From the Dremio logs I got this error:
Unable to coerce from the file’s data type “timestamp” to the column’s data type “int64” in table “2022_07_11_1657578941501_0.parquet”, column “_ab_source_file_last_modified.member0”

Why is the parquet file created by Airbyte not readable?

I tried to read it from a Python notebook and there it is readable. These are the schema details I got from pyarrow.parquet:

_airbyte_ab_id: string not null
_airbyte_emitted_at: timestamp[ms, tz=UTC] not null
extra: double
mta_tax: double
VendorID: int32
ehail_fee: int32
trip_type: double
RatecodeID: double
tip_amount: double
fare_amount: double
DOLocationID: int32
PULocationID: int32
payment_type: double
tolls_amount: double
total_amount: double
trip_distance: double
passenger_count: double
store_and_fwd_flag: string
_ab_source_file_url: string
congestion_surcharge: double
lpep_pickup_datetime: string
improvement_surcharge: double
lpep_dropoff_datetime: string
_ab_source_file_last_modified: struct<member0: timestamp[us, tz=UTC], member1: string>
  child 0, member0: timestamp[us, tz=UTC]
  child 1, member1: string
_airbyte_additional_properties: map<string, string ('_airbyte_additional_properties')>
  child 0, _airbyte_additional_properties: struct<key: string not null, value: string not null> not null
      child 0, key: string not null
      child 1, value: string not null
-- schema metadata --
parquet.avro.schema: '{"type":"record","name":"s3_taxi_data","fields":[{"' + 1750
writer.model.name: 'avro'

2022_07_11_1657578941501_0.parquet.zip (2.9 MB)

Dremio version

Build : 22.0.0-202206221430090603-1fa4049f
Edition : Community Edition

I’ve attached the file. Any help is appreciated.

@sarathmaf could you provide examples of SQL commands you are running to sync/query?

It’s just a copy of one file to another via Airbyte, with no SQL involved. Only the extra fields get added to the new file, i.e. all the columns starting with _ab.
The original file has this schema (from the same Python script):

VendorID: int64
lpep_pickup_datetime: timestamp[us]
lpep_dropoff_datetime: timestamp[us]
store_and_fwd_flag: string
RatecodeID: double
PULocationID: int64
DOLocationID: int64
passenger_count: double
trip_distance: double
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
ehail_fee: null
improvement_surcharge: double
total_amount: double
payment_type: double
trip_type: double
congestion_surcharge: double
-- schema metadata --

The problem is reading this column:

_ab_source_file_last_modified: struct<member0: timestamp[us, tz=UTC], member1: string>
  child 0, member0: timestamp[us, tz=UTC]
  child 1, member1: string

Indeed, I can reproduce.
parquet-tools tells me this column is TIMESTAMP_MICROS, and a Google search tells me that various products have (or used to have) issues with this type.
We’ll discuss this with the team; in the meantime you should be able to query the file by skipping the _ab_source_file_last_modified column.

_ab_source_file_last_modified:  OPTIONAL F:2 
.member0:                       OPTIONAL INT64 L:TIMESTAMP(MICROS,true) R:0 D:2

Thanks for the reply.
Is there any way to skip the column from the UI itself in the Community Edition? Or via the API, or any other means?

Or I could simply do a SELECT with an explicit column list that skips that field, instead of SELECT *.

@sarathmaf Dremio does automatic internal coercion, but Dremio 21.0 added a feature called internal schema that lets you define a column’s data type via SQL:

https://docs.dremio.com/software/sql-reference/sql-commands/tables/#alter-table
