Cannot read parquet file from Airbyte

I was doing a test run to read an Airbyte S3 parquet file using Dremio. For that, I put a sample .parquet file in my S3 source folder and ran a sync to another folder. Then I tried to read both the source file and the copied file from S3 using Dremio.

The source file is readable, but the Airbyte-copied file is not. From the Dremio logs I got this error:
Unable to coerce from the file’s data type “timestamp” to the column’s data type “int64” in table “2022_07_11_1657578941501_0.parquet”, column “_ab_source_file_last_modified.member0”

Why is the parquet file created by Airbyte not readable?

I tried to read it from a Python notebook and there it is readable. These are the schema details I got from pyarrow.parquet:

_airbyte_ab_id: string not null
_airbyte_emitted_at: timestamp[ms, tz=UTC] not null
extra: double
mta_tax: double
VendorID: int32
ehail_fee: int32
trip_type: double
RatecodeID: double
tip_amount: double
fare_amount: double
DOLocationID: int32
PULocationID: int32
payment_type: double
tolls_amount: double
total_amount: double
trip_distance: double
passenger_count: double
store_and_fwd_flag: string
_ab_source_file_url: string
congestion_surcharge: double
lpep_pickup_datetime: string
improvement_surcharge: double
lpep_dropoff_datetime: string
_ab_source_file_last_modified: struct<member0: timestamp[us, tz=UTC], member1: string>
  child 0, member0: timestamp[us, tz=UTC]
  child 1, member1: string
_airbyte_additional_properties: map<string, string ('_airbyte_additional_properties')>
  child 0, _airbyte_additional_properties: struct<key: string not null, value: string not null> not null
      child 0, key: string not null
      child 1, value: string not null
-- schema metadata --
parquet.avro.schema: '{"type":"record","name":"s3_taxi_data","fields":[{"' + 1750
writer.model.name: 'avro'

2022_07_11_1657578941501_0.parquet.zip (2.9 MB)

Dremio version

Build : 22.0.0-202206221430090603-1fa4049f
Edition : Community Edition

I’ve attached the file. Any help is appreciated.

@sarathmaf could you provide examples of SQL commands you are running to sync/query?

It’s just a copy of one file to another via Airbyte, with no SQL involved. Only the extra fields get added to the new file, i.e. all the columns starting with _ab.
The original file has this schema (from the same Python script):

VendorID: int64
lpep_pickup_datetime: timestamp[us]
lpep_dropoff_datetime: timestamp[us]
store_and_fwd_flag: string
RatecodeID: double
PULocationID: int64
DOLocationID: int64
passenger_count: double
trip_distance: double
fare_amount: double
extra: double
mta_tax: double
tip_amount: double
tolls_amount: double
ehail_fee: null
improvement_surcharge: double
total_amount: double
payment_type: double
trip_type: double
congestion_surcharge: double
-- schema metadata --

The problem is reading this column:

_ab_source_file_last_modified: struct<member0: timestamp[us, tz=UTC], member1: string>
  child 0, member0: timestamp[us, tz=UTC]
  child 1, member1: string

Indeed, I can reproduce.
parquet-tools tells me this column is TIMESTAMP_MICROS, and a Google search tells me that various products have (or used to have) issues with this type.
We’ll discuss this with the team; in the meantime you should be able to query the file by skipping the _ab_source_file_last_modified column.

_ab_source_file_last_modified:  OPTIONAL F:2 
.member0:                       OPTIONAL INT64 L:TIMESTAMP(MICROS,true) R:0 D:2

Thanks for the reply.
Is there any way to skip the column from the UI itself in the Community Edition? Or via the API, or any other means?

Or I could simply do a SELECT with an explicit column list that skips that field, instead of SELECT *.

@sarathmaf Dremio does automatic internal coercion, but Dremio 21.0 added a feature called internal schema that lets you define a column’s data type via SQL:

https://docs.dremio.com/software/sql-reference/sql-commands/tables/#alter-table
