Unable to read Parquet footer on files generated with turbodbc

Hello again,

I exported 1 billion rows from SQL Server to Parquet files using turbodbc. The files load fine with pandas, but Dremio shows the error:

Unable to read Parquet footer

Is this an error in Dremio, or did I do something wrong? Do Parquet files need to be special in some way for Dremio to read them?
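
For reference, this is roughly how I exported the data (a minimal sketch of my script; the DSN, query, and file name are placeholders, and the real export is split across many numbered files):

```python
import turbodbc
import pyarrow.parquet as pq

# Connect to SQL Server through ODBC (DSN name is a placeholder).
connection = turbodbc.connect(dsn="sqlserver_dsn")
cursor = connection.cursor()

# Fetch the result set directly as an Arrow table (placeholder query).
cursor.execute("SELECT * FROM dbo.big_table")
table = cursor.fetchallarrow()

# Write the Arrow table out as a Parquet file.
pq.write_table(table, "003.parquet")
```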

Hi,

Dremio should be able to read standard Parquet files. To understand what is going on, could you provide a query profile (more information here)? Also, what version of Dremio are you running?

thanks,
Doron

The Parquet files are not recognized as a dataset, and when I try to load them as a dataset, the error message is shown.

Since the Parquet files can't be loaded as datasets, I can't query them and there are no jobs. Maybe it helps if I send you one of the files, a very small one: 003.zip (959 Bytes)

Or perhaps you can tell me how I can check whether turbodbc's Parquet files are standard…

But pandas can read them without problems…
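
In case it helps, this is roughly how I'm inspecting the footer metadata on my side (a sketch assuming pyarrow, which is what pandas uses to read the files):

```python
import pyarrow.parquet as pq

# Read only the footer metadata of the file (no data pages).
meta = pq.ParquetFile("003.parquet").metadata
print(meta)  # shows created_by, format version, row groups, columns

# Print the compression codec of each column chunk in the first row group.
row_group = meta.row_group(0)
for i in range(row_group.num_columns):
    column = row_group.column(i)
    print(column.path_in_schema, column.compression)
```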

We’ll continue to work to get to the bottom of why Dremio can’t read those Parquet files.

But I thought it would be worth pointing out that Dremio does something similar to what you’re doing manually when you create a reflection:

https://docs.dremio.com/acceleration/reflections.html

The advantage being that Dremio will keep the reflections updated based on an SLA you specify.

Might not be relevant, but I thought it was worth pointing out just in case.

Thanks for the answer. In my case, I'm trying out and comparing various solutions for the data lake at my workplace. I'm testing Dremio vs. Azure Data Lake (Analytics). I want Parquet files because:

  • Parquet is readable from Hadoop and from the pandas ecosystem.
  • It's a compressed format. In my head, Dremio can be the SQL layer to query old and new data in a uniform way, as if they were the same data. The idea is to move the old data from SQL Server to Parquet files and then delete it from SQL Server. Dremio would then let me query the data (old and new) as a single data source. That's what I'm testing Dremio for, to see whether it's true.
  • If Dremio isn't adopted by the team, the Parquet files can still be used with other tools.

I'm very new to these tools and maybe I'm misunderstanding things, or there are better ways. I'll keep testing.

P.S.: Does Dremio store the Arrow data from reflections in a compressed form?

I can reproduce the issue with your 003.parquet file; the error we get is Required field 'codec' was not present! (you can see it in the Jobs page as well, but you will have to filter for internal jobs). We will have someone with more Parquet knowledge look into this.

Dremio persists Data Reflections as Parquet. Dremio’s Parquet reader is highly optimized and reads the data into Arrow in-memory buffers.

Your approach makes sense if you’re trying to ‘retire’ the legacy system. Data Reflections aren’t intended to be a long term persistence strategy. Instead, they are used to optimize query performance. For example, you could have Dremio maintain one or more Data Reflections that are sorted/partitioned/aggregated in different ways to optimize different query patterns. Users wouldn’t ever need to know about these Data Reflections - the query planner would automatically pick the best one for a given query.

A common use case is that the source isn't being retired and will remain the system of record. In this case, Data Reflections 1) optimize the analytical processing (most sources aren't optimized for scan-heavy workloads), and 2) offload the analytical workload from the source, so the source's SLA isn't impacted.

You could still use Dremio to generate the Parquet files, even for long-term storage, by following the instructions in this thread: Converting JSON to Parquet

You might find this 1) gives better performance than turbodbc (assuming you have a Dremio cluster), and 2) works with non-relational sources (e.g., JSON).

I see. Thank you very much. I'll try using Dremio to extract more data from other tables as compressed Parquet files. We need to free up disk space.

It looks like turbodbc is probably generating invalid Parquet files. Noting the details here:

The Parquet format requires a codec to be set, but in your example file the codec is not set:

Required field ‘codec’ was not present!
Struct: ColumnMetaData(type:INT64, encodings:[PLAIN, RLE], path_in_schema:[GLNPV], codec:null, num_values:23, total_uncompressed_size:232, total_compressed_size:103, data_page_offset:4, index_page_offset:0, statistics:Statistics(max:59 87 81 47 02 07 00 00, min:D7 86 81 47 02 07 00 00, null_count:0))

org.apache.parquet.format.ColumnMetaData.validate():1923

We’ll review further but I’m guessing this is the problem.

Figured out the issue with help from the Parquet dev community. The file you're producing with Turbodbc uses Brotli compression, which was only introduced very recently in Parquet (format 2.3.2), and Dremio doesn't support it yet. (You'd probably have a similar problem with most other tools as well.) You should be able to tell Turbodbc to write using a more common compression format (such as Snappy).

I’ll file an enhancement request to keep track of this in Dremio.

Ahhh! ok. Your help was invaluable. Thanks. I’ll try compressing with Snappy.
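
If I've understood correctly, the codec is chosen when the Arrow table is written with pyarrow, so I can either change my export script or re-encode the files I already have. Something like this (file names are placeholders):

```python
import pyarrow.parquet as pq

# Re-encode an existing file with Snappy instead of Brotli.
table = pq.read_table("003.parquet")
pq.write_table(table, "003_snappy.parquet", compression="snappy")

# gzip would be another option:
# pq.write_table(table, "003_gzip.parquet", compression="gzip")
```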

One last question… does Dremio support gzip compression in Parquet files?

Yes, Dremio supports gzip in Parquet. Let us know if you have any issues.
