Unable to read Parquet footer on files generated with turbodbc

Hello again,

I exported 1 billion rows from SQL Server to Parquet files using turbodbc. The files load fine with pandas, but Dremio shows the error:

Unable to read Parquet footer

Is this an error in Dremio, or did I do something wrong? Do Parquet files need to be special in some way for Dremio to read them?
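
For reference, this is roughly how I exported the data (a minimal sketch of my script; the DSN, query, and file name are placeholders, and the real export is split across many numbered files):

```python
import turbodbc
import pyarrow.parquet as pq

# Connect to SQL Server through ODBC (DSN name is a placeholder).
connection = turbodbc.connect(dsn="sqlserver_dsn")
cursor = connection.cursor()

# Fetch the result set directly as an Arrow table (placeholder query).
cursor.execute("SELECT * FROM dbo.big_table")
table = cursor.fetchallarrow()

# Write the Arrow table out as a Parquet file.
pq.write_table(table, "003.parquet")
```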

Hi,

Dremio should be able to read standard Parquet files. To understand what is going on, could you provide a query profile (more information here)? Also, what version of Dremio are you running?

thanks,
Doron

The Parquet files are not recognized as a dataset, and when I try to load them as a dataset, the error message is shown.

Since the Parquet files can't be loaded as datasets, I can't query them and there are no jobs. Maybe it helps if I send you one of the files, a very small one: 003.zip (959 Bytes)

Or perhaps you can tell me how I can check whether turbodbc's Parquet files are standard…

But pandas can read them without problems…
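
In case it helps, this is roughly how I'm inspecting the footer metadata on my side (a sketch assuming pyarrow, which is what pandas uses to read the files):

```python
import pyarrow.parquet as pq

# Read only the footer metadata of the file (no data pages).
meta = pq.ParquetFile("003.parquet").metadata
print(meta)  # shows created_by, format version, row groups, columns

# Print the compression codec of each column chunk in the first row group.
row_group = meta.row_group(0)
for i in range(row_group.num_columns):
    column = row_group.column(i)
    print(column.path_in_schema, column.compression)
```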

We’ll continue to work to get to the bottom of why Dremio can’t read those Parquet files.

But I thought it would be worth pointing out that Dremio does something similar to what you’re doing manually when you create a reflection:

https://docs.dremio.com/acceleration/reflections.html

The advantage being that Dremio will keep the reflections updated based on an SLA you specify.

Might not be relevant, but I thought it was worth pointing out just in case.

Thanks for the answer. In my case, I'm trying out and comparing various solutions for the data lake at my workplace. I'm testing Dremio vs. Azure Data Lake (Analytics). I want Parquet files because:

  • Parquet is readable from Hadoop and from the pandas ecosystem.
  • It's a compressed format. In my head, Dremio can be the SQL layer to query old and new data in a uniform way, as if they were the same data. The idea is to move the old data from SQL Server to Parquet files and then delete it from SQL Server. Dremio would then let me query the data (old and new) as a single data source. That's what I'm testing Dremio for, to see whether it's true.
  • If Dremio isn't adopted by the team, the Parquet files can still be used with other tools.

I'm very new to these tools and maybe I'm misunderstanding things, or there are better ways. I'll keep testing.

P.S.: Does Dremio store the Arrow data from reflections in a compressed form?

I can reproduce the issue with your 003.parquet file; the error we get is Required field 'codec' was not present! (you can see it in the Jobs page as well, but you will have to filter for internal jobs). We will have someone with more Parquet knowledge look into this.

Dremio persists Data Reflections as Parquet. Dremio’s Parquet reader is highly optimized and reads the data into Arrow in-memory buffers.

Your approach makes sense if you’re trying to ‘retire’ the legacy system. Data Reflections aren’t intended to be a long term persistence strategy. Instead, they are used to optimize query performance. For example, you could have Dremio maintain one or more Data Reflections that are sorted/partitioned/aggregated in different ways to optimize different query patterns. Users wouldn’t ever need to know about these Data Reflections - the query planner would automatically pick the best one for a given query.

A common use case is that the source isn't being retired and will remain the system of record. In this case, Data Reflections 1) optimize the analytical processing (most sources aren't optimized for scan-heavy workloads), and 2) offload the analytical workload from the source, so the source's SLA isn't impacted.

You could still use Dremio to generate the Parquet files, even for long-term storage, by following the instructions in this thread: Converting JSON to Parquet

You might find this 1) gives better performance than turbodbc (assuming you have a Dremio cluster), and 2) works with non-relational sources (e.g., JSON).

I see. Thank you very much. I'll try using Dremio to extract more data from other tables as compressed Parquet files. We need to free up disk space.

It looks like turbodbc is probably generating invalid Parquet files. Noting the details here:

The Parquet format requires a codec to be set, but in your example file the codec is not set:

Required field ‘codec’ was not present!
Struct: ColumnMetaData(type:INT64, encodings:[PLAIN, RLE], path_in_schema:[GLNPV], codec:null, num_values:23, total_uncompressed_size:232, total_compressed_size:103, data_page_offset:4, index_page_offset:0, statistics:Statistics(max:59 87 81 47 02 07 00 00, min:D7 86 81 47 02 07 00 00, null_count:0))

org.apache.parquet.format.ColumnMetaData.validate():1923

We’ll review further but I’m guessing this is the problem.

Figured out the issue with help from the Parquet dev community. The file you're producing with Turbodbc uses Brotli compression, which was only introduced very recently in Parquet (format 2.3.2), and Dremio doesn't support it yet. (You'd probably have a similar problem with most other tools as well.) You should be able to tell Turbodbc to write using a more common compression format (such as Snappy).

I’ll file an enhancement request to keep track of this in Dremio.

Ahhh! ok. Your help was invaluable. Thanks. I’ll try compressing with Snappy.
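
If I've understood correctly, the codec is chosen when the Arrow table is written with pyarrow, so I can either change my export script or re-encode the files I already have. Something like this (file names are placeholders):

```python
import pyarrow.parquet as pq

# Re-encode an existing file with Snappy instead of Brotli.
table = pq.read_table("003.parquet")
pq.write_table(table, "003_snappy.parquet", compression="snappy")

# gzip would be another option:
# pq.write_table(table, "003_gzip.parquet", compression="gzip")
```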

One last question… does Dremio support gzip compression in Parquet files?

Yes, Dremio supports gzip in Parquet. Let us know if you have any issues.
