Support parquet files compressed via ZSTD

wundi · October 27, 2022, 5:24pm

Basically, all that’s needed is adding two native libraries: a version of libhadoop.so that is compiled with ZSTD support, and the ZSTD native library itself.

We are running Dremio using docker, so we do it like this:

---
version: '3'
services:
  executor-0:
    ...
    volumes:
      ...
      - /opt/dremio/lib/native/libhadoop.so.3.3.2:/opt/dremio/lib/libhadoop.so
      - /opt/dremio/lib/native/libzstd.so.1.5.2:/opt/dremio/lib/libzstd.so
      - /opt/dremio/lib/hadoop-common-3.3.2-dremio-202207041927090255-61c2bd1.jar:/opt/dremio/jars/3rdparty/hadoop-common-3.3.2-dremio-202207041927090255-61c2bd1.jar

One word of caution, though. There’s a bug in the Dremio CE parquet reader library (which is free, but the source isn’t open): it doesn’t release the decompressor after use. That makes no difference for Snappy, but the ZStandard decompressor uses native memory, which is then leaked. This has caused our executors to get killed by the OOM killer several times per day (after leaking ~350GB of memory).
That’s why hadoop-common-3.3.2 is also mapped in the above: I made a somewhat hacky fix in it that releases the native memory in a finalizer, so that the leak is somewhat contained. I have confirmed that the fix contains the issue, so I’ll make a thread on it specifically (or maybe someone from Dremio will notice it here). If no one at Dremio picks it up, I’ll be doing a nicer and more permanent fix for it.
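
For anyone wondering what "releasing the native memory in a finalizer" looks like in principle, here is a minimal sketch. It is not the actual patch: the class names are hypothetical, and the native allocation is simulated with a plain counter so the example runs without any native library. The only assumption carried over from Hadoop is that a decompressor exposes an end() method for freeing its resources.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch, not the patched hadoop-common code.
final class FinalizerReleaseSketch {
    // Simulated number of bytes held outside the Java heap.
    static final AtomicLong NATIVE_BYTES = new AtomicLong();

    // Stand-in for a decompressor that owns a native context.
    static class NativeDecompressor {
        private final long size;
        private long handle; // non-zero while the native context is alive

        NativeDecompressor(long size) {
            this.size = size;
            this.handle = 1; // stand-in for a pointer returned by native init
            NATIVE_BYTES.addAndGet(size);
        }

        // Explicit release. Idempotent, so a later finalizer call is harmless.
        synchronized void end() {
            if (handle != 0) {
                NATIVE_BYTES.addAndGet(-size);
                handle = 0;
            }
        }

        // Safety net: if the caller never calls end(), the garbage collector
        // eventually triggers the release. Timing is not guaranteed, which is
        // why this only contains a leak rather than fixing it properly.
        @Override
        @SuppressWarnings({"deprecation", "removal"})
        protected void finalize() throws Throwable {
            try {
                end();
            } finally {
                super.finalize();
            }
        }
    }

    public static void main(String[] args) {
        NativeDecompressor d = new NativeDecompressor(1024);
        System.out.println("held after init: " + NATIVE_BYTES.get());
        d.end();
        System.out.println("held after end(): " + NATIVE_BYTES.get());
    }
}
```

Finalizers are deprecated in recent Java versions and only run when the GC gets around to it, so a proper fix would be for the reader to call end() (or return the decompressor to its pool) as soon as it is done with it.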
