Support parquet files compressed via ZSTD

Is it possible to support parquet files compressed via ZSTD?

Currently it is not supported. Only snappy, gzip, and none are supported:

EnumeratedStringValidator PARQUET_WRITER_COMPRESSION_TYPE_VALIDATOR = new EnumeratedStringValidator(
      PARQUET_WRITER_COMPRESSION_TYPE, "snappy", "snappy", "gzip", "none");

Is there any way to add it?

If you want to support ZSTD, you need to modify sabot/kernel/src/main/java/com/dremio/exec/store/parquet/ParquetRecordWriter.java

and also sabot/kernel/src/main/java/com/dremio/exec/ExecConstants.java
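
For the ExecConstants.java side, the change presumably comes down to adding "zstd" to the list of accepted values. Below is a minimal, self-contained sketch of that idea. EnumeratedValidatorSketch is a hypothetical stand-in written for this example, not Dremio's real EnumeratedStringValidator, and the option key string is an assumption. The writer in ParquetRecordWriter.java would also need to map the new value to a ZSTD codec, which is not shown here.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Self-contained sketch. EnumeratedValidatorSketch is a hypothetical stand-in,
// not Dremio's real EnumeratedStringValidator.
final class ZstdOptionSketch {

    static final class EnumeratedValidatorSketch {
        private final String optionName;
        private final String defaultValue;
        private final Set<String> allowed;

        EnumeratedValidatorSketch(String optionName, String defaultValue, String... allowed) {
            this.optionName = optionName;
            this.defaultValue = defaultValue;
            this.allowed = new LinkedHashSet<>(Arrays.asList(allowed));
        }

        String getDefault() {
            return defaultValue;
        }

        // Case-insensitive membership check against the allowed values.
        boolean isValid(String value) {
            return value != null && allowed.contains(value.toLowerCase(Locale.ROOT));
        }

        // Returns the normalized value or rejects it with a descriptive error.
        String validate(String value) {
            if (!isValid(value)) {
                throw new IllegalArgumentException(
                    "Option " + optionName + " must be one of " + allowed + ", got: " + value);
            }
            return value.toLowerCase(Locale.ROOT);
        }
    }

    // The option key is an assumption made for illustration.
    static final String PARQUET_WRITER_COMPRESSION_TYPE = "store.parquet.compression";

    // Same shape as the validator quoted earlier in the thread, with "zstd" added.
    static final EnumeratedValidatorSketch VALIDATOR = new EnumeratedValidatorSketch(
        PARQUET_WRITER_COMPRESSION_TYPE, "snappy", "snappy", "gzip", "zstd", "none");
}
```

With this in place, ZstdOptionSketch.VALIDATOR.validate("zstd") is accepted, while an unlisted value such as "lz4" is still rejected.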

I modified the kernel and compiled a build that you can try: GitHub - rongfengliang/dremio-parquet-zstd

Is it possible to merge it?

As far as I know, that seems impossible.

We are currently doing this so that we can read ZStandard-compressed parquet files.

Basically, all that’s needed is adding two native libraries - a version of libhadoop.so that is compiled with ZSTD support and the ZSTD native library.

We are running Dremio using docker, so we do it like this:

---
version: '3'
services:
  executor-0:
    ...
    volumes:
      ...
      - /opt/dremio/lib/native/libhadoop.so.3.3.2:/opt/dremio/lib/libhadoop.so
      - /opt/dremio/lib/native/libzstd.so.1.5.2:/opt/dremio/lib/libzstd.so
      - /opt/dremio/lib/hadoop-common-3.3.2-dremio-202207041927090255-61c2bd1.jar:/opt/dremio/jars/3rdparty/hadoop-common-3.3.2-dremio-202207041927090255-61c2bd1.jar

One word of caution though. There’s a bug in the Dremio CE parquet reader library (which is free, but the source isn’t open): it doesn’t release the decompressor after use. That makes no difference for Snappy, but the ZStandard decompressor uses native memory, which is then leaked. This has caused our executors to get killed by the OOM killer several times per day (~350GB of memory leaked before that happens).
That’s why hadoop-common-3.3.2 is mapped in the above. In that jar I made a somewhat hacky fix that releases the native memory in a finalizer, so that the leak is somewhat contained. I have confirmed that the fix contains the issue, so I’ll make a thread on it specifically (or maybe someone from Dremio will notice it here). If no one at Dremio picks it up, I’ll do a nicer and more permanent fix for it.
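
To illustrate what "releasing the decompressor after use" means, here is a minimal sketch of the explicit-release pattern with try/finally. It is not the finalizer patch described above and not Dremio's reader code. Pool, FakeNativeDecompressor, and readPage are hypothetical names invented for this example, standing in for a codec pool and a decompressor that holds native memory.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Self-contained sketch of borrow / use / always-return. All names here are
// hypothetical stand-ins, not Hadoop or Dremio classes.
final class DecompressorReleaseSketch {

    // Stand-in for a decompressor that would hold native memory.
    static final class FakeNativeDecompressor {
    }

    // Stand-in for a pool that hands out decompressors and tracks unreturned ones.
    static final class Pool {
        private final Deque<FakeNativeDecompressor> idle = new ArrayDeque<>();
        private int outstanding;

        FakeNativeDecompressor borrow() {
            outstanding++;
            FakeNativeDecompressor d = idle.poll();
            return d != null ? d : new FakeNativeDecompressor();
        }

        void giveBack(FakeNativeDecompressor d) {
            outstanding--;
            idle.push(d);
        }

        // Number of decompressors borrowed but not yet returned (the "leak" count).
        int outstanding() {
            return outstanding;
        }
    }

    // Reads one page. The decompressor is returned in finally, so it is
    // released even when decoding fails.
    static int readPage(Pool pool, byte[] compressed) {
        FakeNativeDecompressor d = pool.borrow();
        try {
            if (compressed == null) {
                throw new IllegalArgumentException("no page data");
            }
            return compressed.length; // placeholder for the real decompression work
        } finally {
            pool.giveBack(d);
        }
    }
}
```

The point of the finally block is that the count of outstanding decompressors goes back to zero on both the success path and the failure path, so nothing depends on the garbage collector to reclaim native memory.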
