Basically, all that’s needed is adding two native libraries: a version of libhadoop.so that is compiled with ZSTD support, and the ZSTD native library itself.
We run Dremio in Docker, so we mount them as volumes in the compose file like this:
---
version: '3'
services:
  executor-0:
    ...
    volumes:
      ...
      - /opt/dremio/lib/native/libhadoop.so.3.3.2:/opt/dremio/lib/libhadoop.so
      - /opt/dremio/lib/native/libzstd.so.1.5.2:/opt/dremio/lib/libzstd.so
      - /opt/dremio/lib/hadoop-common-3.3.2-dremio-202207041927090255-61c2bd1.jar:/opt/dremio/jars/3rdparty/hadoop-common-3.3.2-dremio-202207041927090255-61c2bd1.jar
One word of caution, though. There’s a bug in the Dremio CE Parquet reader library (which is free, but the source isn’t open): it doesn’t release the decompressor after use. That makes no difference for Snappy, but the ZStandard decompressor uses native memory, which is then leaked. This has caused our executors to get killed by the OOM killer several times per day (after leaking ~350GB of memory each time).
That’s why hadoop-common-3.3.2 is mapped in the above: it is a build in which I made a somewhat hacky fix that releases the native memory in a finalizer, so that the leak is somewhat contained. I have confirmed that the fix contains the issue, so I’ll make a thread on it specifically (or maybe someone from Dremio will notice it here). If nobody at Dremio picks it up, I’ll do a nicer and more permanent fix for it.
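
For anyone wondering what "releasing the native memory in a finalizer" can look like, here is a minimal sketch of the general pattern. It is not the actual patch to hadoop-common: the class name and the counter that stands in for a native allocation are made up for illustration, and no Hadoop or Dremio classes are used.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a decompressor that owns native (off-heap) memory.
// The "native allocation" is simulated with a counter; nothing here is Hadoop's API.
final class FinalizerGuardedDecompressor {
    // Number of simulated native allocations that have not been freed yet.
    private static final AtomicInteger LIVE_HANDLES = new AtomicInteger();

    // Non-zero while the simulated native memory is still allocated.
    private long handle;

    FinalizerGuardedDecompressor() {
        handle = 1L;                     // pretend this came from a native malloc
        LIVE_HANDLES.incrementAndGet();
    }

    // Explicit release. Idempotent, so calling it twice (or calling it and
    // then being finalized) frees the memory only once.
    synchronized void end() {
        if (handle != 0) {
            handle = 0;                  // pretend this is the native free
            LIVE_HANDLES.decrementAndGet();
        }
    }

    // Safety net: if the caller never calls end(), the GC eventually does.
    // Finalizers are deprecated in modern Java and run at unpredictable times,
    // which is why this only contains a leak rather than fixing it properly.
    @Override
    @SuppressWarnings({"deprecation", "removal"})
    protected void finalize() throws Throwable {
        try {
            end();
        } finally {
            super.finalize();
        }
    }

    static int liveHandles() {
        return LIVE_HANDLES.get();
    }

    public static void main(String[] args) {
        FinalizerGuardedDecompressor d = new FinalizerGuardedDecompressor();
        System.out.println(liveHandles()); // 1
        d.end();
        d.end();                           // second call is a no-op
        System.out.println(liveHandles()); // 0
    }
}
```

The finalizer only bounds the leak, because the memory is freed whenever the garbage collector gets around to the object. A more permanent fix would be for the reader to call the release method itself as soon as it is done with the decompressor.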