Improve S3 Parquet mapping and metadata update


In my team we deal with a lot of Parquet files (~100k files, ~250 GB, and growing). The data is stored on a local disk as well as on S3, but we have always used the local disk since we run a single Dremio instance. We are now considering using S3 as the data source, since we need multiple Dremio instances (in different environments, not a multi-node installation) to access the data. However, we found that performance on S3 is terrible, especially when first mapping the objects or updating the metadata (10 minutes, compared to a few seconds on disk).
We also tried MinIO, which is slower than disk but still much faster than AWS S3.
When we map S3 objects, none of the resources (CPU, memory, network…) comes close to its limit.
We wonder why S3 discovery and metadata updates are so slow compared to both disk and MinIO (which emulates the same S3 API), and whether there are settings we can tweak.

Hi @Luca,

We need to find out where the time is spent. Are your Dremio executors and coordinator in the same region as the S3 bucket? What is your standalone machine configuration in terms of CPU/memory?

Where is your RocksDB stored? On SSD?
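
One way to see where the time goes is to time the raw S3 calls outside Dremio. Below is a rough sketch, assuming boto3 is installed and AWS credentials are already configured. The helper names `time_listing` and `run_against_s3` are made up for illustration, and this does not reproduce what Dremio does internally; it only measures how long plain LIST and HEAD calls take from the same machine.

```python
import time


def time_listing(client, bucket, prefix="", head_sample=20):
    """Time LIST pagination and a small sample of HEAD calls for one prefix.

    `client` is any object exposing the boto3 S3 client methods
    get_paginator("list_objects_v2") and head_object.
    """
    keys = []
    pages = 0
    start = time.perf_counter()
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        pages += 1
        # "Contents" is absent when a page holds no objects.
        keys.extend(obj["Key"] for obj in page.get("Contents", []))
    list_seconds = time.perf_counter() - start

    # HEAD only the first few keys, one at a time, to estimate per-request latency.
    sample = keys[:head_sample]
    start = time.perf_counter()
    for key in sample:
        client.head_object(Bucket=bucket, Key=key)
    head_seconds = time.perf_counter() - start

    return {
        "keys": len(keys),
        "list_pages": pages,
        "list_seconds": list_seconds,
        "head_calls": len(sample),
        "avg_head_ms": 1000 * head_seconds / len(sample) if sample else 0.0,
    }


def run_against_s3(bucket, prefix=""):
    # boto3 is imported here so time_listing stays usable with any client object.
    import boto3

    return time_listing(boto3.client("s3"), bucket, prefix)
```

Calling `run_against_s3("my-bucket", "parquet/")` (placeholder bucket and prefix) from the Dremio host would show whether the listing itself or the per-object requests dominate. For a MinIO comparison, the same helper can be passed a client created with `boto3.client("s3", endpoint_url=...)` pointing at the MinIO server.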

Hi @balaji.ramaswamy,
This is a single-node configuration (the same installation is both the only executor and the only coordinator) on an 8-core, 32 GB EC2 instance (m5.2xlarge). The Dremio database is stored on an attached EBS volume (gp2), essentially an SSD limited to ~1000 IOPS. Everything is in the same region.
Of course this is not a top-end configuration, but we ran some tests and found that:

  1. During PDS creation, the AWS logs report a lot of unspecified 4XX errors and a sustained ~4K “listobject” and “head” requests per minute.
  2. Creating a PDS on S3 raises network traffic to only ~500 pps (packets per second), while creating a PDS on local MinIO generates far more (~6K pps) with the same data copied locally to disk.
  3. During PDS creation, RAM, CPU and network are never even close to their limits; it looks like Dremio is doing nothing. So I suspect this is not really a hardware-related problem; it sounds more like some sort of overhead or latency in the S3 interface.
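
A back-of-the-envelope check of the request rate above, as a small sketch. The function names are invented for illustration, and the number of requests in flight is an assumption, since we do not know how many concurrent S3 requests Dremio actually issues.

```python
def implied_ms_per_request(requests_per_minute, in_flight=1):
    """Average duration of each request if `in_flight` requests run concurrently."""
    return 60_000 * in_flight / requests_per_minute


def total_requests(requests_per_minute, minutes):
    """Number of requests issued at a sustained rate over the given time."""
    return requests_per_minute * minutes


print(implied_ms_per_request(4000))     # 15.0 ms each if issued one at a time
print(implied_ms_per_request(4000, 8))  # 120.0 ms each with 8 in flight (assumed)
print(total_requests(4000, 10))         # 40000 requests over a 10-minute mapping
```

If the requests were issued one after another, ~4K per minute would correspond to about 15 ms per request, which would be consistent with the run being bound by per-request round-trip time rather than by CPU, memory or bandwidth. This is only an estimate, not a measurement.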

Thank you