S3 folders with extreme amount of files at the top level (not partitioned parquet-style)

Hello. I’m very new to Dremio. I have this directory in S3 with millions of objects in it. AWS only allows you to page through 1000 keys at a time, so indexing it in any system takes a while. Dremio included.

I want to write my own small script that copies data from the flat S3 folder with millions of direct children into a parquet-style partitioned S3 data source folder. If I do this and enable the incremental update strategy on the new data source, will manually triggered data source refreshes be fast? If so, what kind of latencies can I expect? I'm looking for general advice in this area, not necessarily specific numbers.
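Not from the original thread, but a rough sketch of what such a one-off repartitioning script could look like, assuming each flat key embeds a date (e.g. `events-2023-06-01-xyz.parquet`) — the key pattern, the `dt=` partition name, and the `partitioned/` prefix are all hypothetical:

```python
import re

# Hypothetical assumption: the flat keys contain a YYYY-MM-DD date.
DATE_RE = re.compile(r"(\d{4})-(\d{2})-(\d{2})")

def partitioned_key(flat_key: str, prefix: str = "partitioned") -> str:
    """Map a flat key to a Hive-style dt=YYYY-MM-DD partition path."""
    m = DATE_RE.search(flat_key)
    if m is None:
        return f"{prefix}/dt=unknown/{flat_key}"
    y, mo, d = m.groups()
    return f"{prefix}/dt={y}-{mo}-{d}/{flat_key}"

def repartition(bucket: str, src_prefix: str) -> None:
    """Server-side copy into the partitioned layout.

    The paginator pages through at most 1000 keys per request,
    which is the AWS limit mentioned above.
    """
    import boto3  # only needed when actually talking to S3
    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=src_prefix):
        for obj in page.get("Contents", []):
            key = obj["Key"]
            s3.copy_object(
                Bucket=bucket,
                Key=partitioned_key(key),
                CopySource={"Bucket": bucket, "Key": key},
            )
```

`copy_object` keeps the copy server-side, so the script never downloads the data; for objects over 5 GB you would need a multipart copy instead.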

Thanks in advance.

Is there a preferred way to partition data and save new data to S3 in terms of efficiency for dataset refreshes?


The partition key depends on two things:

  • The most common filter conditions in your queries, so Dremio can effectively use partition pruning
  • The cardinality of the column: not so high that you end up with too many files, and not so low that you miss out on parallelism (# of splits)
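To make the cardinality trade-off concrete, here is a toy back-of-the-envelope check (the numbers and function names are purely illustrative, not anything Dremio computes):

```python
def files_per_load(cardinality: int, files_per_partition: int = 1) -> int:
    """Each distinct partition value yields at least one file per load."""
    return cardinality * files_per_partition

def rows_per_file(n_rows: int, cardinality: int, files_per_partition: int = 1) -> float:
    """Average rows landing in each output file."""
    return n_rows / files_per_load(cardinality, files_per_partition)

# 100M rows partitioned by day over ~3 years: ~1095 files of ~91k rows each.
# The same 100M rows partitioned by a 5M-value user_id column would produce
# millions of tiny files -- exactly the small-file problem described below.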

If the column the FILTER is applied to is not a partition column, then Dremio does a filter pushdown instead of pruning. Currently, Dremio only pushes down one filter.

It is also very important that the ETL job does not create too many small files. In S3, one row group is one split, and splits drive parallelism. If there are too many splits, the overhead of opening files, reading footers, and closing files adds up. Conversely, too few splits can leave the query under-parallelized.
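One common remedy for the small-file problem is a periodic compaction pass that coalesces small files into fewer, larger ones. A minimal planning helper might look like this (stdlib-only sketch; the 128 MB target and greedy strategy are illustrative assumptions, not a Dremio recommendation):

```python
def plan_compaction(file_sizes: dict[str, int],
                    target_bytes: int = 128 * 1024 * 1024) -> list[list[str]]:
    """Greedily group files into batches of roughly target_bytes each,
    so each rewritten output file holds a healthy row group
    instead of a sliver."""
    batches: list[list[str]] = []
    current: list[str] = []
    current_size = 0
    for name, size in sorted(file_sizes.items()):
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each batch would then be read, concatenated, and rewritten as a single Parquet file, leaving the split count in the healthy middle ground described above.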

Let us know if you have any further questions