S3 folders with extreme amount of files at the top level (not partitioned parquet-style)

lopatron · November 2, 2020, 11:13pm

Hello. I’m very new to Dremio. I have this directory in S3 with millions of objects in it. AWS only allows you to page through 1000 keys at a time, so indexing it in any system takes a while. Dremio included.

I want to write my own small script that copies data from the flat S3 folder, which has millions of direct children, into a parquet-style partitioned S3 data source folder. If I do this and enable the incremental update strategy on the new data source, will manually triggered data source refreshes be fast? If so, what kind of latencies can I expect? I'm looking for general advice in this area, not necessarily specific approximations.

Thanks in advance.

Edit:
Is there a preferred way to partition data and save new data to S3 in terms of efficiency for dataset refreshes?

balaji.ramaswamy · November 3, 2020, 7:37am

@lopatron

The partition key depends on 2 things:

1. The most common filter conditions in your queries, so we can effectively use partition pruning
2. The cardinality of the column: we do not want a very high cardinality, which would leave us with too many files, nor too low a cardinality, which would make us miss out on parallelism (# of splits)
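As a rough illustration of the two points above, here is a minimal Python sketch that maps a flat key to a date-partitioned key and counts how many files would land in each partition. The date-based key, the Hive-style `year=/month=/day=` naming, and the function names are all assumptions made for the example, not something prescribed in this thread; how Dremio surfaces directory levels as columns should be checked against its documentation.

```python
from collections import Counter
from datetime import datetime


def partition_prefix(event_time: datetime) -> str:
    """Build a date-based prefix such as 'year=2020/month=11/day=02' (assumed layout)."""
    return f"year={event_time:%Y}/month={event_time:%m}/day={event_time:%d}"


def partitioned_key(dest_root: str, flat_key: str, event_time: datetime) -> str:
    """Map a key from the flat folder to a key under the partitioned destination."""
    file_name = flat_key.rsplit("/", 1)[-1]
    return f"{dest_root.rstrip('/')}/{partition_prefix(event_time)}/{file_name}"


def files_per_partition(keys_with_times):
    """Count files per partition to spot a key whose cardinality is too high or too low."""
    return Counter(partition_prefix(t) for _, t in keys_with_times)
```

Running `files_per_partition` over a sample of the existing keys gives a quick feel for the trade-off: many partitions holding one or two files each suggests a coarser key (for example month instead of day), while a single huge partition suggests a finer one.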

If the column on which the FILTER is applied is a non-partitioned column, then Dremio would do a filter pushdown instead of pruning. Currently filter pushdown, Dremio only does one

It is also very important that the ETL job does not create too many small files. In S3, one row group is a split, and a split drives parallelism. If there are too many splits, it can cause overhead in opening files, reading footers and closing files. Conversely, too few splits can cause the query to be under-parallelized.
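One way a copy script could avoid producing many small files is to group the source objects into batches before rewriting each batch as a single output file. The sketch below is only a planning step under stated assumptions: the `plan_batches` name and the 256 MB default target are hypothetical choices for illustration, and the input sizes are source object sizes, so the resulting Parquet files may come out smaller or larger.

```python
from typing import Iterable, List, Tuple


def plan_batches(
    objects: Iterable[Tuple[str, int]],
    target_bytes: int = 256 * 1024 * 1024,  # assumed target, tune for your data
) -> List[List[str]]:
    """Greedily group (key, size_in_bytes) pairs into batches of roughly target_bytes.

    Each batch is meant to be rewritten as one output file, so the number of
    batches approximates the number of files the ETL job would create.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for key, size in objects:
        # Close the current batch if adding this object would exceed the target.
        if current and current_size + size > target_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

An object larger than the target simply ends up in a batch of its own. Applying this per partition keeps the file count low without mixing data from different partitions.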

Let us know if you have any further questions

Thanks
Bali
