How many partitions does Dremio support, and how long does a refresh take?

Hi,

Good day! We are planning to implement a query engine on top of our Azure data lake, and we are currently comparing Starburst Presto with Dremio.

I have a question about the number of partitions Dremio supports for a table. From the documentation, I understand that the maximum number of partitions/splits supported in Dremio is 60K. Please correct me if I am wrong.

Also, how long does it take to refresh a table whenever a new partition is added?

For our current use case, we have several tables, and a maximum of 100 partitions (on a job_id column) may be created for each table per day. We have to maintain around 7 years of data, so the number of partitions would be around 250K. I am afraid Dremio may not support this.
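The 250K figure follows from the numbers above; a quick back-of-the-envelope check (a sketch, assuming the 100-partition maximum is hit every day of the year):

```python
# Rough partition-count estimate from the use case described above.
partitions_per_day = 100   # max partitions per table per day
years = 7
total_partitions = partitions_per_day * 365 * years
print(total_partitions)    # 255500, i.e. roughly 250K per table
```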

And how much time does Dremio take to refresh a table when a new partition is added and there are already many partitions (each partition may hold 10 MB - 1 GB of data)?

Below is the high-level folder structure that we are currently planning to use.

output
|_package_1
| |_table1
| | |_job_id=1
| | | |_part_000.parquet
| | | |_part_001.parquet
| | |_job_id=2
| | | |_part_000.parquet
| | | |_part_001.parquet
| |
| |_table2
| | |_job_id=1
| | | |_part_000.parquet
| | | |_part_001.parquet
| | |_job_id=2
| | | |_part_000.parquet
| | | |_part_001.parquet
| |
| |_table3
| | |_job_id=1
| | | |_part_000.parquet
| | | |_part_001.parquet
| | |_job_id=2
| | | |_part_000.parquet
| | | |_part_001.parquet
|
|_package_2
| |_table1
| | |_job_id=3
| | | |_part_000.parquet
| | | |_part_001.parquet
| | |_job_id=4
| | | |_part_000.parquet
| | | |_part_001.parquet
| |
| |_table2
| | |_job_id=3
| | | |_part_000.parquet
| | | |_part_001.parquet
| | |_job_id=4
| | | |_part_000.parquet
| | | |_part_001.parquet
| |
| |_table3
| | |_job_id=3
| | | |_part_000.parquet
| | | |_part_001.parquet
| | |_job_id=4
| | | |_part_000.parquet
| | | |_part_001.parquet

Thanks

@snelaturu

The number of splits is the number of row groups, in the case of Parquet on a file-system source (Azure in your case). As such, we do not have a limit on the number of partitions, but as the number of partitions increases, the number of splits goes up. It would be good to size the row groups appropriately (usually 128 MB/256 MB).
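As a rough illustration of how splits scale with partition count and row group size (a sketch using the 128 MB target and the 10 MB - 1 GB partition sizes mentioned in this thread; the helper function is hypothetical):

```python
import math

def estimated_splits(num_partitions, partition_size_mb, rowgroup_size_mb=128):
    """Rough estimate: one split per Parquet row group."""
    row_groups_per_partition = math.ceil(partition_size_mb / rowgroup_size_mb)
    return num_partitions * row_groups_per_partition

# 250K partitions of ~10 MB each: one row group per partition
print(estimated_splits(250_000, 10))     # 250000
# 250K partitions of ~1 GB each with 128 MB row groups
print(estimated_splits(250_000, 1024))   # 2000000
```

So small partitions keep the split count near the partition count, while large partitions multiply it by the row groups per file.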

Dremio copies data only if you are using reflections. Otherwise, Dremio only needs to refresh metadata, which should be quick unless the files are too small and too numerous. Currently, Dremio does not incrementally refresh metadata for only the new partitions; it refreshes everything. Later this year, as part of DMP 2.0, there should be incremental refresh.

Also, it is important that you turn on C3 (Columnar Cloud Cache) so we do not have I/O wait times on Azure.

Below are some useful links

https://docs.dremio.com/deployment/cloud-cache-config.html
https://docs.dremio.com/advanced-administration/metadata-caching.html
https://docs.dremio.com/acceleration/creating-reflections.html

The white papers on Reflections and the Semantic Layer are also useful.


Hi @balaji.ramaswamy, thanks for the response. Here is what I understand from your reply; please correct me if I am wrong.

  1. "The number of splits is the number of row groups" – sorry, I didn't get this. Does it mean the number of Parquet files under a partition?

  2. There is no limit on the number of partitions in a table.

  3. There is no incremental support, so when we refresh metadata after adding new partitions, the entire metadata is refreshed. Will there be a performance issue? Our use case is real-time: we have to refresh metadata immediately whenever a new partition is created in Azure. Would this impact performance?

Thanks
Subba

#1 It is the number of Parquet row groups; check your ETL job to see the row group size it writes.
#3 If you want real-time querying using Dremio, we are coming up with a feature later this year using Apache Iceberg, but until then you can only refresh the dataset that has new partitions:

http://docs.dremio.com/sql-reference/sql-commands/datasets.html#forgetting-physical-dataset-metadata
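The per-dataset refresh mentioned above can be issued with Dremio's `ALTER PDS … REFRESH METADATA` SQL command. A minimal sketch for composing that statement, assuming a hypothetical source name (`azure_source`) in front of the folder layout from this thread:

```python
def refresh_metadata_sql(*path_components):
    """Build an ALTER PDS ... REFRESH METADATA statement from a dataset path."""
    # Quote each component: source, folders, table name.
    qualified = ".".join(f'"{c}"' for c in path_components)
    return f"ALTER PDS {qualified} REFRESH METADATA"

# "azure_source" is an assumed source name, not from the thread.
print(refresh_metadata_sql("azure_source", "output", "package_1", "table1"))
# ALTER PDS "azure_source"."output"."package_1"."table1" REFRESH METADATA
```

You would submit the resulting statement to Dremio (e.g. from the SQL editor) for each dataset that received new partitions, rather than refreshing the whole source.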