Dremio Cluster Capacity Planning

mahejava · January 10, 2021, 5:34pm

Hi,
Could you please provide some guidance/documents related to Dremio executor/secondary coordinator capacity planning based on query load for on Perm and cloud.

What are the important parameters we will have consider for Dremio cluster capacity.

Thanks in advance.
Mahendran

balaji.ramaswamy · January 11, 2021, 2:52am

@mahejava

There are couple important things to consider

Performance
Capacity planning to achieve the desired performance

We need to do the above for

Metadata stored in RocksDB (KVStore)
Coordinator for query planning
Executors for query execution including reflections

Let us start with Metadata, there are some best practices here

What should be the Metadata refresh settings? This is very important configuration as it can have implications on data freshness SLA, size of RocksDB and performance of the master coordinator. We should set refresh every based on the data freshness frequency. For example if the ETL adds data every 8 hours there is no use of refreshing every hour (which is the default), or if the dashboard is only run two times then we probably need to only refresh 10 hours once. The more we refresh the more copies in RocksDB and bigger the size of the RocksDB is going to be
Do we need to scale out coordinators? This purely depends on concurrency. If you start seeing high command pool wait times but very less planning times then scale coordinator will help but if you see high planning times along with high command pool wait times then we need to investigate of the high planning times is a cause or symptom
For execution, if your query does not meet the SLA or you think should have run faster then the job profile has all the answers. If you open the profile and all the time was in Sleep time (under threads overview) but shown as “Waiting” under phase metrics. If only a few threads show high Waiting time then validate if they are all on the same host and check host if they are evenly distributed then it is probably concurrency. If time is spent on process, then it is either the data or the server is busy and processing can be affected. If all the time is on Wait time (in operators overview) then it is IO. Once we are able to identify at this level we can drill down in detail

Please get back to us once you have done the first round for further questions

Thanks
Bali

Topic		Replies	Views
How to scale up dremio cluster?	4	594	September 5, 2023
Metadata management bottleneck	3	990	April 9, 2021
How to improve query response times?	0	1172	July 15, 2020
Query was cancelled planning time exceeded 60 seconds	3	1999	June 7, 2018
AutoScaling dremio	3	1718	February 3, 2022

Dremio Cluster Capacity Planning

Related topics