Dremio Cluster Capacity Planning

Could you please provide some guidance or documents on Dremio executor and secondary coordinator capacity planning, based on query load, for on-prem and cloud deployments?

What are the important parameters we have to consider for Dremio cluster capacity?

Thanks in advance.


There are a couple of important things to consider:

  • Performance
  • Capacity planning to achieve the desired performance

We need to do the above for each of the following:

  • Metadata stored in RocksDB (KVStore)
  • Coordinator for query planning
  • Executors for query execution including reflections

Let us start with metadata and then move on to coordinators and executors. Here are some best practices:

  • What should the metadata refresh settings be? This is a very important configuration, as it affects the data freshness SLA, the size of RocksDB, and the performance of the master coordinator. Set the refresh interval based on how often the data actually changes. For example, if the ETL adds data every 8 hours, there is no point in refreshing every hour (which is the default), and if the dashboard is run only twice a day, refreshing once every 10 hours is probably enough. The more often we refresh, the more copies are kept in RocksDB and the bigger RocksDB is going to be.
  • Do we need to scale out coordinators? This depends purely on concurrency. If you start seeing high command pool wait times but low planning times, then scaling out coordinators will help. If you see high planning times along with high command pool wait times, then we first need to investigate whether the high planning times are a cause or a symptom.
  • For execution, if your query does not meet the SLA, or you think it should have run faster, then the job profile has all the answers. Open the profile and see where the time went. If most of the time was in Sleep time (under threads overview; it is shown as “Waiting” under phase metrics), look at how it is distributed. If only a few threads show high Waiting time, validate whether they are all on the same host and, if so, check that host. If the Waiting time is evenly distributed, then it is probably concurrency. If the time is spent on process, then it is either the data or a busy server that is affecting processing. If all the time is in Wait time (in operators overview), then it is IO. Once we are able to identify the problem at this level, we can drill down in detail.
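
As a rough illustration only, the coordinator and execution rules of thumb above could be written down as below. This is a sketch in Python and not a Dremio API: the function names, the ThreadStat record, and every threshold are made up for the example, and the numbers would have to be read off the job profile by hand. Folding Sleep, process, and Wait time into one per-thread record is also a simplification, since the profile shows them in different sections.

```python
from dataclasses import dataclass

# All thresholds are illustrative assumptions, not Dremio defaults.
HIGH_POOL_WAIT_MS = 1_000   # what counts as a "high" command pool wait
HIGH_PLANNING_MS = 5_000    # what counts as a "high" planning time
DOMINANT_SHARE = 0.5        # share of total time that counts as "most of the time"
SKEW_FACTOR = 2.0           # a thread is an outlier above this multiple of the mean


def coordinator_advice(pool_wait_ms: float, planning_ms: float) -> str:
    """Command pool wait vs. planning time rule of thumb."""
    if pool_wait_ms < HIGH_POOL_WAIT_MS:
        return "no coordinator pressure seen"
    if planning_ms < HIGH_PLANNING_MS:
        return "scale out coordinators"
    return "investigate planning time first"


@dataclass
class ThreadStat:
    """Hypothetical per-thread numbers copied from a job profile."""
    host: str
    sleep_ms: float    # Sleep time (shown as "Waiting" under phase metrics)
    process_ms: float  # time spent processing
    wait_ms: float     # Wait time from the operators overview


def execution_advice(threads: list[ThreadStat]) -> str:
    """Classify where a slow query spent its time."""
    sleep = sum(t.sleep_ms for t in threads)
    process = sum(t.process_ms for t in threads)
    wait = sum(t.wait_ms for t in threads)
    total = sleep + process + wait
    if total == 0:
        return "no timing data"
    if wait / total >= DOMINANT_SHARE:
        return "IO"
    if process / total >= DOMINANT_SHARE:
        return "data or busy server"
    if sleep / total >= DOMINANT_SHARE:
        mean_sleep = sleep / len(threads)
        outliers = [t for t in threads if t.sleep_ms > SKEW_FACTOR * mean_sleep]
        hosts = {t.host for t in outliers}
        if len(hosts) == 1:
            # A few threads, all on one host, carry the waiting.
            return f"check host {hosts.pop()}"
        return "probably concurrency"
    return "no single dominant cause"
```

For example, execution_advice applied to a set of threads whose Sleep time is roughly equal on every host comes back as "probably concurrency", while one thread with far more Sleep time than the rest points at that thread's host.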

Please get back to us with any further questions once you have done this first round.