Dremio infrastructure sizing recommendations

I know the standard answer to this question/request is ‘it depends’, but ‘it depends’ can’t be the answer to everything (that would be 42), am I right?

Minimum hardware for executors:

  • 4 CPU cores (16 cores recommended)
  • 16GB RAM (64GB recommended)

It would be good to get a basic understanding of the main drivers and aspects of sizing, e.g.:

  • many smaller nodes perform better (or worse) than a few large nodes in scenarios like this…
  • large datasets benefit from this…
  • many parallel users benefit from this…
  • last mile ETL scenarios with complex aggregative queries benefit from…
  • use SSD over HDD to gain this…
  • Dremio can benefit from GPUs under the condition…
  • etc.

It would be even better to have an online sizing assistant where I can describe my typical workloads and get a recommendation. This would also help me get an idea of the infrastructure costs to expect with Dremio.

Example: Let’s assume I have 50TB of raw data in 20000+ parquet files over 200 tables (all in ADLS) and up to 100 parallel users to serve (3000 in total) doing slice and dice via Tableau and PowerBI.

Where will I then end up? 5, 20, or 100 executor nodes? And what size should these nodes be? 16 cores and 64GB each (as you recommend) and I’m done? Do I gain from SSDs?

The answer can’t just be ‘it depends’, right?

Many thanks in advance…


These are great questions. Unfortunately, it depends (sorry) and I think something like a calculator that spits out a specific config is not imminent at this time. If you have one you think is good for another data infrastructure product, I would love to see it - perhaps it will help us get closer, sooner.

Meanwhile, we are planning to update our Deployment Considerations Guide this month in anticipation of Dremio 3.0. I think we could add a number of suggestions and best practices that would help users narrow the range of possibilities.

A few things to consider:

  • More aggregate RAM is better.
  • If your queries have large joins/aggregations, these will need to fit in memory and you should size appropriately. Similarly, high concurrency will make more use of memory.
  • For a given amount of aggregate memory, in many cases a smaller number of nodes will be better. 1TB spread across 100 nodes is not as efficient as 1TB spread across 10 nodes; the latter incurs less shuffling of intermediate aggregations.
  • Coordinator nodes make use of heap, executor nodes manage memory off heap to avoid GC issues.
  • For data reflections, faster storage sub-systems matter. Seek times can vary significantly between systems (DAS, HDFS, S3, etc.). It is probably a good exercise to test different configurations to see how sensitive your workloads are to this aspect of your deployment. That said, the costs may not justify the benefits. Choice is good (and a tyrant)!
  • Dremio does not currently support GPU. We hope to one day.
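To make the memory and concurrency points above concrete, here is a rough back-of-envelope sketch in Python. This is not an official Dremio sizing formula; the constants (memory per concurrent query, queries sustained per core, and the node shape from the recommended minimums) are illustrative assumptions only, meant to show how the two constraints interact.

```python
# Illustrative sizing heuristic -- NOT an official Dremio formula.
# All constants are assumptions chosen to demonstrate the reasoning:
# sizing is driven by both memory (joins/aggregations must fit) and
# CPU (concurrent queries per core).

import math

def estimate_executors(concurrent_users: int,
                       cores_per_node: int = 16,
                       ram_gb_per_node: int = 64,
                       gb_ram_per_concurrent_query: int = 8,
                       queries_per_core: float = 0.5) -> int:
    """Return a ballpark executor-node count for a BI-style workload."""
    # Memory-driven estimate: each concurrent query needs working memory
    # for its joins/aggregations (the "fit in memory" point above).
    nodes_for_memory = math.ceil(
        concurrent_users * gb_ram_per_concurrent_query / ram_gb_per_node)
    # CPU-driven estimate: assume each core sustains a fraction of a
    # query at a time under interactive slice-and-dice load.
    nodes_for_cpu = math.ceil(
        concurrent_users / (cores_per_node * queries_per_core))
    # The binding constraint (the larger of the two) wins.
    return max(nodes_for_memory, nodes_for_cpu)

# Applied to the 100-concurrent-user example from the question:
print(estimate_executors(concurrent_users=100))  # -> 13 under these assumptions
```

Under these assumed constants, the memory constraint (100 × 8GB / 64GB ≈ 13 nodes) and the CPU constraint (100 / (16 × 0.5) ≈ 13 nodes) happen to coincide; with different workload assumptions one or the other would dominate, which is exactly why the real answer depends on measuring your own queries.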

We will keep your questions here in mind as we think of additional recommendations for our deployments guide. If you have other things you think we should consider, please post back here and we’ll keep listening. 🙂

hi Kelly,

Many thanks for these insights. They will help us a lot going forward.


Hi Kelly. Is there an anticipated release window for 3.0?

Not a specific date but within the next six weeks or so.