We’re evaluating dremio on a medium sized dataset (40gb), below are some more details on this setup.
Machine config: 1 coordinator (16gb 4cpu), 2 worker nodes (64gb 16cpu each), 600GB ssd each.
Data Source: 40gb csv file residing in a S3 bucket
Reflections: Raw reflection + aggregation reflection (on pdfs)
Worker node memory config: heap=10GB, direct=48gb
Memory config in admin panel: 15GB for large queries and 5gb for small queries (approx)
For this performance check we’ve devised a range of aggregation queries, query combinations include grouping on 1,2,3 dimension + a date, operation on a single measure, with and without any dimension/date filters
captured execution times:
ranging from 2sec - 1min, when queries are run sequencially
ranging from 2sec - 4mins when run parallelly
We have a few questions:
- Are we missing any crucial setting which is degrading the overall query performance?
- When we run the same query multiple times, the subsequent query executions take exactly same amount of time, shouldn’t the Cloud Cache kickin after a few runs? (i checked and it is enabled in the source settings though)
- What could be a reasonable number in terms of query execution time on this setup.
Attaching a few profiles of slow running queries for your reference.