How to improve query response times?

Hi Everyone,

We’re evaluating dremio on a medium sized dataset (40gb), below are some more details on this setup.

Machine config: 1 coordinator (16gb 4cpu), 2 worker nodes (64gb 16cpu each), 600GB ssd each.
Data Source: 40gb csv file residing in a S3 bucket
Reflections: Raw reflection + aggregation reflection (on pdfs)
Worker node memory config: heap=10GB, direct=48gb
Memory config in admin panel: 15GB for large queries and 5gb for small queries (approx)

For this performance check we’ve devised a range of aggregation queries, query combinations include grouping on 1,2,3 dimension + a date, operation on a single measure, with and without any dimension/date filters

captured execution times:
ranging from 2sec - 1min, when queries are run sequencially
ranging from 2sec - 4mins when run parallelly

We have a few questions:

  • Are we missing any crucial setting which is degrading the overall query performance?
  • When we run the same query multiple times, the subsequent query executions take exactly same amount of time, shouldn’t the Cloud Cache kickin after a few runs? (i checked and it is enabled in the source settings though)
  • What could be a reasonable number in terms of query execution time on this setup.

Attaching a few profiles of slow running queries for your reference.

3m3s.zip (34.7 KB) 9m24s.zip (51.9 KB) 51s.zip (30.5 KB) 1m-47s.zip (33.3 KB)

Thanks,
Teja