Disk Space Utilisation and Query Execution Time

Hi,

I am using the community edition of Dremio in a local-mode installation against a Hive data source of around 20 GB, and I am facing the following challenges:
1- Query execution time is longer than expected.
2- When the same query is accelerated, it takes more time than the normal (un-accelerated) run.
3- I created an HDFS directory for the Dremio user; it built up a user cache of 75 GB, causing a shortage of disk space.
4- Eight days after installing Dremio, /var/lib/dremio/pdfs/accelerator/ has grown to around 112 GB, again causing a shortage of disk space.
5- RAM utilisation of the machine was at capacity (approx. 15 GB); after I restarted the Dremio service it dropped to around 6 GB.

Can anyone please share best practices for the above and explain why Dremio is generating so much data against a Hive dataset of only 20 GB?

Hey @Zubair, could you please share:

  • Reflection Refresh Policy for this Hive data source.
  • Counts of reflections of each type (raw and agg) that are enabled in the system (a query sketch for pulling these counts follows below).
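
If it's easier, you can usually pull those counts straight from Dremio's sys.reflections system table. A minimal sketch, assuming your version exposes a "type" column there (check with SELECT * FROM sys.reflections first):

```sql
-- Count enabled reflections by type (raw vs. aggregation).
-- Assumption: your Dremio version's sys.reflections table has a "type" column;
-- run SELECT * FROM sys.reflections first to confirm the exact column names.
SELECT "type",
       COUNT(*) AS reflection_count
FROM   sys.reflections
GROUP BY "type";
```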

Some thoughts:

1- Query execution time is longer than expected.
What are you comparing this against? What system, and with what resources? It would also help to know what dataset volume (and query complexity) you are working with, and what resources you gave Dremio.

Typically this happens when Dremio is under-provisioned for your dataset and workloads.

2- When the same query is accelerated, it takes more time than the normal (un-accelerated) run.
Can’t help much without a query profile. Could you share one?

3- I created an HDFS directory for the Dremio user; it built up a user cache of 75 GB, causing a shortage of disk space.
4- Eight days after installing Dremio, /var/lib/dremio/pdfs/accelerator/ has grown to around 112 GB, again causing a shortage of disk space.
Typically our customers provide Dremio with hundreds of GBs to TBs of reflection storage capacity. How much you need depends on your refresh policy, reflection count (mostly raw reflections), and dataset size/complexity.
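
If you want to see which reflections are actually eating the space under pdfs/accelerator/, the sys.materializations table is worth a look. A rough sketch, assuming your version exposes per-materialization size and name columns (names vary between releases, so do a SELECT * on both tables first):

```sql
-- Rough sketch: approximate footprint per reflection, largest first.
-- Assumptions: sys.materializations has "bytes" and "reflection_id" columns,
-- and sys.reflections has "reflection_id" and "name"; column names differ
-- slightly across Dremio versions, so verify with SELECT * before relying on this.
SELECT r."name"        AS reflection_name,
       SUM(m."bytes")  AS total_bytes
FROM   sys.materializations m
JOIN   sys.reflections      r
  ON   r.reflection_id = m.reflection_id
GROUP BY r."name"
ORDER BY total_bytes DESC;
```

Also note that the reflection store location is controlled by paths.dist in dremio.conf, so if the local disk is tight you can point it at a volume (or HDFS path) with more headroom.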

5- RAM utilisation of the machine was at capacity (approx. 15 GB); after I restarted the Dremio service it dropped to around 6 GB.
Sounds like a job was running and it was killed when you restarted? Hard to tell without more details; a number of things might be going on. It also sounds like this install might be under-provisioned.
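
Rather than guessing from the OS numbers, you can also ask Dremio what it thinks it is using. A sketch, assuming your version has the sys.memory system table (column names may differ slightly):

```sql
-- Sketch: heap and direct memory per node as reported by Dremio itself.
-- Assumption: a sys.memory system table with these columns exists in your
-- version; confirm with SELECT * FROM sys.memory.
SELECT hostname,
       heap_current,
       heap_max,
       direct_current,
       direct_max
FROM   sys.memory;
```

The limits themselves are set in dremio-env (DREMIO_MAX_HEAP_MEMORY_SIZE_MB and DREMIO_MAX_DIRECT_MEMORY_SIZE_MB), so that is the place to adjust if the box genuinely has more RAM to give Dremio.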

Can anyone please share best practices for the above and explain why Dremio is generating so much data against a Hive dataset of only 20 GB?
It's a combination of: 1) the reflection refresh policy, 2) the raw reflection count, 3) aggregation reflection cardinality and count, and 4) the dataset size.
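
One practical follow-on: once you know which reflections you don't need (usually large raw reflections), disabling or dropping them and letting the old materializations expire is the fastest way to win the disk back. Newer Dremio releases also expose reflection DDL through SQL; if yours does, a hypothetical cleanup could look roughly like this (the dataset and reflection names below are made up for illustration):

```sql
-- Hypothetical sketch only: drop an unneeded raw reflection by name.
-- Reflection DDL via SQL is only available in newer Dremio versions (and the
-- keyword is ALTER DATASET or ALTER TABLE depending on the release); otherwise
-- use the dataset's Reflections tab in the UI. Names here are illustrative.
ALTER DATASET hive.sales.orders
  DROP REFLECTION "orders_raw";
```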