Disk Space Utilisation and Query Execution Time

Hi,

I am using the community edition of Dremio in a local-mode installation against a Hive data source of around 20 GB, and I am facing the following challenges:
1- Query execution time is longer than expected.
2- When the same query is accelerated, it takes more time than the normal (un-accelerated) run.
3- I created an HDFS directory for the Dremio user; it built up a user cache of 75 GB, causing a shortage of disk space.
4- Eight days after installing Dremio, /var/lib/dremio/pdfs/accelerator/ has grown to around 112 GB, again causing a shortage of disk space.
5- RAM utilisation of the machine was at capacity (approx. 15 GB); after I restarted the Dremio service it dropped to around 6 GB.

Can anyone please share best practices for the above and explain why Dremio is generating so much data against a Hive dataset of only 20 GB?

Hey @Zubair, could you please share:

  • Reflection Refresh Policy for this Hive data source.
  • Counts of reflections of each type (raw and agg) that are enabled in the system (a query sketch for pulling these counts follows below).
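
If it's easier, you can usually pull those counts straight from Dremio's sys.reflections system table. A minimal sketch, assuming your version exposes a "type" column there (check with SELECT * FROM sys.reflections first):

```sql
-- Count enabled reflections by type (raw vs. aggregation).
-- Assumption: your Dremio version's sys.reflections table has a "type" column;
-- run SELECT * FROM sys.reflections first to confirm the exact column names.
SELECT "type",
       COUNT(*) AS reflection_count
FROM   sys.reflections
GROUP BY "type";
```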

Some thoughts:

1- Query execution time is longer than expected.
What are you comparing this against? What system, and with what resources? It would also help to know what dataset volume (and query complexity) you are working with, and what resources you gave Dremio.

Typically this happens when Dremio is under-provisioned for your dataset and workloads.

2- When the same query is accelerated, it takes more time than the normal (un-accelerated) run.
Can’t help much without a query profile. Could you share one?

3- I created an HDFS directory for the Dremio user; it built up a user cache of 75 GB, causing a shortage of disk space.
4- Eight days after installing Dremio, /var/lib/dremio/pdfs/accelerator/ has grown to around 112 GB, again causing a shortage of disk space.
Typically our customers provide Dremio with hundreds of GBs to TBs of reflection storage capacity. How much you need depends on your refresh policy, reflection count (mostly raw reflections), and dataset size/complexity.
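
If you want to see which reflections are actually eating the space under pdfs/accelerator/, the sys.materializations table is worth a look. A rough sketch, assuming your version exposes per-materialization size and name columns (names vary between releases, so do a SELECT * on both tables first):

```sql
-- Rough sketch: approximate footprint per reflection, largest first.
-- Assumptions: sys.materializations has "bytes" and "reflection_id" columns,
-- and sys.reflections has "reflection_id" and "name"; column names differ
-- slightly across Dremio versions, so verify with SELECT * before relying on this.
SELECT r."name"        AS reflection_name,
       SUM(m."bytes")  AS total_bytes
FROM   sys.materializations m
JOIN   sys.reflections      r
  ON   r.reflection_id = m.reflection_id
GROUP BY r."name"
ORDER BY total_bytes DESC;
```

Also note that the reflection store location is controlled by paths.dist in dremio.conf, so if the local disk is tight you can point it at a volume (or HDFS path) with more headroom.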

5- RAM utilisation of the machine was at capacity (approx. 15 GB); after I restarted the Dremio service it dropped to around 6 GB.
Sounds like a job was running and it was killed when you restarted? Hard to tell without more details; a number of things might be going on. It also sounds like this install might be under-provisioned.
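
Rather than guessing from the OS numbers, you can also ask Dremio what it thinks it is using. A sketch, assuming your version has the sys.memory system table (column names may differ slightly):

```sql
-- Sketch: heap and direct memory per node as reported by Dremio itself.
-- Assumption: a sys.memory system table with these columns exists in your
-- version; confirm with SELECT * FROM sys.memory.
SELECT hostname,
       heap_current,
       heap_max,
       direct_current,
       direct_max
FROM   sys.memory;
```

The limits themselves are set in dremio-env (DREMIO_MAX_HEAP_MEMORY_SIZE_MB and DREMIO_MAX_DIRECT_MEMORY_SIZE_MB), so that is the place to adjust if the box genuinely has more RAM to give Dremio.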

Can anyone please share best practices for the above and explain why Dremio is generating so much data against a Hive dataset of only 20 GB?
It's a combination of: 1) the reflection refresh policy, 2) the raw reflection count, 3) aggregation reflection cardinality and count, and 4) the dataset size.
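
One practical follow-on: once you know which reflections you don't need (usually large raw reflections), disabling or dropping them and letting the old materializations expire is the fastest way to win the disk back. Newer Dremio releases also expose reflection DDL through SQL; if yours does, a hypothetical cleanup could look roughly like this (the dataset and reflection names below are made up for illustration):

```sql
-- Hypothetical sketch only: drop an unneeded raw reflection by name.
-- Reflection DDL via SQL is only available in newer Dremio versions (and the
-- keyword is ALTER DATASET or ALTER TABLE depending on the release); otherwise
-- use the dataset's Reflections tab in the UI. Names here are illustrative.
ALTER DATASET hive.sales.orders
  DROP REFLECTION "orders_raw";
```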