Configuring caching in v20.1 -- poor performance

(Sorry, I accidentally posted this in the wrong category just now)

We’ve got a Dremio v17 system in production, which is giving us query results generally in under a second. We’re trying to deploy a v20.1 instance to take over from it, but its query results are taking roughly an order of magnitude longer (10-20s).

Following the instructions, we have

executor.cache.path.db: "/opt/dremio/data/cloudcache"
executor.cache.path.fs: ["/opt/dremio/data/cloudcache"]

set on both v17 and v20.1. However, the instructions also mention a reflection.cloud.cache.enabled setting in the debug section. So far, we haven’t been able to write a valid config that involves this setting – the worker doesn’t come up properly.
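
For reference, the shape we’ve been attempting looks roughly like this (the nesting of reflection.cloud.cache.enabled under the debug block is just our guess from the docs; this is the variant that stops the worker coming up):

    # dremio.conf on the executor -- sketch of the attempt, not a known-good config
    executor.cache.path.db: "/opt/dremio/data/cloudcache"
    executor.cache.path.fs: ["/opt/dremio/data/cloudcache"]

    debug: {
      reflection.cloud.cache.enabled: true
    }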

Our conclusion is that either the option isn’t needed (it doesn’t seem to be on v17) and the caching simply isn’t working on v20.1, or the option is needed in some other form on v20.1 and the instructions are incorrect.

Can anyone suggest why we’re seeing these much slower queries in v20.1? Have we configured the caching correctly? If so, why is the caching not working; if not, what configuration are we missing?

@darkling Send the two profiles, from 17.x and 20.x, and I can help figure out what happened

Hi, Balaji,

I hope these are the right things:

dremio_blue_v20.1_profile.zip (17.0 KB)
dremio-test-v17_profile.zip (17.1 KB)

Hugo.

@darkling your 17.x query executed on 2 nodes, while the 20.x version executed on 1 node. In 17.x you have configured the master as an executor as well
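
If you want the 20.x topology to match, enabling the executor role on the master is a dremio.conf change along these lines (a sketch only; keep the rest of your services block as it already is for your deployment):

    # dremio.conf on the master node (sketch)
    services: {
      coordinator.enabled: true,
      executor.enabled: true
    }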

We’ve reconfigured the master to act as an executor as well (although we hadn’t originally intended that to be the case). Some queries seem to be faster:
dremio20.1_fast_query_e17efd69-3883-4796-b06e-88ceb173f013.zip (18.3 KB)
but others remain slow:
dremio20.1_slow_query_1169708e-bcd1-4082-af06-41497926d70e.zip (18.0 KB)

@darkling

I did not review the fast query, but on the slow one it seems the filter pushdown has a very long string evaluated in Java, causing the ~13s process time.

filters=[[Filter on `Location_URI`: equal(`Location_URI`, 'http://data.agrimetrics.co.uk/grid-5km-bng/H6NSB16du3e3hr1S') , Filter on `Forecast_generation_datetime`: booleanAnd(greater_than_or_equal_to(`Forecast_generation_datetime`, cast( 1624665600000l as TIMESTAMP)) , less_than(`Forecast_generation_datetime`, cast( 1625529600000l as TIMESTAMP)) ) ]])

https://docs.dremio.com/software/advanced-administration/gandiva/
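
As an aside, for anyone reading that filter, the two literals are epoch milliseconds; a quick decode (Python sketch) shows the window is about ten days:

    from datetime import datetime, timezone

    # The two literals in the pushed-down filter are epoch milliseconds.
    for ms in (1624665600000, 1625529600000):
        print(datetime.fromtimestamp(ms / 1000, tz=timezone.utc))
    # 2021-06-26 00:00:00+00:00
    # 2021-07-06 00:00:00+00:00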

The problem we’ve got here is that the exact same query is wildly inconsistent: it can be 14s sometimes and 1s other times, with no pattern that I can see. If it can execute this query quickly sometimes, why can’t it do it all the time? The system is not heavily loaded, and the v17 instance (which shares the same data store) is performing consistently well.

I accept that the long string might also be a performance problem, but given that we’re not (consciously) using Gandiva on either system, and we’re getting acceptable performance on the v17 one, it seems unlikely that this would be a solution to the problem.

(As an aside, I see the text below in the profile, which at least to me implies that we are using Gandiva anyway:

  {
    "dependsOn" : [],
    "inGandiva" : true,
    "namedExpression" : "NamedExpression [expr=ValueExpression[quoted_string=http://data.agrimetrics.co.uk/grid-5km-bng/H6NSB16du3e3hr1S], ref=`location`]",
    "optimize" : true,
    "outputName" : "location"
  },

)

Looking through a side-by-side comparison of the profiles, we see these fragments in the slow one:

   "fragmentProfile" : [
      {
         "majorFragmentId" : 0,
         "minorFragmentProfile" : [
            {
               "blockedDuration" : 12888,
               "blockedOnDownstreamDuration" : 1,
               "blockedOnSharedResourceDuration" : 12887,
               "blockedOnUpstreamDuration" : 0,

and, shortly afterwards:

                  {
                     "category" : 0,
                     "duration" : 12874,
                     "resource" : "UNORDERED_RECV_MEM_BUFFER"
                  },

and these equivalent fragments in the fast one:

   "fragmentProfile" : [
      {
         "majorFragmentId" : 0,
         "minorFragmentProfile" : [
            {
               "blockedDuration" : 297,
               "blockedOnDownstreamDuration" : 4,
               "blockedOnSharedResourceDuration" : 293,
               "blockedOnUpstreamDuration" : 0,

and

                  {
                     "category" : 0,
                     "duration" : 278,
                     "resource" : "UNORDERED_RECV_MEM_BUFFER"
                  },

Everything else seems at least similar in size, and the duration / blockedDuration values for the slow one are around the same time (14s) as the total query time. So, my (possibly naive) conclusion would be that we’ve got an I/O bottleneck somewhere. But why might this show up only on the v20 installation and not the v17 one?
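
In case anyone wants to repeat the comparison, this is roughly how we pulled those numbers out of the two profile JSON files (the file names are placeholders for the JSON inside each downloaded profile zip; only keys visible in the extracts above are used):

    import json

    # Summarise per-fragment blocked time from a downloaded profile JSON.
    def blocked_summary(path):
        with open(path) as f:
            profile = json.load(f)
        for major in profile["fragmentProfile"]:
            for minor in major["minorFragmentProfile"]:
                print(path, "major", major["majorFragmentId"],
                      "blockedDuration", minor["blockedDuration"],
                      "blockedOnSharedResource", minor["blockedOnSharedResourceDuration"])

    for p in ("slow_profile.json", "fast_profile.json"):
        blocked_summary(p)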

@darkling Would you be able to send us the profile for the 1s run?

Here’s the 1s one:

dremio20.1_fast_query_e17efd69-3883-4796-b06e-88ceb173f013.zip (18.3 KB)

@darkling Any chance your master node is more powerful than your executor node?

No, the master is an Azure D4_v3, which is 4 CPUs and 16 GiB of RAM. The executor is a DS4_v2, which is 8 CPUs and 14 GiB of RAM. Again, I don’t think this is likely to be the problem, as we have the same configuration for the v17 setup and we don’t see the slowdown on that.

@darkling Is it possible to run the query one more time and upload the log files from the two nodes along with the job profile?

Here are the requested files. Thank you.

dremio-logs.zip (542.2 KB)
test_v17.zip (15.8 KB)
test_v20.zip

Just a quick note to say that I’ve rotated off looking at this particular issue, and Simon, who posted above, is taking over from me. (He’s fully up to speed on the issue; we’re just handing over who’s driving).

@balaji.ramaswamy Were these profiles helpful?

@simon-agrimetrics I see a big wait time on the Parquet scan, even though all reads are coming from cache (local disk). Are your local disks SSDs?
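
One quick way to check, assuming the cloudcache directory sits on a local Linux block device (just a sketch, run on the executor hosting /opt/dremio/data/cloudcache):

    from pathlib import Path

    # 0 in /sys/block/<dev>/queue/rotational usually means SSD/NVMe, 1 means spinning disk.
    for dev in sorted(Path("/sys/block").iterdir()):
        rota_file = dev / "queue" / "rotational"
        if rota_file.exists():
            flag = rota_file.read_text().strip()
            print(f"{dev.name}: {'SSD/NVMe' if flag == '0' else 'rotational'}")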