Dremio performance on Elasticsearch cluster

Hi
We use Dremio 2.05 on Elasticsearch 5.6.9.
Dremio runs on its own server with 4 CPUs and 8 GB of RAM.
ES is distributed across a 3-server cluster, each server having 4 CPUs and 16 GB of RAM.
All 3 ES nodes are data nodes, and one of them is also the master node.

I built a dataset that takes 12 minutes to run from Dremio.
I retrieved the DSL query (see below) and ran it as a curl command:

  • From the Dremio server itself, it took 3 minutes to return results
  • From the ES master node, it also took 3 minutes to return results

From this I deduce that there is no network problem between the Dremio and ES machines.

During all these tests I monitored resource usage on each ES node, and this is what I observed:
- When querying from the Dremio app, the CPUs are scarcely used during the refresh period
- When running the curl tests, the CPUs are working hard

In my ES source definition I listed the ES master node only. I tried to list all the nodes, but the refresh time is the same.

Can someone help me understand this performance gap when using the Dremio app?
Please don’t start by suggesting reflections: I want to make the Dremio app work properly before going further with features.

Here is the DSL query:

=[{
  "size" : 0,
  "query" : {
    "bool" : {
      "must" : [ {
        "bool" : {
          "should" : [ {
            "match" : {
              "UC" : {
                "query" : "DOMUS-ACADEMY",
                "type" : "boolean"
              }
            }
          }, {
            "match" : {
              "UC" : {
                "query" : "DOMUS-ACADEMY-LANDING",
                "type" : "boolean"
              }
            }
          }, {
            "match" : {
              "UC" : {
                "query" : "NABA-LP",
                "type" : "boolean"
              }
            }
          }, {
            "match" : {
              "UC" : {
                "query" : "NABA",
                "type" : "boolean"
              }
            }
          }, {
            "match" : {
              "UC" : {
                "query" : "ISTITUTO-MARANGONI",
                "type" : "boolean"
              }
            }
          } ]
        }
      }, {
        "range" : {
          "Date" : {
            "from" : "2016-09-06T00:00:00.000Z",
            "to" : null,
            "format" : "date_time",
            "include_lower" : true,
            "include_upper" : true
          }
        }
      } ]
    }
  },
  "aggregations" : {
    "UC" : {
      "terms" : {
        "field" : "UC",
        "missing" : "NULL_STRING_TAG",
        "size" : 2147483647
      },
      "aggregations" : {
        "Year" : {
          "terms" : {
            "field" : "Year",
            "missing" : -2147483648,
            "size" : 2147483647
          },
          "aggregations" : {
            "Month" : {
              "terms" : {
                "field" : "Month",
                "missing" : -2147483648,
                "size" : 2147483647
              },
              "aggregations" : {
                "Monthlabel" : {
                  "terms" : {
                    "field" : "MonthLabel",
                    "missing" : "NULL_STRING_TAG",
                    "size" : 2147483647
                  },
                  "aggregations" : {
                    "week" : {
                      "terms" : {
                        "field" : "Week",
                        "missing" : -2147483648,
                        "size" : 2147483647
                      },
                      "aggregations" : {
                        "continent" : {
                          "terms" : {
                            "field" : "Continent",
                            "missing" : "NULL_STRING_TAG",
                            "size" : 2147483647
                          },
                          "aggregations" : {
                            "country" : {
                              "terms" : {
                                "field" : "Country",
                                "missing" : "NULL_STRING_TAG",
                                "size" : 2147483647
                              },
                              "aggregations" : {
                                "device" : {
                                  "terms" : {
                                    "field" : "Device",
                                    "missing" : "NULL_STRING_TAG",
                                    "size" : 2147483647
                                  },
                                  "aggregations" : {
                                    "Query without accent" : {
                                      "terms" : {
                                        "field" : "Query without accent",
                                        "missing" : "NULL_STRING_TAG",
                                        "size" : 2147483647
                                      },
                                      "aggregations" : {
                                        "Clicks" : {
                                          "sum" : {
                                            "field" : "Clicks"
                                          }
                                        },
                                        "Position" : {
                                          "sum" : {
                                            "field" : "Position"
                                          }
                                        },
                                        "impressions" : {
                                          "sum" : {
                                            "field" : "Impressions"
                                          }
                                        }
                                      }
                                    }
                                  }
                                }
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}]
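
For anyone who wants to replay this query with fewer grouping levels (for example, to see how the response time grows as levels are added), the aggregation part can be rebuilt programmatically. The following is only a minimal sketch, not Dremio's own code: the helper name nested_terms and the constant MAX_BUCKETS are hypothetical, while the field names, "missing" values and "size" are copied from the query above.

```python
import json

# "size" value used by every terms aggregation in the captured query.
MAX_BUCKETS = 2147483647


def nested_terms(levels, metrics):
    """Build terms aggregations nested in the order given by `levels`.

    levels  -- list of (agg_name, field, missing_value), outermost first
    metrics -- dict of agg_name -> field, summed at the innermost level
    """
    # Start with the innermost level: the sum metrics.
    aggs = {name: {"sum": {"field": field}} for name, field in metrics.items()}
    # Wrap from the inside out so the first level ends up outermost.
    for name, field, missing in reversed(levels):
        aggs = {
            name: {
                "terms": {"field": field, "missing": missing, "size": MAX_BUCKETS},
                "aggregations": aggs,
            }
        }
    return aggs


# Grouping levels and metrics exactly as they appear in the query above.
LEVELS = [
    ("UC", "UC", "NULL_STRING_TAG"),
    ("Year", "Year", -2147483648),
    ("Month", "Month", -2147483648),
    ("Monthlabel", "MonthLabel", "NULL_STRING_TAG"),
    ("week", "Week", -2147483648),
    ("continent", "Continent", "NULL_STRING_TAG"),
    ("country", "Country", "NULL_STRING_TAG"),
    ("device", "Device", "NULL_STRING_TAG"),
    ("Query without accent", "Query without accent", "NULL_STRING_TAG"),
]
METRICS = {"Clicks": "Clicks", "Position": "Position", "impressions": "Impressions"}

if __name__ == "__main__":
    # Example: a request body that keeps only the first two grouping levels.
    body = {"size": 0, "aggregations": nested_terms(LEVELS[:2], METRICS)}
    print(json.dumps(body, indent=2))
```

The printed body can be passed to curl with -d; the "query" part of the original request would have to be added back by hand.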

Can we look at this for Dremio 2.1? There’s a ton of improvements that have been made related to ES, and I think it would be best to investigate based on this version.

Thx Kelly
I received the announcement of 2.1 this morning. Unfortunately, I need to go to production soon to validate my proof of concept and can’t handle a Dremio/ES migration in between.
I would therefore appreciate advice on the current situation.

How are you running the query through Dremio? Through the SQL console?

What do you mean by “refresh” in your description above?

When you say a dataset that runs in 12 min, what do you mean? Do you mean the VDS you created in Dremio takes 12 min to return SELECT *?

Can you share a query profile?

If you aren’t using data reflections then it may be the case most of the work is done in ES - the query profile will help us see where cycles are being spent.

Hi
I ran the query in 3 different ways:

  • From the Dremio interface: 12 minutes before I get results in the panel
  • From the Tableau software client: 12 minutes before I get my extract
  • From a curl command run from my PC: 3 minutes before the results start to scroll

The goal is to run it in Tableau (extract mode)
--> I want to retrieve as few rows as possible in Tableau, which is why my queries are GROUP BY queries.

By “X minutes to refresh” I mean X minutes before the results start to come back.

Please find the query profile attached.
The DSL query I found in the profile’s Planning tab is the query I used for my curl test:
curl -XGET 'server:port/index/_search' -H 'Content-Type: application/json' -d '
{
"query" : {

6db7f5e5-6e71-4fc3-bfa7-90e88707c284.zip (14.8 KB)

Unfortunately, I don’t seem to be able to download that profile. Would you mind attaching again?

c36f9b4e-d86b-4e6b-b1c1-0a60a1e5745f.zip (13.9 KB)

It looks like most of the time is spent waiting on ES to return results to Dremio. It is unclear why this is happening.

Thanks for the alert
Can you tell me where in the profile you see this, so I can give my architect some guidelines to investigate?

Thx

Hi
I intercepted the query (Dremio and curl versions) with tcpdump and Wireshark. The results below show that the request Dremio sends to ES carries specific shard-related parameters.
Can somebody explain this to me (the reason and the impact)?

#curl
#url
http://XXXXX:9200/gge-gcsdata-index/_search
#json request body
{
  "size" : 0,
  "query" : {
    "match_all" : { }
  },
  "aggregations" : {
    "Date" : {
      "terms" : {
        "field" : "Date",
        "missing" : "01/01/1970",
        "size" : 2147483647
      }
    }
  }
}
#dremio
#url
http://XXXXX:9200/gge-gcsdata-index/generictype/_search?preference=_shards%3A9&scroll=1000000ms
#json request body
{
  "size" : 0,
  "query" : {
    "match_all" : { }
  },
  "aggregations" : {
    "Date" : {
      "terms" : {
        "field" : "Date",
        "missing" : "01/01/1970",
        "size" : 2147483647
      }
    }
  }
}
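
The two request bodies are identical, so the difference is entirely in the URL. It can be decoded with Python's standard library, as in the sketch below (the variable and function names are illustrative only). As far as the Elasticsearch documentation describes them, preference=_shards:N restricts a search to the listed shard(s) and scroll=<duration> asks ES to keep a search context alive for that long. Read that way, the captured Dremio request targets shard 9 of the index only, whereas the curl request runs against the whole index in one call. Whether Dremio sends one such request per shard is an assumption that the capture shown here does not confirm by itself.

```python
from urllib.parse import urlsplit, parse_qs

# URLs copied from the capture above (the host is masked as XXXXX in the post).
CURL_URL = "http://XXXXX:9200/gge-gcsdata-index/_search"
DREMIO_URL = (
    "http://XXXXX:9200/gge-gcsdata-index/generictype/_search"
    "?preference=_shards%3A9&scroll=1000000ms"
)


def describe(url):
    """Return the URL path and its decoded query parameters."""
    parts = urlsplit(url)
    # parse_qs percent-decodes values, so "%3A" becomes ":".
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.path, params


curl_path, curl_params = describe(CURL_URL)
dremio_path, dremio_params = describe(DREMIO_URL)

# Scroll keep-alive: strip the trailing "ms" and convert to minutes.
scroll_minutes = int(dremio_params["scroll"][:-2]) / 1000 / 60

print("curl   :", curl_path, curl_params)
print("dremio :", dremio_path, dremio_params)
print("scroll keep-alive (minutes):", round(scroll_minutes, 1))
```

Besides the two query parameters, the Dremio path also includes the mapping type (generictype), which the curl request leaves out.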