Dremio performance on Elasticsearch cluster

xbry · September 6, 2018, 9:32am

Hi,
We use Dremio 2.05 on Elasticsearch 5.6.9.
Dremio runs on its own server with 4 CPUs and 8 GB of RAM.
ES is distributed across a 3-server cluster, each server with 4 CPUs and 16 GB of RAM.
The 3 ES nodes are data nodes, and one of them is also the master node.

I built a dataset that takes 12 minutes to run from Dremio.
I retrieved the DSL query (see below) and ran it as a curl command:

From the Dremio server itself, it took 3 minutes to return results.
From the ES master node, it also took 3 minutes to return results.

From this I deduce that I have no network problem between the Dremio and ES machines.

During all these tests I monitored resource usage on each ES node, and what I can tell is:
- When querying from the Dremio app, the CPUs are barely used during the refresh period.
- During the curl tests, the CPUs are working hard.

In my ES source definition I listed only the ES master node. I also tried listing all the nodes, but the refresh time is the same.

Can someone help me understand this performance gap when using the Dremio app?
Please don’t start by suggesting reflections: I want to make the Dremio app work properly before going further with other features.

Here is the DSL query:

=[{
  "size" : 0,
  "query" : {
    "bool" : {
      "must" : [ {
        "bool" : {
          "should" : [ {
            "match" : {
              "UC" : {
                "query" : "DOMUS-ACADEMY",
                "type" : "boolean"
              }
            }
          }, {
            "match" : {
              "UC" : {
                "query" : "DOMUS-ACADEMY-LANDING",
                "type" : "boolean"
              }
            }
          }, {
            "match" : {
              "UC" : {
                "query" : "NABA-LP",
                "type" : "boolean"
              }
            }
          }, {
            "match" : {
              "UC" : {
                "query" : "NABA",
                "type" : "boolean"
              }
            }
          }, {
            "match" : {
              "UC" : {
                "query" : "ISTITUTO-MARANGONI",
                "type" : "boolean"
              }
            }
          } ]
        }
      }, {
        "range" : {
          "Date" : {
            "from" : "2016-09-06T00:00:00.000Z",
            "to" : null,
            "format" : "date_time",
            "include_lower" : true,
            "include_upper" : true
          }
        }
      } ]
    }
  },
  "aggregations" : {
    "UC" : {
      "terms" : {
        "field" : "UC",
        "missing" : "NULL_STRING_TAG",
        "size" : 2147483647
      },
      "aggregations" : {
        "Year" : {
          "terms" : {
            "field" : "Year",
            "missing" : -2147483648,
            "size" : 2147483647
          },
          "aggregations" : {
            "Month" : {
              "terms" : {
                "field" : "Month",
                "missing" : -2147483648,
                "size" : 2147483647
              },
              "aggregations" : {
                "Monthlabel" : {
                  "terms" : {
                    "field" : "MonthLabel",
                    "missing" : "NULL_STRING_TAG",
                    "size" : 2147483647
                  },
                  "aggregations" : {
                    "week" : {
                      "terms" : {
                        "field" : "Week",
                        "missing" : -2147483648,
                        "size" : 2147483647
                      },
                      "aggregations" : {
                        "continent" : {
                          "terms" : {
                            "field" : "Continent",
                            "missing" : "NULL_STRING_TAG",
                            "size" : 2147483647
                          },
                          "aggregations" : {
                            "country" : {
                              "terms" : {
                                "field" : "Country",
                                "missing" : "NULL_STRING_TAG",
                                "size" : 2147483647
                              },
                              "aggregations" : {
                                "device" : {
                                  "terms" : {
                                    "field" : "Device",
                                    "missing" : "NULL_STRING_TAG",
                                    "size" : 2147483647
                                  },
                                  "aggregations" : {
                                    "Query without accent" : {
                                      "terms" : {
                                        "field" : "Query without accent",
                                        "missing" : "NULL_STRING_TAG",
                                        "size" : 2147483647
                                      },
                                      "aggregations" : {
                                        "Clicks" : {
                                          "sum" : {
                                            "field" : "Clicks"
                                          }
                                        },
                                        "Position" : {
                                          "sum" : {
                                            "field" : "Position"
                                          }
                                        },
                                        "impressions" : {
                                          "sum" : {
                                            "field" : "Impressions"
                                          }
                                        }
                                      }
                                    }
                                  }
                                }
                              }
                            }
                          }
                        }
                      }
                    }
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}]
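
The aggregation part of the query above follows a regular pattern: one terms aggregation per grouping column, each nested inside the previous one, with the sums at the innermost level. A minimal Python sketch of that pattern is shown here; it may help when rebuilding the query with fewer levels to see which grouping is expensive. The helper name nested_terms is hypothetical, only the first three levels are listed, and the filter is replaced by match_all to keep the example short.

```python
import json

INT_MAX = 2147483647  # bucket limit used in the query above


def nested_terms(levels, metrics):
    """Build nested terms aggregations, outermost level first.

    levels:  list of (agg_name, field, missing_value) tuples
    metrics: dict of agg_name -> field, summed at the innermost level
    """
    # Innermost layer: one sum aggregation per metric.
    aggs = {name: {"sum": {"field": field}} for name, field in metrics.items()}
    # Wrap from the innermost grouping outwards.
    for name, field, missing in reversed(levels):
        aggs = {
            name: {
                "terms": {"field": field, "missing": missing, "size": INT_MAX},
                "aggregations": aggs,
            }
        }
    return aggs


# First three grouping levels of the query above (assumption: shortened list).
levels = [
    ("UC", "UC", "NULL_STRING_TAG"),
    ("Year", "Year", -2147483648),
    ("Month", "Month", -2147483648),
]
metrics = {"Clicks": "Clicks", "Position": "Position", "impressions": "Impressions"}

# Assumption: match_all stands in for the bool/range filter of the real query.
body = {
    "size": 0,
    "query": {"match_all": {}},
    "aggregations": nested_terms(levels, metrics),
}
print(json.dumps(body, indent=2))
```

Adding the remaining levels to the list one at a time and timing each version is one way to check whether a particular grouping dominates the run time.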

kelly · September 6, 2018, 2:50pm

Can we look at this for Dremio 2.1? There’s a ton of improvements that have been made related to ES, and I think it would be best to investigate based on this version.

xbry · September 6, 2018, 4:33pm

Thanks Kelly.
I received the announcement of 2.1 this morning. Unfortunately, I need to go to production shortly to validate my proof of concept, and I can’t handle a Dremio/ES migration in the meantime.
I would therefore appreciate advice on the current situation.

kelly · September 6, 2018, 4:52pm

How are you running the query through Dremio? Through the SQL console?

What do you mean by “refresh” in your description above?

When you say a dataset that runs in 12 min, what do you mean? Do you mean the VDS you created in Dremio takes 12 min to return SELECT *?

Can you share a query profile?

If you aren’t using data reflections then it may be the case most of the work is done in ES - the query profile will help us see where cycles are being spent.

xbry · September 7, 2018, 7:27am

Hi,
I ran the query in 3 different ways:

From the Dremio interface: 12 minutes before I get results in the panel
From the Tableau client: 12 minutes before I get my extract
From a curl command run from my PC: 3 minutes before the results start to scroll

The goal is to run it in Tableau (extract mode).
–> I want to retrieve as few rows as possible in Tableau, which is why my queries are GROUP BY queries.

By “X minutes to refresh” I mean X minutes before results start to be returned.

Please find attached the query profile.
The DSL query I found in the profile’s planning menu is the query I used for my curl test:
curl -XGET 'server:port/index/_search' -H 'Content-Type: application/json' -d '
{
"query" : {
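
The curl test above can also be reproduced from Python so that the wall-clock time is measured the same way on every machine. This is only a sketch: the function name timed_search is hypothetical, the URL in the usage comment is a placeholder, and the request is sent as a POST (which the _search endpoint also accepts) rather than a GET with a body.

```python
import json
import time
import urllib.request


def timed_search(url, body, opener=urllib.request.urlopen):
    """Send a search body to Elasticsearch; return (seconds, parsed JSON)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    start = time.perf_counter()
    with opener(request) as response:
        payload = response.read()  # wait until the full response has arrived
    elapsed = time.perf_counter() - start
    return elapsed, json.loads(payload)


# Usage (placeholder host, port and index; replace with real values):
#   seconds, result = timed_search("http://server:9200/index/_search", body)
#   print(f"{seconds:.1f} s wall clock, {result['took']} ms reported by ES")
```

Comparing the wall-clock time with the took value in the response (milliseconds spent inside ES) separates time spent in the cluster from time spent transferring and parsing the result.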

xbry · September 7, 2018, 7:29am

6db7f5e5-6e71-4fc3-bfa7-90e88707c284.zip (14,8 Ko)

kelly · September 10, 2018, 4:18pm

Unfortunately, I don’t seem to be able to download that profile. Would you mind attaching again?

xbry · September 11, 2018, 9:03am

c36f9b4e-d86b-4e6b-b1c1-0a60a1e5745f.zip (13,9 Ko)

kelly · September 12, 2018, 5:27am

It looks like most of the time is spent waiting on ES to return results to Dremio. It is unclear why this is happening.

xbry · September 12, 2018, 7:26am

Thanks for the alert.
Can you tell me where in the profile you see this, so I can give my architect some guidelines for investigating?

Thx

xbry · September 21, 2018, 11:01am

Hi,
I intercepted the query (Dremio and curl versions) with tcpdump and Wireshark, and the results below show that the request sent to ES by Dremio carries specific shard parameters.
Can somebody explain this to me (the reason and the impact)?

#curl
#url
http://XXXXX:9200/gge-gcsdata-index/_search
#json request body
{
  "size" : 0,
  "query" : {
    "match_all" : { }
  },
  "aggregations" : {
    "Date" : {
      "terms" : {
        "field" : "Date",
        "missing" : "01/01/1970",
        "size" : 2147483647
      }
    }
  }
}
#dremio
#url
http://XXXXX:9200/gge-gcsdata-index/generictype/_search?preference=_shards%3A9&scroll=1000000ms
#json request body
{
  "size" : 0,
  "query" : {
    "match_all" : { }
  },
  "aggregations" : {
    "Date" : {
      "terms" : {
        "field" : "Date",
        "missing" : "01/01/1970",
        "size" : 2147483647
      }
    }
  }
}
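
The two request bodies are identical, so the difference is entirely in the URL. A small Python sketch that decodes both URLs makes it explicit. In Elasticsearch, preference=_shards:9 restricts the search to shard 9 and scroll keeps the search context open for the given duration; that Dremio therefore sends one such request per shard is an assumption drawn from this single capture, not something confirmed in this thread. The helper name describe is hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# URLs copied from the capture above.
curl_url = "http://XXXXX:9200/gge-gcsdata-index/_search"
dremio_url = (
    "http://XXXXX:9200/gge-gcsdata-index/generictype/_search"
    "?preference=_shards%3A9&scroll=1000000ms"
)


def describe(url):
    """Return the path and the decoded query parameters of a URL."""
    parts = urlsplit(url)
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.path, params


curl_path, curl_params = describe(curl_url)
dremio_path, dremio_params = describe(dremio_url)

print(curl_path, curl_params)
# /gge-gcsdata-index/_search {}
print(dremio_path, dremio_params)
# /gge-gcsdata-index/generictype/_search {'preference': '_shards:9', 'scroll': '1000000ms'}
```

The Dremio request also names the mapping type (generictype) in the path, whereas the curl request searches the whole index. Repeating the curl test with the same path and parameters would show whether they account for the difference in behaviour.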
