If you view the profile in the Jobs page, you can click on the Planning tab. Scroll down to the section that say “Final Physical Transformation”. You will see a tree of relation operators, one of which is the “ElasticScan”. Next to that, you will see a json string which represent the Elasticsearch query which dremio sent to Elasticsearch. For example, in the first query you tried:
03-02 ElasticScan(table=[[ES GCS Data, gge-gcsdata-index, generictype]], resource=[gge-gcsdata-index/generictype], columns=[[`*`]], pushdown
=[{
"size" : 0,
"query" : {
"match_all" : { }
},
"aggregations" : {
"EXPR$0" : {
"terms" : {
"script" : {
"inline" : "(def) ((doc[\"Date\"].empty) ? null : doc[\"Date\"].date.monthOfYear)",
"lang" : "painless"
},
"missing" : -9223372036854775808,
"size" : 2147483647
}
}
}
}])
This means that Dremio pushed the aggregation to Elasticsearch, which was able to compute it very quickly by using the index, and then only had to send the much smaller result to Dremio.
For the second query you tried, you will see that we did not push down the aggregate:
03-04 ElasticScan(table=[[ES GCS Data, gge-gcsdata-index, generictype]], resource=[gge-gcsdata-index/generictype], columns=[[`Date`]], pushdown
=[{
"from" : 0,
"size" : 4000,
"query" : {
"match_all" : { }
}
}])
This means that Dremio is pulling all of the data from Elasticsearch. Reading and parsing all of the data from elasticsearch takes up the vast majority of the total query time.
And the fact that I used a subquery isn’t really important, I just did it that way to avoid typing extract(month) over and over again, but you could also write the query this way:
SELECT CASE Extract(month FROM "date")
WHEN 1 THEN 'January'
WHEN 2 THEN 'February'
WHEN 3 THEN 'March'
WHEN 4 THEN 'April'
WHEN 5 THEN 'May'
WHEN 6 THEN 'June'
WHEN 7 THEN 'July'
WHEN 8 THEN 'August'
WHEN 9 THEN 'September'
WHEN 10 THEN 'October'
WHEN 11 THEN 'November'
WHEN 12 THEN 'December'
END AS "Month",
Count(*) AS "COUNT"
FROM "ES GCS Data"."gge-gcsdata-index".generictype
GROUP BY Extract(month FROM "date")
The key is we already saw that Dremio was able to push down the aggregate when the group by was just the extract function, but when we tried to group by the to_char function or the case statement, we were unable to pushdown. We don’t really need to convert to the name of the month in order to do the grouping, so it makes sense to just group by the simpler extract month function, and then perform the conversion to a string on the smaller result.