ES Query with timestamp transformation

xbry · June 29, 2018, 5:00pm

HI
I use the community edition 1.3.14 on Elasticsearch 5.6.9, and i’m new to both technologies. Eventually i will plug Tableau and dremio but at the moment i’m facing a performance problem transforming a timestamp field into a month format
here is the field as defined in my ES index
“Date”:{“type”:“date”,“format”:“dd/MM/yyyy”},

Dremio recognises it as a date/time format.

There is about 15 millions rows in the index at the moment

I want to aggregate data on month level and i ran this query that took 1 second

select Extract(Month from “Date”), count(*)
from “ES GCS Data”.“gge-gcsdata-index”.generictype
group by
Extract(Month from “Date”)

But i want the month label so i ran this query instead and it took 3 minutes

select to_char(CAST(“Date” As date),‘MONTH’), count(*)
from “ES GCS Data”.“gge-gcsdata-index”.generictype
group by
to_char(CAST(“Date” As date),‘MONTH’)

I’m wondering how dremio planned both queries, why the time to run are so different and how i should code my query to get the month label aggregation quickly.

kelly · June 29, 2018, 11:23pm

First, I would strong suggest upgrading Dremio to the latest. You are several months and a major version behind.

Also, I would suggest creating a data reflection on this dataset to improve performance. Here’s a tutorial.

Can you share the two job profiles and we can take a look?

You could also potentially rewrite the query to apply the month label as a lookup or CASE statement.

xbry · June 30, 2018, 1:04pm

Hi
I tried the CASE WHEN and it went as long as the CAST, about 3’.
I join to the post the query profil of each try.

I will investigate with reflexion later on but for the moment i need to understand how Dremio plans the query and why ES reacts this way for i can develop performant queries. Profile-Query-CASE-function.zip (8,8 Ko)
Profile-Query-CAST-function.zip (8,2 Ko)

Thanks

xbry · June 30, 2018, 1:06pm

Profile-Qyery-Extract-function.zip (7,4 Ko)

steven · June 30, 2018, 8:42pm

The second query is slow because it looks like Dremio is unable to push down the aggregation to Elastic search when the group by clause has the to_char() function.

Could you try something like this:

select CASE when m=1 then ‘January’
when m=2 then ‘February’
when m=3 then ‘March’
when m=4 then ‘April’
when m=5 then ‘May’
when m=6 then ‘June’
when m=7 then ‘July’
when m=8 then ‘August’
when m=9 then ‘September’
when m=10 then ‘October’
when m=11 then ‘November’
when m=12 then ‘December’
end as “month”,
cnt
from
(select Extract(Month from “Date”) as m, count(*) as cnt
from “ES GCS Data”.“gge-gcsdata-index”.generictype
group by
Extract(Month from “Date”))

xbry · July 1, 2018, 10:21am

Hi Steven
You are right and the query is far more performant this way.
I’m a bit confused there because sub querying is normally not the right way to perform with SQL.
Could you tell me how you investigated my logs and how you concluded that the CAST was not pushed down and finally explain to me how was the data transformation flow handled from the query to ES to the delivery by Dremio.
I must be able to perform investigations by myself in the future because i realize that what i consider as good SQL does not really fits dremio.

Thanks a lot

Xavier

steven · July 4, 2018, 7:04pm

If you view the profile in the Jobs page, you can click on the Planning tab. Scroll down to the section that say “Final Physical Transformation”. You will see a tree of relation operators, one of which is the “ElasticScan”. Next to that, you will see a json string which represent the Elasticsearch query which dremio sent to Elasticsearch. For example, in the first query you tried:

03-02                            ElasticScan(table=[[ES GCS Data, gge-gcsdata-index, generictype]], resource=[gge-gcsdata-index/generictype], columns=[[`*`]], pushdown
 =[{
  "size" : 0,
  "query" : {
    "match_all" : { }
  },
  "aggregations" : {
    "EXPR$0" : {
      "terms" : {
        "script" : {
          "inline" : "(def) ((doc[\"Date\"].empty) ? null : doc[\"Date\"].date.monthOfYear)",
          "lang" : "painless"
        },
        "missing" : -9223372036854775808,
        "size" : 2147483647
      }
    }
  }
}])

This means that Dremio pushed the aggregation to Elasticsearch, which was able to compute it very quickly by using the index, and then only had to send the much smaller result to Dremio.

For the second query you tried, you will see that we did not push down the aggregate:

03-04                                ElasticScan(table=[[ES GCS Data, gge-gcsdata-index, generictype]], resource=[gge-gcsdata-index/generictype], columns=[[`Date`]], pushdown
 =[{
  "from" : 0,
  "size" : 4000,
  "query" : {
    "match_all" : { }
  }
}])

This means that Dremio is pulling all of the data from Elasticsearch. Reading and parsing all of the data from elasticsearch takes up the vast majority of the total query time.

And the fact that I used a subquery isn’t really important, I just did it that way to avoid typing extract(month) over and over again, but you could also write the query this way:

SELECT CASE Extract(month FROM "date") 
         WHEN 1 THEN 'January' 
         WHEN 2 THEN 'February' 
         WHEN 3 THEN 'March' 
         WHEN 4 THEN 'April' 
         WHEN 5 THEN 'May' 
         WHEN 6 THEN 'June' 
         WHEN 7 THEN 'July' 
         WHEN 8 THEN 'August' 
         WHEN 9 THEN 'September' 
         WHEN 10 THEN 'October' 
         WHEN 11 THEN 'November' 
         WHEN 12 THEN 'December' 
       END AS "Month", 
       Count(*) AS "COUNT"
FROM   "ES GCS Data"."gge-gcsdata-index".generictype 
GROUP  BY Extract(month FROM "date")

The key is we already saw that Dremio was able to push down the aggregate when the group by was just the extract function, but when we tried to group by the to_char function or the case statement, we were unable to pushdown. We don’t really need to convert to the name of the month in order to do the grouping, so it makes sense to just group by the simpler extract month function, and then perform the conversion to a string on the smaller result.

xbry · July 5, 2018, 3:50pm

Hi Steven thx for the update.
I’m now able to check the pushdown after each query preview and make sure it is pushing the maximum of the computation.
Could you just give me your point of view on this.
I ran this query with a group by and a filter in the where clause.
SELECT
“UC”,
“Continent”,
“Country”,
“Device”,
count(*)
FROM “ES GCS Data”.“gge-gcsdata-index”.generictype
where “Country”=‘france’
GROUP BY
“UC”,
“Continent”,
“Country”,
“Device”

And notice Dremio does not push down the group by

[{
“from” : 0,
“size” : 1000,
“query” : {
“match” : {
“Country” : {
“query” : “france”,
“type” : “boolean”
}
}
}
}]

So he’s doing the group by by itself.

Diging deeper i noticed that if i filter the query

with a character field in the where clause
and that field is also in the group by section
dremio won’t push down the group by to ES

If i keep filtering

with a character field in the where clause
and that field is not in the group by section
dremio is pushing down the group by to ES like with the following query

SELECT
“UC”,
“Continent”,
“Device”,
count(*)
FROM “ES GCS Data”.“gge-gcsdata-index”.generictype
where “Country”=‘france’
GROUP BY
“UC”,
“Continent”,
“Device”

[{
“size” : 0,
“query” : {
“match” : {
“Country” : {
“query” : “france”,
“type” : “boolean”
}
}
},
“aggregations” : {
“UC” : {
“terms” : {
“field” : “UC”,
“missing” : “NULL_STRING_TAG”,
“size” : 2147483647
},
“aggregations” : {
“Continent” : {
“terms” : {
“field” : “Continent”,
“missing” : “NULL_STRING_TAG”,
“size” : 2147483647
},
“aggregations” : {
“Device” : {
“terms” : {
“field” : “Device”,
“missing” : “NULL_STRING_TAG”,
“size” : 2147483647
}
}
}
}
}
}
}
}]

And i have not that problem when the filter id done on an integer field.

Topic		Replies	Views
Query pushdown on elasticsearch	12	1824	August 22, 2019
Elasticsearch "Failed to parse date time value" bug	5	2764	November 9, 2018
DSL query executed on ES	1	1070	August 24, 2018
Dremio performance on Elasticsearch cluster	10	1429	September 21, 2018
Questions around Elasticsearch-Dremio integration	1	1596	November 14, 2018

ES Query with timestamp transformation

Related topics