ES Query with timestamp transformation

Hi
I use the Community Edition 1.3.14 on Elasticsearch 5.6.9, and I'm new to both technologies. Eventually I will plug Tableau into Dremio, but at the moment I'm facing a performance problem transforming a timestamp field into a month format.
Here is the field as defined in my ES index:
"Date":{"type":"date","format":"dd/MM/yyyy"},

Dremio recognises it as a date/time format.

There are about 15 million rows in the index at the moment.

I want to aggregate data at the month level, and I ran this query, which took 1 second:

select Extract(Month from "Date"), count(*)
from "ES GCS Data"."gge-gcsdata-index".generictype
group by
Extract(Month from "Date")

But I want the month label, so I ran this query instead and it took 3 minutes:

select to_char(CAST("Date" AS date), 'MONTH'), count(*)
from "ES GCS Data"."gge-gcsdata-index".generictype
group by
to_char(CAST("Date" AS date), 'MONTH')

I'm wondering how Dremio planned both queries, why the run times are so different, and how I should write my query to get the month-label aggregation quickly.

First, I would strongly suggest upgrading Dremio to the latest version. You are several months and a major version behind. :slight_smile:

Also, I would suggest creating a data reflection on this dataset to improve performance. Here’s a tutorial.

Can you share the two job profiles and we can take a look?

You could also potentially rewrite the query to apply the month label as a lookup or CASE statement.
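For example, the lookup version could join the extracted month number against a small inline table of month names. This is just a sketch (the aliases `t`, `m`, `mnum`, and `label` are arbitrary names, not anything from your dataset):

```sql
-- Sketch of the "lookup" approach: aggregate on the month number,
-- then join the small result against an inline table of month labels.
SELECT m.label, t.cnt
FROM (SELECT Extract(Month FROM "Date") AS mnum, count(*) AS cnt
      FROM "ES GCS Data"."gge-gcsdata-index".generictype
      GROUP BY Extract(Month FROM "Date")) t
JOIN (VALUES (1, 'January'), (2, 'February'), (3, 'March'), (4, 'April'),
             (5, 'May'), (6, 'June'), (7, 'July'), (8, 'August'),
             (9, 'September'), (10, 'October'), (11, 'November'), (12, 'December'))
     AS m(mnum, label)
  ON t.mnum = m.mnum
```

The join only touches the 12-row aggregated result, so it should not affect whether the aggregation itself can be pushed down.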

Hi
I tried the CASE WHEN and it was as slow as the CAST, about 3 minutes.
I have attached the query profile of each attempt to this post.

I will investigate reflections later on, but for the moment I need to understand how Dremio plans the query and why ES reacts this way, so that I can develop performant queries. Profile-Query-CASE-function.zip (8.8 KB)
Profile-Query-CAST-function.zip (8.2 KB)

Thanks

Profile-Qyery-Extract-function.zip (7.4 KB)

The second query is slow because Dremio is unable to push down the aggregation to Elasticsearch when the GROUP BY clause contains the to_char() function.

Could you try something like this:

select CASE when m=1 then 'January'
when m=2 then 'February'
when m=3 then 'March'
when m=4 then 'April'
when m=5 then 'May'
when m=6 then 'June'
when m=7 then 'July'
when m=8 then 'August'
when m=9 then 'September'
when m=10 then 'October'
when m=11 then 'November'
when m=12 then 'December'
end as "month",
cnt
from
(select Extract(Month from "Date") as m, count(*) as cnt
from "ES GCS Data"."gge-gcsdata-index".generictype
group by
Extract(Month from "Date"))

Hi Steven
You are right, and the query is far more performant this way.
I'm a bit confused, though, because subquerying is normally not considered the right way to write SQL.
Could you tell me how you investigated my profiles and concluded that the CAST was not pushed down, and explain how the data transformation flow was handled, from the query sent to ES through to the delivery of results by Dremio?
I need to be able to perform these investigations myself in the future, because I realize that what I consider good SQL does not really fit Dremio.

Thanks a lot

Xavier

If you view the profile in the Jobs page, you can click on the Planning tab. Scroll down to the section that says "Final Physical Transformation". You will see a tree of relational operators, one of which is the "ElasticScan". Next to it, you will see a JSON string which represents the Elasticsearch query that Dremio sent to Elasticsearch. For example, in the first query you tried:

03-02                            ElasticScan(table=[[ES GCS Data, gge-gcsdata-index, generictype]], resource=[gge-gcsdata-index/generictype], columns=[[`*`]], pushdown=[{
  "size" : 0,
  "query" : {
    "match_all" : { }
  },
  "aggregations" : {
    "EXPR$0" : {
      "terms" : {
        "script" : {
          "inline" : "(def) ((doc[\"Date\"].empty) ? null : doc[\"Date\"].date.monthOfYear)",
          "lang" : "painless"
        },
        "missing" : -9223372036854775808,
        "size" : 2147483647
      }
    }
  }
}])

This means that Dremio pushed the aggregation to Elasticsearch, which was able to compute it very quickly by using the index, and then only had to send the much smaller result to Dremio.

For the second query you tried, you will see that the aggregate was not pushed down:

03-04                                ElasticScan(table=[[ES GCS Data, gge-gcsdata-index, generictype]], resource=[gge-gcsdata-index/generictype], columns=[[`Date`]], pushdown=[{
  "from" : 0,
  "size" : 4000,
  "query" : {
    "match_all" : { }
  }
}])

This means that Dremio is pulling all of the data from Elasticsearch. Reading and parsing all of that data takes up the vast majority of the total query time.

And the fact that I used a subquery isn't really important; I just did it that way to avoid typing extract(month) over and over again. You could also write the query this way:

SELECT CASE Extract(month FROM "date") 
         WHEN 1 THEN 'January' 
         WHEN 2 THEN 'February' 
         WHEN 3 THEN 'March' 
         WHEN 4 THEN 'April' 
         WHEN 5 THEN 'May' 
         WHEN 6 THEN 'June' 
         WHEN 7 THEN 'July' 
         WHEN 8 THEN 'August' 
         WHEN 9 THEN 'September' 
         WHEN 10 THEN 'October' 
         WHEN 11 THEN 'November' 
         WHEN 12 THEN 'December' 
       END AS "Month", 
       Count(*) AS "COUNT"
FROM   "ES GCS Data"."gge-gcsdata-index".generictype 
GROUP  BY Extract(month FROM "date")

The key is that we already saw Dremio was able to push down the aggregate when the GROUP BY was just the extract function, but when we tried to group by the to_char function or the CASE statement, it was unable to push down. We don't really need to convert to the name of the month in order to do the grouping, so it makes sense to group by the simpler extract-month expression and then perform the conversion to a string on the much smaller result.


Hi Steven, thanks for the update.
I'm now able to check the pushdown after each query preview and make sure Dremio is pushing down as much of the computation as possible.
Could you just give me your point of view on this?
I ran this query with a group by and a filter in the where clause:
SELECT
"UC",
"Continent",
"Country",
"Device",
count(*)
FROM "ES GCS Data"."gge-gcsdata-index".generictype
where "Country"='france'
GROUP BY
"UC",
"Continent",
"Country",
"Device"

and noticed that Dremio does not push down the group by:

[{
  "from" : 0,
  "size" : 1000,
  "query" : {
    "match" : {
      "Country" : {
        "query" : "france",
        "type" : "boolean"
      }
    }
  }
}]

So Dremio is doing the group by itself.

Digging deeper, I noticed that if I filter the query

  • with a character field in the where clause
  • and that field is also in the group by clause,
    Dremio won't push down the group by to ES.

If I keep filtering

  • with a character field in the where clause
  • and that field is not in the group by clause,
    Dremio does push the group by down to ES, as in the following query:

SELECT
"UC",
"Continent",
"Device",
count(*)
FROM "ES GCS Data"."gge-gcsdata-index".generictype
where "Country"='france'
GROUP BY
"UC",
"Continent",
"Device"

[{
  "size" : 0,
  "query" : {
    "match" : {
      "Country" : {
        "query" : "france",
        "type" : "boolean"
      }
    }
  },
  "aggregations" : {
    "UC" : {
      "terms" : {
        "field" : "UC",
        "missing" : "NULL_STRING_TAG",
        "size" : 2147483647
      },
      "aggregations" : {
        "Continent" : {
          "terms" : {
            "field" : "Continent",
            "missing" : "NULL_STRING_TAG",
            "size" : 2147483647
          },
          "aggregations" : {
            "Device" : {
              "terms" : {
                "field" : "Device",
                "missing" : "NULL_STRING_TAG",
                "size" : 2147483647
              }
            }
          }
        }
      }
    }
  }
}]

And I do not have that problem when the filter is done on an integer field.
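As an experiment, I wonder whether moving the filter outside the aggregation would let the group by be pushed down. This is an untested sketch, but since "Country" is one of the grouping keys, filtering after the aggregation should be semantically equivalent to filtering before it:

```sql
-- Untested sketch: aggregate first, then filter on the grouping key.
-- Equivalent to the original query because "Country" is in the GROUP BY.
SELECT *
FROM (
  SELECT "UC", "Continent", "Country", "Device", count(*) AS cnt
  FROM "ES GCS Data"."gge-gcsdata-index".generictype
  GROUP BY "UC", "Continent", "Country", "Device"
) t
WHERE "Country" = 'france'
```

I don't know yet whether Dremio plans this form differently; comparing its "Final Physical Transformation" section against the original query's would show whether the aggregation gets pushed down.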