Executing SQL group by queries on Elasticsearch datasources

Chitender_Kumar · May 25, 2018, 9:06am

we are trying to execute group by query on Elasticsearch with “where” clause included in statement. we are not able to get the output.

when we are executing the Group By from dashboard we are able to get output. but when we are getting group by query with where clause through SQL editor. we are not able to get the output. and in background Query keeps running.

kindly suggest

anthony · May 25, 2018, 1:16pm

Can you share the version of Elastic you are on? Can you also download/share a job profile please?

Chitender_Kumar · May 25, 2018, 7:40pm

version of elasticsearch is 2.3.3

kelly · May 26, 2018, 1:54am

Please share the queries as well.

Chitender_Kumar · May 26, 2018, 5:24am

SELECT pipid
FROM Test.pd_activity_v2.jdbc
WHERE vrtm is not null and cltp = ‘vital’
GROUP BY pipid

anthony · May 27, 2018, 1:54pm

That query works fine with me on ES. Can you download/share a job profile of the failed query please?

Chitender_Kumar · May 28, 2018, 4:40am

393d2ce4-ac81-4cb3-9e5c-1774bc430222.zip (11.2 KB)

find attached the query profile.

anthony · May 28, 2018, 1:23pm

Thanks for uploading. However, I see no errors. I also noticed you eventually canceled the job. Did you ever let it finish? There seems to be no issues, I would recommend it continue to run.

Additionally, it seems like most of the time is hung up on the initial first step of retrieving the data from Elastic. This is a bit surprising because it usually should actually take longer without a where clause, not with it. There could be something wrong with underlying data in Elastic. Can you share a job profile of it without a where clause?

Chitender_Kumar · May 28, 2018, 1:46pm

i had to cancel it as it did not gave me output after waiting for an hour. i am guessing issue due to GC cycles which get initiated when i execute the query. the index on which i am executing SQL query is of 290GB and has records above 240Millions. attaching the job profile which i am executing without where clause.5303111d-4fa6-49fe-b21d-fc5f5416a265.zip (9.7 KB)

anthony · May 28, 2018, 2:21pm

Thanks for the info. The first query profile you sent me was canceled within a couple minutes. The second query you sent me is still running.
To clarify - can you send me the query profile for the job that was running for ~hour with a where & a query that successfully completed without a where?

Chitender_Kumar · May 28, 2018, 4:33pm

attaching job profile in which we did not used where clause.0240a978-8cac-4984-995c-3d93f56355be.zip (12.4 KB)

just to mention, we received output in 2Hours which is very high.

Chitender_Kumar · May 28, 2018, 5:51pm

attaching job profile in which we included the where clause
b2c79020-67cd-44a9-83be-3b4f9c51e231.zip (12.7 KB)

anthony · May 28, 2018, 7:51pm

Thanks for the new profiles, exactly what I was looking for. Everything looks normal and I think there may be some confusion with the original post. Let me explain…
To begin with, what I saw in the profiles your just attached-

Includes Where clause: completes successfully, took ~21min, retrieves ~22mil records
Excludes Where clause: completes successfully, took ~2hours, retrieves the full ~250mil records you mentioned earlier

Your original post said you are not able to get an output when you add a Where clause. However, based on your profiles, you actually do get an output successfully. In fact, adding a Where clause speeds up the query from ~2hours to ~21min. This is in line with what I expected and mentioned in my earlier response here. By adding a Where clause, it should always run faster since we are scanning/processing less data. To summarize - everything looks good & normal!

Chitender_Kumar · May 29, 2018, 5:26am

ok, that clears all.

now i am concern about decreasing the turn around time for the group by query which include the where clause.

can you please suggest how can i improve the turn around time?

kelly · May 29, 2018, 12:31pm

I think what you’re seeing is that it takes ES a long time to scan all your data.

To speed things up, you can try creating one or more data reflections: https://docs.dremio.com/acceleration/reflections.html

You can decide if Raw or Aggregation reflections are appropriate. Sometimes you can make use of both. Note that the time to create the reflection the first time will probably be around 2 hours, but then subsequent queries that are accelerated should be much faster.

Here’s a tutorial: https://www.dremio.com/tutorials/getting-started-with-data-reflections/

can · May 30, 2018, 5:33pm

@Chitender_Kumar is pipid normalized/analyzed? Also what’s the data type?

Chitender_Kumar · May 31, 2018, 10:38am

hi,

pipid is analyzed and it is string.

can · May 31, 2018, 6:14pm

When looking at the query profile you’ve provided, we can see that Dremio pushes down the filters on vrtm and cltp into ES, but not the aggregations on the analyzed field (expected) – which is probably why this is taking longer than you expect. We end up having to read the data after the filters are applied before we can do teh aggregation in Dremio. As @anthony mentioned, the rate at which ES can return data is the bottleneck.

In general, Dremio doesn’t pushdown aggregations if the field is analyzed or normalized for correctness reasons. Imagine you have a value in this field for “Los Angeles”. When analyzed, this may be split into “los” and “angeles”. At this point, if we were to pushdown aggregation on this field instead of grouping on “Los Angeles”, you’d be grouping by both on “los” and “angeles” separately.

If this is a common use-case where you need performance, I’d follow @kelly’s suggestion above of using reflections.

Chitender_Kumar · May 31, 2018, 7:39pm

so by reflection means Dremio will hold the mirror copy of Elastic Indices. if yes, then i do need to update the reflections when ever ES indices get updated. Am i right or wrong?

can · May 31, 2018, 8:04pm

@Chitender_Kumar, raw reflections would hold the full dataset or a subset of fields in the dataset, where as aggregation reflections would hold summarized versions of the dataset. Check the documentation here. Yes, will need to have correct refresh policy defined based on your needs – or you can manually trigger refreshed. Check the documentation here.

Alternatively, you can create a duplicate field (e.g. pipid_not_analyzed) which is not analyzed in ES and use that in your group by instead.

Topic		Replies	Views
Dremio become slow when using where clause	1	1570	February 27, 2019
DSL query executed on ES	1	1087	August 24, 2018
ES Query with timestamp transformation	7	3789	July 5, 2018
Problem with query - Elastic Search 7.10	3	904	July 2, 2022
Elasticsearch SQL ERROR	3	1195	November 20, 2017

Executing SQL group by queries on Elasticsearch datasources

Related topics