Executing SQL group by queries on Elasticsearch datasources

Can you share the version of Elastic you are on? Can you also download/share a job profile please?

version of elasticsearch is 2.3.3

Please share the queries as well.

SELECT pipid
FROM Test.pd_activity_v2.jdbc
WHERE vrtm is not null and cltp = ‘vital’
GROUP BY pipid

That query works fine with me on ES. Can you download/share a job profile of the failed query please?

393d2ce4-ac81-4cb3-9e5c-1774bc430222.zip (11.2 KB)

find attached the query profile.

Thanks for uploading. However, I see no errors. I also noticed you eventually canceled the job. Did you ever let it finish? There seems to be no issues, I would recommend it continue to run.

Additionally, it seems like most of the time is hung up on the initial first step of retrieving the data from Elastic. This is a bit surprising because it usually should actually take longer without a where clause, not with it. There could be something wrong with underlying data in Elastic. Can you share a job profile of it without a where clause?

i had to cancel it as it did not gave me output after waiting for an hour. i am guessing issue due to GC cycles which get initiated when i execute the query. the index on which i am executing SQL query is of 290GB and has records above 240Millions. attaching the job profile which i am executing without where clause.5303111d-4fa6-49fe-b21d-fc5f5416a265.zip (9.7 KB)

Thanks for the info. The first query profile you sent me was canceled within a couple minutes. The second query you sent me is still running.
To clarify - can you send me the query profile for the job that was running for ~hour with a where & a query that successfully completed without a where?

attaching job profile in which we did not used where clause.0240a978-8cac-4984-995c-3d93f56355be.zip (12.4 KB)

just to mention, we received output in 2Hours which is very high.

attaching job profile in which we included the where clause
b2c79020-67cd-44a9-83be-3b4f9c51e231.zip (12.7 KB)

Thanks for the new profiles, exactly what I was looking for. Everything looks normal and I think there may be some confusion with the original post. Let me explain…
To begin with, what I saw in the profiles your just attached-

  1. Includes Where clause: completes successfully, took ~21min, retrieves ~22mil records
  2. Excludes Where clause: completes successfully, took ~2hours, retrieves the full ~250mil records you mentioned earlier

Your original post said you are not able to get an output when you add a Where clause. However, based on your profiles, you actually do get an output successfully. In fact, adding a Where clause speeds up the query from ~2hours to ~21min. This is in line with what I expected and mentioned in my earlier response here. By adding a Where clause, it should always run faster since we are scanning/processing less data. To summarize - everything looks good & normal!

ok, that clears all.

now i am concern about decreasing the turn around time for the group by query which include the where clause.

can you please suggest how can i improve the turn around time?

I think what you’re seeing is that it takes ES a long time to scan all your data.

To speed things up, you can try creating one or more data reflections: https://docs.dremio.com/acceleration/reflections.html

You can decide if Raw or Aggregation reflections are appropriate. Sometimes you can make use of both. Note that the time to create the reflection the first time will probably be around 2 hours, but then subsequent queries that are accelerated should be much faster.

Here’s a tutorial: https://www.dremio.com/tutorials/getting-started-with-data-reflections/

@Chitender_Kumar is pipid normalized/analyzed? Also what’s the data type?


pipid is analyzed and it is string.

When looking at the query profile you’ve provided, we can see that Dremio pushes down the filters on vrtm and cltp into ES, but not the aggregations on the analyzed field (expected) – which is probably why this is taking longer than you expect. We end up having to read the data after the filters are applied before we can do teh aggregation in Dremio. As @anthony mentioned, the rate at which ES can return data is the bottleneck.

In general, Dremio doesn’t pushdown aggregations if the field is analyzed or normalized for correctness reasons. Imagine you have a value in this field for “Los Angeles”. When analyzed, this may be split into “los” and “angeles”. At this point, if we were to pushdown aggregation on this field instead of grouping on “Los Angeles”, you’d be grouping by both on “los” and “angeles” separately.

If this is a common use-case where you need performance, I’d follow @kelly’s suggestion above of using reflections.

so by reflection means Dremio will hold the mirror copy of Elastic Indices. if yes, then i do need to update the reflections when ever ES indices get updated. Am i right or wrong?

@Chitender_Kumar, raw reflections would hold the full dataset or a subset of fields in the dataset, where as aggregation reflections would hold summarized versions of the dataset. Check the documentation here. Yes, will need to have correct refresh policy defined based on your needs – or you can manually trigger refreshed. Check the documentation here.

Alternatively, you can create a duplicate field (e.g. pipid_not_analyzed) which is not analyzed in ES and use that in your group by instead.

In case of fields with multiple configuration, is there a way to address the not_analyzed (aka .keyword) one?