I have an Elasticsearch 5.6 index with 112 million documents, with response times of about a second for various aggregate queries. I am testing Dremio, and the idea is to use it to join the index with a lookup table of 77 values stored in S3.
After creating a reflection with a few fields, the response time is about 20 seconds. That is better than the initial situation without a reflection, where queries never returned at all, but it is nowhere near the response time I need. Also, if I add a WHERE clause on a field that exists in the reflection, the reflection is not picked up and the query does not return.
I am probably doing something wrong but I am not sure what it is.
My cluster is one c4.2xlarge instance for the coordinator and one m5.4xlarge instance for the executor.
The underlying Elasticsearch source is a single c5.9xlarge node, also on EBS with no provisioned IOPS.
I am attaching the profile of the query that never returns, as well as the profile of the query that picks up the reflection but takes over 20 seconds to return.
Your heap size seems very large. The executor nodes don’t really use much heap, so you can safely turn it down to, say, 4GB or even 2GB. The larger the heap, the longer GC pauses can stall things.
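For reference, and assuming a standard tarball or RPM install, the heap for a node is set in conf/dremio-env; a minimal sketch (the exact direct-memory value depends on your instance size):

# conf/dremio-env on the executor node
# Cap the JVM heap at 4GB; executors do most of their work in direct memory.
DREMIO_MAX_HEAP_MEMORY_SIZE_MB=4096
# Give the bulk of the remaining RAM to direct memory (size to your instance).
DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=40960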
For the query that has a WHERE clause, the aggregation reflection won’t match, because it is aggregated over the unfiltered set. You may want to consider putting a raw reflection on the virtual data set. If you know you’re going to run common aggregations over specific filters, you can create virtual data sets with that filter baked in and put an aggregation reflection over each of those.
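As a rough illustration of the filtered-VDS approach (all names here are hypothetical, and the same reflections can equally be created through the UI rather than SQL):

-- A virtual data set with the commonly used filter baked in.
CREATE VDS myspace.trips_filtered AS
SELECT *
FROM elastic_source."my_index"."my_type"
WHERE "some_numeric_field" BETWEEN 10 AND 1440;

-- An aggregation reflection defined over the filtered VDS; queries that
-- apply the same filter and group by the dimension can then match it.
ALTER DATASET myspace.trips_filtered
CREATE AGGREGATE REFLECTION trips_filtered_agg
USING
DIMENSIONS ("some_dimension_field")
MEASURES ("some_numeric_field")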
For the query that is getting accelerated, the profile shows a 20 second wait before anything starts. I’m trying to figure out why the query is being delayed by 20 seconds before starting… Do you have other queries running? Is there load, contention, or something else blocking access to the storage where the reflections are saved?
A side note: you’re also writing to a file at the end, which takes a second or two. I’m guessing you’re running this query from the UI via a Run. If you run it through ODBC instead, it won’t cache the result to the file system, which will speed things up a bit too.
In fact, thinking about it: try setting the heap size to, say, 4GB and rerunning the query that takes 24 seconds. I want to see whether that 20 second wait at the start is consistent. It’s possible you were unlucky and, for example, got stuck waiting for a GC cycle to finish.
Christy, thank you for your reply and directions. It made sense when you said the 20 seconds was wait time: my application includes 19 widgets that can trigger simultaneous filter queries when users select their visualization elements, so the queries were all queued up. I increased the cluster resources and now they all execute within a couple of seconds.
The reflection is now also being picked up. It turns out that if I don’t join with the S3 lookup table, Dremio picks up the reflection, provided the WHERE clause uses a field included in the reflection. I can provide more details in this area if needed.
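For concreteness, the pattern looks roughly like this (table and column names are illustrative): the first query below is accelerated, while the second, which only adds the join to the S3 lookup table, is not.

-- Accelerated: the WHERE clause uses a field included in the reflection.
SELECT "pickup_community_area_desc", COUNT(*)
FROM myspace.trips
WHERE "trip_minutes" BETWEEN 10 AND 1439.98
GROUP BY "pickup_community_area_desc"

-- Not accelerated: the same query joined with the S3 lookup table.
SELECT l."description", COUNT(*)
FROM myspace.trips t
JOIN s3source.lookups."areas.csv" l
  ON t."pickup_community_area" = l."code"
WHERE t."trip_minutes" BETWEEN 10 AND 1439.98
GROUP BY l."description"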
Kelly, Christy, thanks for following up. I was happy to see my initial queries accelerated, and today I added a raw reflection to the virtual data set as Christy suggested; I believe that made the difference for the join queries. The joined dataset is now accelerated, as are queries using DATE_TRUNC(), which were not being picked up before. All of this is great.
Queries that include a WHERE clause like the one below still take a long time (19s) even though they are accelerated. My goal is to reach 1-2 seconds. I wonder whether I should add more processors, provision IOPS, do both, or do something else.
WHERE ("trip_minutes" BETWEEN 10 AND 1439.98)
I am attaching the profile of the query that takes 19s. On the cluster side, my executor node is a c5.9xlarge and the coordinator is a c4.2xlarge; both use AWS EBS volumes.
Again, there seems to be a 14 second delay between submitting the job and it actually starting. I’m going to guess, again, that there is a queueing issue going on here.
In the reflection, I would suggest the following (a rough SQL sketch follows the list):

- sorting on the trip_minutes column, since the query performs a range filter, and range filters improve when that column is sorted;
- partitioning on the pickup_community_area_desc column, since you are grouping by it.
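A minimal sketch of those two settings, assuming the raw reflection sits on a virtual data set called myspace.trips (the dataset and reflection names are placeholders, and the same options are available in the reflection editor in the UI):

-- Sort for the range filter on trip_minutes; partition for the GROUP BY.
ALTER DATASET myspace.trips
CREATE RAW REFLECTION trips_raw
USING
DISPLAY ("trip_minutes", "pickup_community_area_desc")
PARTITION BY ("pickup_community_area_desc")
LOCALSORT BY ("trip_minutes")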