How are records searched if they are not in a "reflection"?

Hi team,

I would like to understand more about how Dremio searches reflected data sets that are “live” and incrementing continuously, even though we only run reflections at set intervals.

To illustrate, if I have daily indices in my Elasticsearch cluster:

Index # 1. logstash-2019-04-26 (50GB, 85 million docs)
Index # 2. logstash-2019-04-27 (40GB, 75 million docs)
Index # 3. logstash-2019-04-28 (20GB, 30 million docs … and counting)

Because indices 1 and 2 are already “closed” any reflections created on those are permanent and dremio can setup the reflection for the entire dataset/index.

But If i have a job that runs the reflection at 6 AM for example of April 28, since the index is growing continuously, how does Dremio search the newer items that are not yet in the reflection for that index?

Will it:

  1. Only use reflections for indices 1 and 2, and when searching index 3 it will not use reflections as it is not yet complete

or

  1. It has already partially created a reflection of Index # 3, and will still use that to optimize queries, and then just search the remaining un-reflected records normally.

If it is #2, is iDremio actually keeping tabs of which records are already in the reflection? so at any given time, it knows which records are just newly created or updated?

Thanks very much.

Hi @rleyba

The answer slightly varies when it is a file system based source like ADLS/S3/HDFS/Hive Versud something we pushdown like Oracle or your case Elasticsearch

Lets take an example of a simple table in Hive called reftest like below

create table reftest (col1 int, col2 varchar(20));

Let us first insert two rows

insert into reftest values (1,‘A’);
insert into reftest values (2,‘B’);

When we create a reflection on top of this, it will use the reflection and return two rows

Now if we go and add 2 rows into Hive like below

insert into reftest values (3,‘C’);
insert into reftest values (4,‘D’);

If we run the query now Dremio will not know there are 2 additional files to read on HDFS, so if run the query on the same table we will still return only 2 rows and use the reflection. If we disable the reflection Dremio would still return only two 2 rows

Now if we go refresh the reflection and run the query again Dremio would still return 2 rows using the reflection

Now as a next step we have to run the below command,

“alter pds reftest refresh metadata”

Now if we do not use reflection, query will return all rows, if we do use the reflection and reflection has not been refreshed, Dremio would use the reflection and return only 2 rows

Refreshing the reflection and then running the query would use the reflection and return all the 4 rows

In case of a RDBMS like Oracle, NoSQL like MongoDB or Elasticsearh were we use push down, the behavior is slightly different

create table reftestora(col1 number, col2 varchar(20));

insert into reftestora values (1,‘A’);
insert into reftestora values (2,‘B’);

Now creating a reflection on reftestora will use the reflection and return both rows

Let us insert two more rows

insert into reftestora values (3,‘C’);
insert into reftestora values (4,‘D’);

If we disable the reflection and run the query, unlike Hive w do not require a PDS metadata refresh as this query will get pushed down and we will get all 4 rows. PDS metadata refresh is only needed if there is a schema level change

No if we turn on the reflection and run the query (reflection has not been refreshed) the query will use the reflection and return only 2 rows

Now refreshing the reflection and the running the query will return all 4 rows

Note: As stated earlier unlike file system based sources a PDS metadata refresh is not needed

So if we add 2 more rows like below

insert into reftestora values (5,‘E’);
insert into reftestora values (6,‘F’);

Now all we need to do is refresh the reflection (no PDS refresh) and the query should accelerate and return all 6 rows

Kindly let us know if you have any other questions

Thanks
@balaji.ramaswamy

2 Likes

Hi Balaji,

Thanks very much for your detailed explanation. So how I understand it is for Elasticsearch, if you are using reflections, the data returned is only as current as the last instance of the reflection “refresh”. In other words if my last refresh was at 6 AM today for my index, and I query it again later that evening, I still won’t see any new records.

I suppose this is reasonable, as it would be unrealistic to have Dremio be doing refreshes in real time.

What I can do then is to make my indices smaller, and run the refreshes more often, so I can get my results as current as possible.

Thanks Again.

Hi @rleyba

Yes, depending on your SLA and data freshness requirements you can tune the refresh period. You can also call the REST API to refresh the reflection right after the data has changed, see REST API documentation to REFRESH REFLECTION

Note: The {id} is the id of the PDS and not the reflection ID

Thanks
@balaji.ramaswamy