Dremio Reflection Incremental Refresh

Hi
I’m testing Dremio 3.2 reflection feature. According to the documentation, now it’s possible to use new data types as reference fields. During my tests, I had the following problem:

  1. First, I’ve created a reflection on my PDS. I set one DateTime field as my reference field:

Dremio executed successfully the following query:

SELECT `id`, `ultima_atualizacao` AS `$_dremio_$_update_$`
FROM `compra`
  1. However, when an incremental update is running, Dremio executes the following query:
SELECT `id`, `transacao`, `$_dremio_$_update_$`
FROM
    (SELECT 
        **ALL FIELDS IN THE TABLE**
    FROM `compra`) AS `compra`
WHERE `$_dremio_$_update_$` > TIMESTAMP '2019-05-17 16:31:20.000'

Is that right? Was Dremio supposed to do this? This is a poor performance query. It performs a full table scan and, only then, Dremio filter new data. For this query, a better approach would be:

SELECT **REFLECTION FIELDS**
    FROM `compra`
WHERE `reference_field_here` > TIMESTAMP '2019-05-17 16:31:20.000'

Is there a way to make Dremio execute a query like this one above?

3 Likes

I’ve notice that the same behavior occurs for other databases like Redshift.

1 Like

@Paulo_Vasconcellos, if you look at the Final Physical Plan in your final query profile, you’ll see that only the selected columns get pushed down into the final scan of the source table.

To add to Ben’s comment, our optimizer is able to efficiently plan queries like this. In this case, both the column selection and filter will get pushed down. The original SQL is not a good way to see what Dremio is actually doing behind the scenes.

Hi @ben and @steven! Thanks for the reply.

I understand that Dremio will optimize the process to create incremental refreshes, but don’t you think it creates a bottleneck in the data source by running queries like this? I mean, at the end of the day, Dremio will execute a query in the data source to return the new data, right? This query is the one that I mentioned in this thread. For example, this is a visual plan of the query that Dremio is executing in MySQL. Note that Dremio will first materialize the whole table and then execute another query on the result set. It’s a double full table scan:

image

This is the same visual plan for the query that I suggested:

image

This table is not a huge one (~26M rows), but it’s taking too long to refresh the reflection. In this scenario, I’m not interested to know how Dremio will optimize the process behind the scenes (I’m sure that Dremio will execute an excellent job when the new data arrives). My concern is how the data source is been impacted by incremental refreshes. What do you think?

Thank you in advance

1 Like

Hi, guys.
Any thoughts on this?

If you provide a profile for the query (https://www.dremio.com/tutorials/share-query-profile-dremio/) we can take a look to see if the correct pushdown is happening or not.

@doron here’s the reflection’s profile and the load materialization’s profile:

refresh_reflection.zip (15.8 KB)
load_materialization.zip (3.4 KB)

Hi @doron! Any findings on this issue?