Dremio Reflection Incremental Refresh

Hi,
I’m testing the Dremio 3.2 reflections feature. According to the documentation, it’s now possible to use new data types as reference fields. During my tests, I ran into the following problem:

  1. First, I’ve created a reflection on my PDS. I set one DateTime field as my reference field:

Dremio successfully executed the following query:

SELECT `id`, `ultima_atualizacao` AS `$_dremio_$_update_$`
FROM `compra`
  2. However, when an incremental update runs, Dremio executes the following query:
SELECT `id`, `transacao`, `$_dremio_$_update_$`
FROM
    (SELECT 
        **ALL FIELDS IN THE TABLE**
    FROM `compra`) AS `compra`
WHERE `$_dremio_$_update_$` > TIMESTAMP '2019-05-17 16:31:20.000'

Is that right? Was Dremio supposed to do this? This query performs poorly: it does a full table scan, and only then does Dremio filter the new data. A better approach for this query would be:

SELECT **REFLECTION FIELDS**
    FROM `compra`
WHERE `reference_field_here` > TIMESTAMP '2019-05-17 16:31:20.000'

Is there a way to make Dremio execute a query like this one above?


I’ve noticed that the same behavior occurs with other databases, such as Redshift.


@Paulo_Vasconcellos, if you look at the Final Physical Plan in your final query profile, you’ll see that only the selected columns get pushed down into the final scan of the source table.

To add to Ben’s comment, our optimizer is able to efficiently plan queries like this. In this case, both the column selection and filter will get pushed down. The original SQL is not a good way to see what Dremio is actually doing behind the scenes.
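
To illustrate (a sketch based on the pushdown shape shown later in this thread, not the planner’s literal output): even though the logical SQL wraps the table in a subquery, the source receives a single statement with the filter already in its WHERE clause, roughly:

SELECT `id`, `transacao`, `$_dremio_$_update_$`
FROM
    (SELECT `id`, `transacao`, `ultima_atualizacao` AS `$_dremio_$_update_$`
    FROM `compra`) AS `compra`
WHERE `$_dremio_$_update_$` > TIMESTAMP '2019-05-17 16:31:20.000'
-- inner column list abbreviated here; in practice it may list every
-- table column, as a profile later in this thread shows

Whether the source then merges or materializes the derived table is up to the source database’s own optimizer; Dremio issues one statement either way.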

Hi @ben and @steven! Thanks for the reply.

I understand that Dremio will optimize the process of creating incremental refreshes, but don’t you think it creates a bottleneck in the data source by running queries like this? At the end of the day, Dremio will execute a query in the data source to return the new data, right? That query is the one I mentioned in this thread. For example, this is the visual plan of the query that Dremio executes in MySQL. Note that the source will first materialize the whole table and then execute another query on the result set. It’s a double full table scan:

[image: visual plan at the MySQL source for Dremio’s incremental refresh query]

This is the corresponding visual plan for the query I suggested:

[image: visual plan at the MySQL source for the suggested flat query]

This table is not a huge one (~26M rows), but the reflection is taking too long to refresh. In this scenario, I’m not interested in how Dremio optimizes the process behind the scenes (I’m sure Dremio does an excellent job once the new data arrives). My concern is how the data source is being impacted by incremental refreshes. What do you think?
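
For anyone who wants to reproduce this check, here is a minimal sketch, assuming direct MySQL access and the table and column names from this thread:

EXPLAIN
SELECT `id`, `transacao`, `$_dremio_$_update_$`
FROM
    (SELECT `id`, `transacao`, `ultima_atualizacao` AS `$_dremio_$_update_$`
    FROM `compra`) AS `compra`
WHERE `$_dremio_$_update_$` > TIMESTAMP '2019-05-17 16:31:20.000';
-- a row with select_type = DERIVED (table <derived2>) means MySQL
-- materializes the subquery before filtering; MySQL 5.7+ can sometimes
-- merge the derived table instead, which removes the second scan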

Thank you in advance


Hi, guys.
Any thoughts on this?

If you provide a profile for the query (https://www.dremio.com/tutorials/share-query-profile-dremio/) we can take a look to see if the correct pushdown is happening or not.

@doron here’s the reflection’s profile and the load materialization’s profile:

refresh_reflection.zip (15.8 KB)
load_materialization.zip (3.4 KB)

Hi @doron! Any findings on this issue?

Hi guys! Any findings?

Hi @Paulo_Vasconcellos,

In the REFRESH REFLECTION profile, go to the “Planning” tab, scroll down to the “Final Physical Transformation” window, and look at the JDBC_SUB_SCAN. It shows the query being executed in the source database. Specifically:

SELECT
transacao,
id,
$_dremio_$_update_$
FROM
(SELECT
id,
transacao,
data_pedido,
data_liberacao,
status,
valor_compra,
observacao_cliente,
engine_pagamento,
afiliacao,
item_estoque,
comprador,
tipo_pagamento,
qtd_atualizacoes_sistema_pagamentos,
tarifa_percentual_marketplace,
tarifa_fixa_marketplace,
data_alerta_vencimento_boleto,
email_comissoes_enviado,
email_comprador_enviado,
tarifa_engine_pagamento_calculada,
tarifa_fixa_cobrada_engine_pagamento,
tarifa_percentual_cobrada_engine_pagamento,
chave,
origem,
data_envio_renovacao_oferta,
email_renovacao_oferta_enviado,
data_retentiva_venda,
data_reenvio,
version,
afiliacao_por_indicacao_de_outro_produto,
codigo_externo,
retentativas_de_entrega,
metodo_pagamento,
numero_parcelas,
codigo_reimpressao_boleto,
ultima_atualizacao,
data_inclusaobi,
log_info,
id_recorrencia,
analise_instantanea,
ip_comprador,
enviou_pagamento,
email_compra_cancelada_enviado,
url_download,
origem_sck,
checkout_mode,
parcelamento_fixo,
valor_parcela,
valor_total,
identificacao_afiliacao,
id_widget_form,
id_exchange_order,
conversion_rate,
currency_code_from,
currency_code_to,
is_payment_captured,
date_payment_captured,
tem_afiliacoes_extras,
warranty_refund,
billet_expiration_date,
merchant_account,
date_chargeback,
date_refund,
ultima_atualizacao AS $_dremio_$_update_$
FROM
marketplace.compra) AS compra
WHERE
$_dremio_$_update_$ > TIMESTAMP '2019-06-10 16:20:01.000'

If you then take a look at the Query tab and check the JDBC_SUB_SCAN metrics, you’ll see the “setup time” is 15 minutes (!). This actually includes the time it takes the database to execute the pushdown query. The result of this query includes only 1,228 records. So, to answer your question, Dremio is pushing down the filter and returning a small result set, but for some reason it’s taking a long time. If you run the above pushdown directly against the database (outside of Dremio), does it take a long time?
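
For reference, one way to run that check (a sketch assuming mysql client access to the source; the inner column list from the profile is abbreviated here):

SELECT `id`, `transacao`, `$_dremio_$_update_$`
FROM
    (SELECT /* ...all columns from the profile above... */
        `id`, `transacao`, `ultima_atualizacao` AS `$_dremio_$_update_$`
    FROM marketplace.compra) AS `compra`
WHERE `$_dremio_$_update_$` > TIMESTAMP '2019-06-10 16:20:01.000';
-- the mysql client prints the elapsed time per statement, e.g.
-- "1228 rows in set (X.XX sec)", which you can compare to the 15 minutes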

Hi @ben! Indeed, the query takes about 15 minutes to complete. But if we look at the execution plan at the source, we’ll see that this query performs a full table scan, materializes the result, and then performs another full scan over the materialized table (you can see this plan in the earlier messages).

In this case, executing a full reflection refresh (~10 minutes) took less time than the incremental refresh.

A much better approach to this problem would be to issue the incremental query as follows:

SELECT **REFLECTION FIELDS**
    FROM `compra`
WHERE `reference_field_here` > TIMESTAMP '2019-05-17 16:31:20.000'

In this example, there’s no need for a full table scan of the source table. I know this might not be that simple to implement within Dremio, but I think it’s the better approach.
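
As a source-side mitigation (my assumption, not something Dremio requires): if the reference column is indexed, the flat form of the query can use an index range scan instead of a full scan:

-- hypothetical index on the reference field (MySQL syntax assumed)
CREATE INDEX idx_compra_ultima_atualizacao ON `compra` (`ultima_atualizacao`);

-- the flat query can then seek directly to the new rows
SELECT `id`, `transacao`, `ultima_atualizacao`
FROM `compra`
WHERE `ultima_atualizacao` > TIMESTAMP '2019-05-17 16:31:20.000';

Note that the index does not help as long as the pushdown wraps the table in a derived table that the source materializes first.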

In the cases you’ve mentioned in this thread, the SQL pushed down into the source evaluates such that only the required columns are included in the SELECT clause and the filter is applied.

I do not see a “full table scan” in any of the examples you gave.

Can you share an example profile of a query where you believe such a full table scan is occurring?