Dremio Reflection Incremental Refresh

Hi,
I’m testing the Dremio 3.2 reflections feature. According to the documentation, it’s now possible to use new data types as reference fields. During my tests, I ran into the following problem:

  1. First, I’ve created a reflection on my PDS. I set one DateTime field as my reference field:

Dremio successfully executed the following query:

SELECT `id`, `ultima_atualizacao` AS `$_dremio_$_update_$`
FROM `compra`
  2. However, when an incremental update runs, Dremio executes the following query:
SELECT `id`, `transacao`, `$_dremio_$_update_$`
FROM
    (SELECT 
        **ALL FIELDS IN THE TABLE**
    FROM `compra`) AS `compra`
WHERE `$_dremio_$_update_$` > TIMESTAMP '2019-05-17 16:31:20.000'

Is that right? Was Dremio supposed to do this? This query performs poorly: it does a full table scan, and only then does Dremio filter the new data. A better approach for this query would be:

SELECT **REFLECTION FIELDS**
    FROM `compra`
WHERE `reference_field_here` > TIMESTAMP '2019-05-17 16:31:20.000'

Is there a way to make Dremio execute a query like this one above?


I’ve noticed that the same behavior occurs with other databases, such as Redshift.


@Paulo_Vasconcellos, if you look at the Final Physical Plan in your final query profile, you’ll see that only the selected columns get pushed down into the final scan of the source table.

To add to Ben’s comment, our optimizer is able to efficiently plan queries like this. In this case, both the column selection and filter will get pushed down. The original SQL is not a good way to see what Dremio is actually doing behind the scenes.
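
To illustrate (a sketch based on the pushdown shape shown later in this thread, not the planner’s literal output): even though the logical SQL wraps the table in a subquery, the source receives a single statement with the filter already in its WHERE clause, roughly:

SELECT `id`, `transacao`, `$_dremio_$_update_$`
FROM
    (SELECT `id`, `transacao`, `ultima_atualizacao` AS `$_dremio_$_update_$`
    FROM `compra`) AS `compra`
WHERE `$_dremio_$_update_$` > TIMESTAMP '2019-05-17 16:31:20.000'
-- inner column list abbreviated here; in practice it may list every
-- table column, as a profile later in this thread shows

Whether the source then merges or materializes the derived table is up to the source database’s own optimizer; Dremio issues one statement either way.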

Hi @ben and @steven! Thanks for the reply.

I understand that Dremio will optimize the process of creating incremental refreshes, but don’t you think it creates a bottleneck in the data source by running queries like this? At the end of the day, Dremio will execute a query in the data source to return the new data, right? That query is the one I mentioned in this thread. For example, this is the visual plan of the query that Dremio executes in MySQL. Note that the source will first materialize the whole table and then execute another query on the result set. It’s a double full table scan:

[image: visual plan at the MySQL source for Dremio’s incremental refresh query]

This is the corresponding visual plan for the query I suggested:

[image: visual plan at the MySQL source for the suggested flat query]

This table is not a huge one (~26M rows), but the reflection is taking too long to refresh. In this scenario, I’m not interested in how Dremio optimizes the process behind the scenes (I’m sure Dremio does an excellent job once the new data arrives). My concern is how the data source is being impacted by incremental refreshes. What do you think?
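
For anyone who wants to reproduce this check, here is a minimal sketch, assuming direct MySQL access and the table and column names from this thread:

EXPLAIN
SELECT `id`, `transacao`, `$_dremio_$_update_$`
FROM
    (SELECT `id`, `transacao`, `ultima_atualizacao` AS `$_dremio_$_update_$`
    FROM `compra`) AS `compra`
WHERE `$_dremio_$_update_$` > TIMESTAMP '2019-05-17 16:31:20.000';
-- a row with select_type = DERIVED (table <derived2>) means MySQL
-- materializes the subquery before filtering; MySQL 5.7+ can sometimes
-- merge the derived table instead, which removes the second scan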

Thank you in advance


Hi, guys.
Any thoughts on this?

If you provide a profile for the query (https://www.dremio.com/tutorials/share-query-profile-dremio/) we can take a look to see if the correct pushdown is happening or not.

@doron here’s the reflection’s profile and the load materialization’s profile:

refresh_reflection.zip (15.8 KB)
load_materialization.zip (3.4 KB)

Hi @doron! Any findings on this issue?

Hi guys! Any findings?

Hi @Paulo_Vasconcellos,

In the REFRESH REFLECTION profile, go to the “Planning” tab, scroll down to the “Final Physical Transformation” window, and look at the JDBC_SUB_SCAN. It shows the query being executed in the source database. Specifically:

SELECT
transacao,
id,
$_dremio_$_update_$
FROM
(SELECT
id,
transacao,
data_pedido,
data_liberacao,
status,
valor_compra,
observacao_cliente,
engine_pagamento,
afiliacao,
item_estoque,
comprador,
tipo_pagamento,
qtd_atualizacoes_sistema_pagamentos,
tarifa_percentual_marketplace,
tarifa_fixa_marketplace,
data_alerta_vencimento_boleto,
email_comissoes_enviado,
email_comprador_enviado,
tarifa_engine_pagamento_calculada,
tarifa_fixa_cobrada_engine_pagamento,
tarifa_percentual_cobrada_engine_pagamento,
chave,
origem,
data_envio_renovacao_oferta,
email_renovacao_oferta_enviado,
data_retentiva_venda,
data_reenvio,
version,
afiliacao_por_indicacao_de_outro_produto,
codigo_externo,
retentativas_de_entrega,
metodo_pagamento,
numero_parcelas,
codigo_reimpressao_boleto,
ultima_atualizacao,
data_inclusaobi,
log_info,
id_recorrencia,
analise_instantanea,
ip_comprador,
enviou_pagamento,
email_compra_cancelada_enviado,
url_download,
origem_sck,
checkout_mode,
parcelamento_fixo,
valor_parcela,
valor_total,
identificacao_afiliacao,
id_widget_form,
id_exchange_order,
conversion_rate,
currency_code_from,
currency_code_to,
is_payment_captured,
date_payment_captured,
tem_afiliacoes_extras,
warranty_refund,
billet_expiration_date,
merchant_account,
date_chargeback,
date_refund,
ultima_atualizacao AS $_dremio_$_update_$
FROM
marketplace.compra) AS compra
WHERE
$_dremio_$_update_$ > TIMESTAMP '2019-06-10 16:20:01.000'

If you then take a look at the Query tab and check the JDBC_SUB_SCAN metrics, you’ll see the “setup time” is 15 minutes (!). This actually includes the time it takes the database to execute the pushdown query. The result of this query includes only 1,228 records. So, to answer your question, Dremio is pushing down the filter and returning a small result set, but for some reason it’s taking a long time. If you run the above pushdown directly against the database (outside of Dremio), does it take a long time?
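
For reference, one way to run that check (a sketch assuming mysql client access to the source; the inner column list from the profile is abbreviated here):

SELECT `id`, `transacao`, `$_dremio_$_update_$`
FROM
    (SELECT /* ...all columns from the profile above... */
        `id`, `transacao`, `ultima_atualizacao` AS `$_dremio_$_update_$`
    FROM marketplace.compra) AS `compra`
WHERE `$_dremio_$_update_$` > TIMESTAMP '2019-06-10 16:20:01.000';
-- the mysql client prints the elapsed time per statement, e.g.
-- "1228 rows in set (X.XX sec)", which you can compare to the 15 minutes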

Hi @ben! Indeed, the query takes about 15 minutes to complete. But if we look at the execution plan at the source, we’ll see that this query performs a full table scan, materializes the result, and then performs another full scan over the materialized table (you can see this plan in the earlier messages).

In this case, executing a full reflection refresh (~10 minutes) took less time than the incremental refresh.

A much better approach to this problem would be to issue the incremental query as follows:

SELECT **REFLECTION FIELDS**
    FROM `compra`
WHERE `reference_field_here` > TIMESTAMP '2019-05-17 16:31:20.000'

In this example, there’s no need for a full table scan of the source table. I know this might not be that simple to implement within Dremio, but I think it’s the better approach.
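
As a source-side mitigation (my assumption, not something Dremio requires): if the reference column is indexed, the flat form of the query can use an index range scan instead of a full scan:

-- hypothetical index on the reference field (MySQL syntax assumed)
CREATE INDEX idx_compra_ultima_atualizacao ON `compra` (`ultima_atualizacao`);

-- the flat query can then seek directly to the new rows
SELECT `id`, `transacao`, `ultima_atualizacao`
FROM `compra`
WHERE `ultima_atualizacao` > TIMESTAMP '2019-05-17 16:31:20.000';

Note that the index does not help as long as the pushdown wraps the table in a derived table that the source materializes first.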

In the cases you’ve mentioned in this thread, the SQL pushed down into the source evaluates such that only the required columns are included in the SELECT clause and the filter is applied.

I do not see a “full table scan” in any of the examples you gave.

Can you share an example profile of a query where you believe such a full table scan is occurring?