Reflection used only when sql splitted in subqueries

Hello, I am working with Dremio to mix an S3 data warehouse (parquet) with a postgres metadata database.
I built a view on these two data sources and set an acceleration to it, however this acceleration is almost never used even if it should.
This is the acceleration:
screenshot-localhost%2019047-2018-06-04-11-00-36
And this is the query, that takes almost 2 minutes every time:
SELECT time_local, semantic, "value" FROM "Tera Test".data_with_semantics WHERE semantic = 'outdoor_temperature'
It is not accelerated even if I select a subset of the acceleration “display” fields and I filter on a partitioned field. Planner says that the acceleration does not match the query (Extract with Semantic (raw): considered, not matched.).

However, after many attempts, I found that this following query, that should give the same output, is accelerated and takes less than a second:
SELECT * FROM ( SELECT time_local, semantic, "value" FROM "Tera Test".data_with_semantics ) "d" WHERE "d".semantic = 'outdoor_temperature'
The planner says that it matches and then use it.

Why is this?
Thanks

Hey Luca,

First point, it might be better to sort the “semantic” field, rather than partition it. Sorted fields are used for filters, where-as partitions are more suited for group-bys.

Maybe give that a go first. If you’re still not getting the expected outcome, share the profile of the job that isn’t matching the reflection and I will take a look and see if anything is obvious why it’s not selecting.

Christy

Thank you for your reply.
I built a second reflection with sorting option on “semantic” field so now I have two distinct raw reflections:


It is worth noting that this second reflection (with sorting) took 1h 40m while the first (with partition) took 22m.
However, the first query (without subquery) is still not using any of the two reflections, while the second query (the one with subselect) is still being accelerated by the first reflection, and not by the newly created one, that is considered too expensive.

So, to recap, the point here is to understand why the reflection is selected only when I separate the “SELECT” part from the WHERE part by using a subquery.
Here the queries and the associated profiles, I run them through odbc (pyodbc) on the same machine where dremio runs.

1st query (not accelerated, 13m35s):
SELECT time_local, semantic, "value" FROM "Tera Test".data_with_semantics WHERE semantic = 'outdoor_temperature'
e90c0fb6-959c-4715-9c3d-fbc7fea790e6.zip (10.6 KB)

2nd query (accelerated, 18s)
SELECT * FROM ( SELECT time_local, semantic, "value" FROM "Tera Test".data_with_semantics ) "d" WHERE "d".semantic = 'outdoor_temperature'
6ef816bf-3593-492c-acce-3b1851a64866.zip (6.7 KB)

Thank you

1 Like

Thanks for supplying the profiles. I’ll take a look and let you know what I find.

Hey Luca,

Can you disable the raw reflection, re-run the 2nd query and upload that profile too.

Thanks

Christy

Sure, here it is:
02cfd03b-6a2f-45c5-bfdc-89410cdb7e2f.zip (9.4 KB)
2nd query with no active reflections.
I can’t compare the time with the yesterday tests since a few things happened (we had to clear the s3 data) but the data structure is the same (we just renamed “project” column to “domain” in the postgres table).

Thanks

Hey Luca,

Thanks, we’re taking a look to try to understand what’s going on.

I’ll keep you updated.

Christy

Hey Luca,

Would you mind sharing the virtual data set definitions used for the query? Support are saying they need these to diagnose the issue.

Christy

Hi, if you mean the dremio view definition, here it is:

SELECT data.model_id as model_id, data.model_variable_code, TO_TIMESTAMP(data.time_local) as time_local, TO_TIMESTAMP(data.time_utc) as time_utc, data."value" AS "value", TO_DATE(data.dir0, '"dt="YYYY-MM-DD', 1) AS "date", LTRIM(data.dir1, 'device=') as device, mv.model_variable_id as model_variable_id, sem.semantic_id as semantic, vs.domain_id as domain FROM "DataScience S3"."carel-ds-tera-test".data as data, "DataScience Master".public.ds_model_variable as mv, "DataScience Master".public.ds_variable_semantic as vs, "DataScience Master".public.ds_semantic as sem WHERE data.model_id = mv.model_id and data.model_variable_code = mv.model_variable_code and mv.source_system_id = 0 and mv.model_variable_id = vs.variable_id and vs.semantic_id = sem.semantic_id

Thanks. I’ll keep you posted!

Hey Luca,

We have identified the issue and are working on a fix. In the mean-time the work-around seems to be turning on all display attributes for the reflection that isn’t getting chosen.

Hope this helps

Christy

@christy, do you have an update on this issue? The problem with the workaround is that (in my case) by adding all columns I end up with a significantly larger reflection and performance is not significantly improved (from 200MB when choosing only the columns I need to 10GB when using the workaround - all columns).

Having the same issue.