Hello, I am working with Dremio to mix an S3 data warehouse (parquet) with a postgres metadata database.
I built a view on these two data sources and set an acceleration to it, however this acceleration is almost never used even if it should.
This is the acceleration:
And this is the query, that takes almost 2 minutes every time: SELECT time_local, semantic, "value" FROM "Tera Test".data_with_semantics WHERE semantic = 'outdoor_temperature'
It is not accelerated even if I select a subset of the acceleration “display” fields and I filter on a partitioned field. Planner says that the acceleration does not match the query (Extract with Semantic (raw): considered, not matched.).
However, after many attempts, I found that this following query, that should give the same output, is accelerated and takes less than a second: SELECT * FROM ( SELECT time_local, semantic, "value" FROM "Tera Test".data_with_semantics ) "d" WHERE "d".semantic = 'outdoor_temperature'
The planner says that it matches and then use it.
First point, it might be better to sort the “semantic” field, rather than partition it. Sorted fields are used for filters, where-as partitions are more suited for group-bys.
Maybe give that a go first. If you’re still not getting the expected outcome, share the profile of the job that isn’t matching the reflection and I will take a look and see if anything is obvious why it’s not selecting.
It is worth noting that this second reflection (with sorting) took 1h 40m while the first (with partition) took 22m.
However, the first query (without subquery) is still not using any of the two reflections, while the second query (the one with subselect) is still being accelerated by the first reflection, and not by the newly created one, that is considered too expensive.
So, to recap, the point here is to understand why the reflection is selected only when I separate the “SELECT” part from the WHERE part by using a subquery.
Here the queries and the associated profiles, I run them through odbc (pyodbc) on the same machine where dremio runs.
1st query (not accelerated, 13m35s): SELECT time_local, semantic, "value" FROM "Tera Test".data_with_semantics WHERE semantic = 'outdoor_temperature' e90c0fb6-959c-4715-9c3d-fbc7fea790e6.zip (10.6 KB)
2nd query (accelerated, 18s) SELECT * FROM ( SELECT time_local, semantic, "value" FROM "Tera Test".data_with_semantics ) "d" WHERE "d".semantic = 'outdoor_temperature' 6ef816bf-3593-492c-acce-3b1851a64866.zip (6.7 KB)
Sure, here it is: 02cfd03b-6a2f-45c5-bfdc-89410cdb7e2f.zip (9.4 KB)
2nd query with no active reflections.
I can’t compare the time with the yesterday tests since a few things happened (we had to clear the s3 data) but the data structure is the same (we just renamed “project” column to “domain” in the postgres table).
Hi, if you mean the dremio view definition, here it is:
SELECT data.model_id as model_id, data.model_variable_code, TO_TIMESTAMP(data.time_local) as time_local, TO_TIMESTAMP(data.time_utc) as time_utc, data."value" AS "value", TO_DATE(data.dir0, '"dt="YYYY-MM-DD', 1) AS "date", LTRIM(data.dir1, 'device=') as device, mv.model_variable_id as model_variable_id, sem.semantic_id as semantic, vs.domain_id as domain FROM "DataScience S3"."carel-ds-tera-test".data as data, "DataScience Master".public.ds_model_variable as mv, "DataScience Master".public.ds_variable_semantic as vs, "DataScience Master".public.ds_semantic as sem WHERE data.model_id = mv.model_id and data.model_variable_code = mv.model_variable_code and mv.source_system_id = 0 and mv.model_variable_id = vs.variable_id and vs.semantic_id = sem.semantic_id
We have identified the issue and are working on a fix. In the mean-time the work-around seems to be turning on all display attributes for the reflection that isn’t getting chosen.
@christy, do you have an update on this issue? The problem with the workaround is that (in my case) by adding all columns I end up with a significantly larger reflection and performance is not significantly improved (from 200MB when choosing only the columns I need to 10GB when using the workaround - all columns).