I’m trying to use accelerations to speed-up query execution time, but I have a hard time understanding what is going on. Here is my set-up:
Dremio runs on an industrial PC (windows 10) with 256 GB RAM + 500 GB drive
I have one NAS source which contains a collection of parquet files :
My fact table is 1 folder with 200 partitionned files (450 millions records), whose columns are essentially a bunch of IDs and a few measures.
Then I have 5/6 “decoding” tables that are generally fairly wide (up to 50 columns) and occasionally pretty long as well (the smallest has 200 records, the largest 30 millions)
I have created a logical view MY_VIEW with all the joins whose definition is something like:
SELECT facts., decod1., decod2., decod3., decod4.*
LEFT JOIN decod1 ON = facts.DECOD1_ID = decod1.ID
LEFT JOIN decod2 ON = facts.DECOD2_ID = decod2.ID
LEFT JOIN decod3 ON = facts.DECOD3_ID = decod3.ID
LEFT JOIN decod4 ON = facts.DECOD4_ID = decod4.ID
I have tried to accelerate this view but I ran out of memory (I know I’ll have to go for aggregate reflections, but I’m not sure yet which fields I should put in)
Then I tried to execute the following query:
SELECT dim1, sum(measure1) FROM MY_VIEW GROUP BY dim1
which runs in about 2 minutes. I was somehow in the hope that, since I have left joins everywhere, the query planner would do the necessary to keep IO as low as possible.
Now when I run the query not using the view:
SELECT decod1.dim1, sum(facts.measure1)
FROM facts LEFT JOIN decod1 ON facts.DECOD1_ID = decod1.ID
GROUP BY dim1
which runs in 6s
Then I thought it may be that the query planner can’t read properly my files metadata (because the parquet files have been generated using pyarrow), so I’d try to accelerate separately each table (de fined as a view SELECT * FROM facts for instance). It absolutely didn’t help improve the query time.
Hence my questions :
- I’ve noticed that the slow version of the query had many accelerations to choose from, but ultimately chose just one (qualifying the others as too expensive). However I would have guessed that using these accelerations, while expensive, whould generally cost less than reading the source file. Has Dremio really the ability to use several accelerations to serve a query (so far I’ve never seen that happen) as stated in your doc, or is it something on your roadmap ?
- Is it any useful to create one for one accelerations on my source data (in the end if I understand correctly Dremio acceleration create under the hood the same kind of parquet files, except that you may have forked early on from Arrow to match Dremio specific requirements) ?
- Side question by the way, do you use any compression codec for the accelerations ? Is it configurable ?
- And finally, do you understand why my initial query is that slow compared the other one (as they are functionally equivalent and yield the same result)
I enclose the query profile which I assume should be helpful if you have time to investigate
Romain7c01fddf-984e-4837-8eec-110def8ae800.zip (86.4 KB)