I have daily files from the last 10 years from multiple sources which I am joining in Dremio.
The directory structure of each is Name/Year/Month/DailyFiles.parquet
I tried promoting the root (i.e. Name) of each and doing one big join, but the performance was unusable, even when trying to build reflections (the reflection build would take hours, spill to disk, and run out of memory).
I’ve since split the data up, promoting at the Year level and I am happy with the performance of querying each output vds (reflections working nicely).
I have set up a VDS which UNION ALLs each of the resulting (year-joined) datasets, but when I query against this (either for all dates, or even a small range of dates), none of the aggregate reflections on the underlying datasets are being hit; instead it hits the raw reflections on each, which is considerably slower.
This raises three questions:
- Is there a way to configure Dremio to make it aware of unions, such that a particular dataset doesn't need to be scanned if the date isn't in its range?
- Why is it missing each of the aggregate reflections on the underlying VDSes? The profile states "Did not cover query", but I am GROUPing on the only 2 dimensions that are in the VDS aggregate reflection (date being one of them). The same query on one of the single datasets referenced in the UNION ALL DOES hit the reflection and returns quickly.
- I'm seeing "Arrow Flight Client (create prepared statement)" taking ~10 s to do the planning now. Is this a result of each of the underlying datasets needing to be planned separately?
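For reference, the failing queries have roughly this shape (space, VDS, and measure names here are hypothetical placeholders, not my real ones). This same shape hits the aggregate reflection when pointed at a single year's dataset, but not when pointed at the union VDS:

```sql
-- GROUP BY on exactly the two reflection dimensions ("date" and d),
-- with SUMs matching the reflection's measures.
SELECT "date", d, SUM(v1) AS sum_v1, SUM(v2) AS sum_v2
FROM MySpace.UnionVds
WHERE "date" BETWEEN '2018-03-01' AND '2018-03-31'
GROUP BY "date", d
```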
Example dataset setup:
- ASTATIC_2017 (A_2017 joined with STATIC)
- ASTATIC_2018 ("")
- ASTATIC_2019 ("")
- RESULT_2017 (ASTATIC_2017 joined with B_2017)
- RESULT_2018 ("")
- RESULT_2019 ("")
- FINAL (RESULT_2017 UNION ALL RESULT_2018 UNION ALL RESULT_2019).
The RESULT_* datasets all have aggregate reflections on date, plus one additional dimension d (sorted by date), with multiple SUM aggregations.
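The layering above can be sketched roughly like this in Dremio SQL (the space name and join keys are assumptions on my part; the per-year pair of VDS definitions is repeated for each year):

```sql
-- Per-year join layer (dataset names from the example above; keys assumed):
CREATE VDS MySpace.ASTATIC_2017 AS
SELECT * FROM MySpace.A_2017 a
INNER JOIN MySpace.STATIC s ON a.id = s.id;

CREATE VDS MySpace.RESULT_2017 AS
SELECT * FROM MySpace.ASTATIC_2017 x
INNER JOIN MySpace.B_2017 b ON x."date" = b."date";

-- Union layer over all the per-year results:
CREATE VDS MySpace.FINAL AS
SELECT * FROM MySpace.RESULT_2017
UNION ALL SELECT * FROM MySpace.RESULT_2018
UNION ALL SELECT * FROM MySpace.RESULT_2019;
```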
I have tried changing the UNION ALL to be:
… UNION ALL SELECT * FROM RESULT_2018 WHERE date BETWEEN '2018-01-01' AND '2018-12-31' …
But this does not seem to help.
@Saltxwater Are you able to provide the job profile that did not use the Aggregate reflection?
33a5e8d6-edd2-4683-aec9-e8fc2523850a.zip (2.6 MB)
2f893f5a-2315-44e8-ae7a-38e6ae8dd50e.zip (217.4 KB)
Here are 2 query profiles showing the aggregate reflections not being hit with UNION
2f89… is a query on a single year, and hits the aggregate reflection “EwmaVar_v4_2010_Joined_Prism_Fund”
33a5… is a query which unions the same year as above with multiple other years. In this case, "EwmaVar_v4_2010_Joined_Raw" (my raw reflection) is hit, as well as multiple other raw reflections for other years. Only 2018 and 2019 used the faster aggregate reflections.
Through repeated testing it seems that with only a few years in the UNION, the expected aggregate reflection is hit. Past a certain number of years, though, the aggregate reflections are ignored and raw reflections are used instead. When the raw reflections are used the query takes significantly longer (we ran one on 11 years of data and it took 27 minutes to complete), whereas querying the years individually and hitting the aggregate reflections takes ~19 seconds.
Not directly related, the planning time when using UNION is very long, often around 30 seconds, whereas planning for a single year is just a couple of seconds.
I do have verbose query profiles available but these are too large for the 4MB maximum upload. I will be happy to share them with Dremio support over email if requested. (pm me).
@Saltxwater The issue with "33a5" is that you are hitting a "Substitution terminated after timeout of 30000 millisecond", which may have happened before the Agg got substituted. There are many reflections that match this query and the substitution process takes > 30 s. I also see increased planning times due to the many reflections. Here is a white paper that talks about reflection best practices.
Thank you for explaining that. I only have 2 aggregate reflections (and 1 raw reflection), but they are defined on each of the years' datasets. And given that the 30 s substitution limit applies to the query as a whole, this only gives a couple of seconds per underlying dataset (each individual union'ed set).
All the datasets and reflections are identical but being applied on a specific year’s worth of data. Are there any white papers going into the best way to set up daily data like this?
Joining a year's worth of data with another year's worth on a given date (and other identifier) seemed to take 4 times as long as doing 6 months' worth, so my current setup doesn't scale nicely. We're even considering going deeper than annually promoted datasets, potentially to six-monthly or quarterly, but then the union problem would be even worse.
@Saltxwater Is it not possible to avoid the UNION but instead have a partition column and prune based on the filter condition?
If you could point me at an example of how this would work then I’ll happily look at trying it and report my success!
Would this involve promoting the full directory of underlying files rather than the years?
We’re building up the datasets from some JOINs on dates (+ other criteria).
The datasets are currently set up as TYPE / YEAR / MONTH / DATE_TYPE.parquet
In which case it would become:
FROM Promoted.A a
INNER JOIN Promoted.B b ON a.dir0 = b.dir0 AND a.dir1 = b.dir1 AND a.date = b.date
I tried this previously and the overall performance seemed to be much slower than doing the joins by separate year (this ties back to the point above that 12 months of data seemed to take 4 times as long as 6 months).
But if you’d expect this approach to have the expected scalable performance then I’ll happily revisit and let you know what I find.
What would the partition column be at this point? The dir0, dir1 and date?
Our queries are all specifying a date at the moment but we could add in dir0, dir1 if that helps dremio.
Or perhaps I’m missing your point about partition columns and there’s something I don’t understand?
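If I've understood the suggestion, the single-promotion version would look something like this (column names beyond dir0/dir1/date are hypothetical), with the filters on the directory columns giving Dremio something to prune partitions on:

```sql
-- Single promoted datasets at the root; dir0 = Year, dir1 = Month.
-- Filtering both sides on dir0 should allow partition pruning so that
-- only the matching year's files are scanned.
SELECT a.*, b.v1
FROM Promoted.A a
INNER JOIN Promoted.B b
  ON a.dir0 = b.dir0 AND a.dir1 = b.dir1 AND a."date" = b."date"
WHERE a.dir0 = '2018' AND b.dir0 = '2018'
  AND a."date" BETWEEN '2018-06-01' AND '2018-06-30'
```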
Thank you for your help @balaji.ramaswamy
Aside from the scalability of reflection building not being what we'd hoped for, this approach also becomes more costly when we add new daily data: the full reflections would need to be rebuilt, since the presence of the JOIN prevents incremental refresh. With the current per-year setup the cost is only one year's worth of reflection building, which is relatively quick.
Our current workaround is to query individual years and UNION the results on the client side. This results in ~10 queries just for a particular dataset, but the overhead of this is much less than the cost of planning plus potentially missing an aggregate reflection.
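Concretely, the client issues one statement per year and appends the row sets itself (names hypothetical, as before):

```sql
-- Issued as separate statements, one per year, instead of one UNION ALL VDS;
-- each one can hit its own year's aggregate reflection.
SELECT "date", d, SUM(v1) AS sum_v1
FROM MySpace.RESULT_2017
GROUP BY "date", d;

SELECT "date", d, SUM(v1) AS sum_v1
FROM MySpace.RESULT_2018
GROUP BY "date", d;
-- ...and so on for each year; the client concatenates the results.
```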
Would you say this is a reasonable workaround we can use going forward?