I’m testing out Dremio’s applicability for our use case and would like some advice on how to set up our various datasets and reflections.
The source data are Parquet files, one file per day per dataset, with 2 or 3 datasets. (Currently we’re rebuilding these as we work out what we need, what format, etc., but long term they would be immutable and append-only.)
- I’ve set these to INCREMENTAL refresh, but is this necessary, or would Dremio detect this automatically?
- I’ve added raw reflections on these with partition field “date” (a rough sketch of what I set up is below this list).
- One (PD1) is ~30,000,000 rows / ~100 GB covering 10 years of data.
- Another (PD2) is ~60,000,000 rows / ~20 GB covering 1 year of data.
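I set the reflections up in the UI, but the equivalent reflection SQL would be roughly the following (the dataset path and the extra display columns are placeholders for ours, and I’m writing the syntax from memory, so it may differ slightly by Dremio version):

```sql
-- Raw reflection on one of the daily Parquet datasets, partitioned on the date column.
-- "datalake"."pd1", the reflection name and the non-date columns are placeholders;
-- only "date" is the real field name.
ALTER DATASET "datalake"."pd1"
CREATE RAW REFLECTION pd1_raw
USING DISPLAY ("date", account_id, amount)
PARTITION BY ("date");
```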
We also have an MSSQL source (PD3). This data can be updated, but the tables themselves are only ~10,000 rows, so I’m not too concerned about these.
Downstream from these physical datasets I’ve constructed some virtual datasets that join them in various ways and perform aggregations, along the lines of the simplified example below.
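This is a made-up, simplified version of one such VDS just to illustrate the shape of it (the space, view name, join keys and measure columns are all placeholders, not our real schema):

```sql
-- Hypothetical VDS joining the two Parquet-backed datasets and aggregating per day and account.
CREATE VDS "analytics"."pd1_pd2_daily" AS
SELECT
  p1."date",
  p1.account_id,
  SUM(p2.amount) AS total_amount,
  COUNT(*)       AS row_count
FROM "datalake"."pd1" p1
JOIN "datalake"."pd2" p2
  ON  p2.account_id = p1.account_id
  AND p2."date"     = p1."date"
GROUP BY p1."date", p1.account_id;
```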
In my initial testing, a raw reflection on this VDS takes more than an hour to build when joining PD1 with PD2 (and PD2 is going to get much larger as we accumulate more years of data!).
If this is just a one-off whenever we change the VDS then it’s no problem… but is this reflection going to need a refresh every day, i.e. every time we add new daily files to PD1 and PD2?
I assume it will, because of this note in the docs: “Reflections on virtual datasets that include joins cannot be incrementally updated. Dremio falls back to using full refresh for these datasets.”
We would ultimately want more VDSs downstream of this, joining with PD3…
What do you recommend for this case? How could I better structure our setup?
As a slight aside, we’d also like PIVOT support, but from what I’ve read that isn’t available yet!
Thanks for your help!