I have a dataset that currently contains 16 billion records, and that’s only year-to-date data. The data is stored in Parquet files in a partitioned directory structure, as suggested in the Dremio docs.
As you can probably imagine, I’m looking to aggregate this data to an hourly level to make it easier to work with. The aggregation works fine on subsets of the data, but I’m unable to create a reflection that contains the full aggregation. I tried setting the reflection refresh to incremental, but I kept getting out-of-memory errors, so I gave up.
My current workaround is a Python script that first prepares SQL queries for smaller chunks of the dataset, then sends a batch of concurrent queries via turbodbc and stores the results as new Parquet files, which then become a new data source in Dremio.
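To make that workaround concrete, here’s a rough sketch of what my script does. Note the specifics are assumptions for illustration: I’m pretending dir0/dir1/dir2 map to year/month/day, that the DSN is called "Dremio", and that one chunk per day is the right granularity.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def daily_queries(start, end, table="dataset"):
    """Build one aggregation query per day, assuming the partition
    directories dir0/dir1/dir2 correspond to year/month/day."""
    queries = []
    d = start
    while d <= end:
        queries.append(
            "SELECT dir0, dir1, dir2, dimension1, dimension2, "
            "COUNT(DISTINCT \"value\") AS cnt "
            f"FROM {table} "
            f"WHERE dir0 = '{d.year}' AND dir1 = '{d.month:02d}' "
            f"AND dir2 = '{d.day:02d}' "
            "GROUP BY dir0, dir1, dir2, dimension1, dimension2"
        )
        d += timedelta(days=1)
    return queries


def run_chunk(sql, out_path, dsn="Dremio"):
    """Run one chunk against Dremio via turbodbc and store the result
    as a Parquet file (requires turbodbc and pyarrow installed)."""
    import turbodbc
    import pyarrow.parquet as pq

    conn = turbodbc.connect(dsn=dsn)
    try:
        cursor = conn.cursor()
        cursor.execute(sql)
        table = cursor.fetchallarrow()  # result set as an Arrow table
    finally:
        conn.close()
    pq.write_table(table, out_path)


def run_all(start, end, out_dir, workers=4):
    """Fan the daily chunks out over a small thread pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for i, sql in enumerate(daily_queries(start, end)):
            pool.submit(run_chunk, sql, f"{out_dir}/chunk_{i:04d}.parquet")
```

The nice part of turbodbc here is `fetchallarrow()`, which hands back the result set as an Arrow table that pyarrow can write straight to Parquet.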
I read another question earlier tonight where Kelly from Dremio stated:
Dremio includes a vectorized Parquet reader that reads directly to Arrow, so you should see some nice benefits when Parquet is the source if you’re seeing reads as the bottleneck (assuming I/O isn’t the limiting factor, of course).
This made me wonder if the out-of-memory errors I’m getting are because the Parquet reader is trying to ingest everything at the same time? (I’m guessing you’ll probably want some query profiles; I’ll see what I can do.)
Next, I started thinking that it would be great to have a little more control over how the query is performed. In my example, I first aggregated the data from seconds to minutes, and then to hours, so I can easily run the queries on daily subsets without missing any data. One idea I have now would be to add a clause to my SQL that tells the query planner how to run the query in parallel, maybe something like this:
select
dir0,
dir1,
dir2,
dimension1,
dimension2,
count(distinct value)
from dataset
group by
dir0,
dir1,
dir2,
dimension1,
dimension2,
concurrent by
dir0,
dir1,
dir2,
dimension1
It wouldn’t surprise me one bit if some version of this is already being done, but I think it would be a nice addition!
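For what it’s worth, something like my proposed "concurrent by" can be approximated client-side today: first fetch the distinct combinations of the concurrency keys, then issue one GROUP BY query per combination. A rough sketch (the key list and table name are just placeholders):

```python
def fanout_queries(keys, table="dataset"):
    """Build one aggregation query per (dir0, dir1, dir2, dimension1)
    combination, mimicking a hypothetical 'concurrent by' clause by
    splitting the work along those columns."""
    template = (
        "SELECT dir0, dir1, dir2, dimension1, dimension2, "
        "COUNT(DISTINCT \"value\") AS cnt "
        "FROM {table} "
        "WHERE dir0 = '{d0}' AND dir1 = '{d1}' AND dir2 = '{d2}' "
        "AND dimension1 = '{dim1}' "
        "GROUP BY dir0, dir1, dir2, dimension1, dimension2"
    )
    return [
        template.format(table=table, d0=d0, d1=d1, d2=d2, dim1=dim1)
        for d0, d1, d2, dim1 in keys
    ]
```

Because every "concurrent by" column also appears in the GROUP BY, the chunk results are disjoint, so simply concatenating them gives the same rows as the one big query would.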
/Niclas