For lack of a better description, I’m looking for ‘streaming union’ functionality in Dremio.
Let’s say I have data from N sensors which I save on S3 in Parquet format. The data from all sensors is sorted by event time and shares the same schema. Each night new sensor data is loaded in batches for all sensors (append-only). I would probably have a directory structure on S3 with a VDS defined on top of it. (Either one VDS for all sensors with an extra partition on sensor type, or one VDS per sensor type; I’m not sure yet.)
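For concreteness, this is a minimal sketch of the Hive-style partitioned layout I have in mind; the dataset name, partition columns, and file naming are my own illustrative assumptions, not anything Dremio mandates:

```python
# Hypothetical S3 key layout for the nightly batches. Partitioning on
# sensor_type (and load date) is an assumption for illustration only.
def parquet_key(sensor_type: str, date: str, part: int) -> str:
    """Build a Hive-style partitioned key, the layout most engines
    (Dremio included) can use for partition pruning."""
    return f"sensor_data/sensor_type={sensor_type}/date={date}/part-{part}.parquet"

print(parquet_key("temperature", "2024-01-01", 0))
# sensor_data/sensor_type=temperature/date=2024-01-01/part-0.parquet
```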
Given a specific date-range filter, I would often like to ‘union all’ two or more sensor event streams into one, where the resulting stream maintains the same event-time order. In other words: multiple streams come in already sorted by event time, and the output is the interleaved union of their events, also sorted by event time.
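To be explicit about what I mean by ‘streaming union’: it is the lazy k-way merge of already-sorted streams, which the Python standard library happens to demonstrate nicely (the streams and key here are just illustrative):

```python
import heapq
from operator import itemgetter

# Two event streams, each already sorted by event time (first tuple field).
stream_a = [(1, "a1"), (4, "a2"), (9, "a3")]
stream_b = [(2, "b1"), (3, "b2"), (8, "b3")]

# heapq.merge is lazy: it holds only one pending element per input stream,
# never materializing the full inputs -- exactly the behaviour I'm asking
# whether Dremio can exploit.
merged = list(heapq.merge(stream_a, stream_b, key=itemgetter(0)))
print(merged)
# [(1, 'a1'), (2, 'b1'), (3, 'b2'), (4, 'a2'), (8, 'b3'), (9, 'a3')]
```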
Can Dremio handle this case efficiently? Specifically:
- Given that all N inputs and the desired output share the same ordering, the ‘streaming union’ can be implemented as a merge that never needs the entire inputs in memory. Does Dremio take advantage of that?
- How does Dremio know that my raw data is already ordered by event time? Is that a property I need to communicate explicitly?
- If I were to create record batches (or rather the Parquet equivalent, row groups) per hour, say, could Dremio use that knowledge to effectively prune the inputs based on my date-range filter?
- In an ideal world I would then stream out the resulting ordered union over Apache Arrow Flight, using record batches, where each ‘event-time hour’ is one record batch. Is there any way to customize how Dremio splits its output into record batches like this?
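I don’t know of a Dremio setting that controls output batch boundaries, hence the question; failing that, my fallback assumption is re-batching on the client side. Since the stream arrives already ordered, a single pass suffices (timestamps here are illustrative):

```python
from itertools import groupby
from datetime import datetime

# The merged, event-time-ordered stream (illustrative data).
events = [
    (datetime(2024, 1, 1, 10, 5), "e1"),
    (datetime(2024, 1, 1, 10, 40), "e2"),
    (datetime(2024, 1, 1, 11, 15), "e3"),
    (datetime(2024, 1, 1, 13, 0), "e4"),
]

# Because the stream is already sorted by event time, one pass with groupby
# yields one batch per event-time hour, each ready to be emitted as a
# record batch.
batches = [
    (hour, list(group))
    for hour, group in groupby(
        events, key=lambda e: e[0].replace(minute=0, second=0, microsecond=0)
    )
]
print([hour.hour for hour, _ in batches])
# [10, 11, 13]
```

The same one-pass grouping would apply whether the re-batching happens in Dremio or in the Flight client.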