Does Dremio support a ‘Streaming union’

For lack of a better description I’m looking for ‘streaming union’ functionality in Dremio.

Let’s say I have data from N sensors which I save on S3 in Parquet format. The data of all sensors is sorted by event time, and all sensors share the same schema. Each night new sensor data is loaded in batches for all sensors (append-only). I would probably have a directory structure on S3 with a VDS defined on top of this. (Either one VDS for all sensors with an extra partition on sensor type, or one VDS per sensor type; I’m not sure yet.)

Given a specific date-range filter, I would often like to ‘union all’ two or more sensor event streams into one, where the resulting stream maintains the same event-time order. So multiple streams come in already sorted by event time, and the output is the interleaved union of the events, also sorted by event time.

Can Dremio efficiently handle this case? Especially:

  • given that all N inputs and the desired output all share the same ordering, the ‘streaming union’ can be very efficient without needing to hold the entire inputs in memory. Does Dremio make use of that?
  • how does Dremio know that my raw data is already ordered by event time? Is that a trait I need to communicate explicitly?
  • if I were to create record batches (or rather the Parquet equivalent, row groups) per e.g. hour, could Dremio use this knowledge to effectively prune the inputs based on my date-range filter?
  • in an ideal world I would then stream out the resulting ordered union over Apache Arrow Flight, using record batches, where each event-time hour is one record batch. Is there any way to custom-define how Dremio splits its output into record batches like this?
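For reference, the operation the first bullet describes is a classic k-way merge of pre-sorted inputs. A minimal sketch in Python (using `heapq.merge` from the standard library, with generators standing in for lazy per-sensor scans; the sensor names and timestamps are made up for illustration):

```python
import heapq

# Two sensor streams, each already sorted by event time (epoch seconds).
# Generators stand in for lazy scans; nothing is fully materialized.
def sensor_a():
    yield from [(100, "a1"), (300, "a2"), (500, "a3")]

def sensor_b():
    yield from [(200, "b1"), (400, "b2")]

# heapq.merge keeps only one pending record per input in memory and
# emits the interleaved union in event-time order as it goes.
merged = list(heapq.merge(sensor_a(), sensor_b(), key=lambda r: r[0]))
print(merged)
# [(100, 'a1'), (200, 'b1'), (300, 'a2'), (400, 'b2'), (500, 'a3')]
```

The point of the sketch: because every input is already ordered, the merge only ever needs one record per input in memory, so it can start emitting output immediately.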

Many thanks,
Geert

@gebrits

Looks like ‘streaming union’ is a Spark feature. Dremio is a SQL engine on data lakes, but a lot of companies do use it as an ETL tool.

  • If you are asking about parallelism of the scan, it depends on 3 factors: #1 number of cores, #2 splits, #3 estimated row count
  • If your Parquet file is ordered by a certain field, Dremio will leverage that. You can also create a reflection and sort/partition it by a certain field

https://hello.dremio.com/wp-data-reflections-best-practice-and-overview.html

  • As you get new records on the lake (e.g. HDFS), Dremio’s background metadata refresh (the frequency can be set) will discover them
  • Yes, Dremio will prune partitions and push down filters into the scan so that it reads less data
  • I do not think you can define the record batch boundaries, but the record batch size can be set
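To make the partition-pruning point concrete: with a hypothetical hour-partitioned layout like `sensor=temp/hour=2020-01-01-23/part0.parquet` (the path scheme here is an assumption, not Dremio’s actual layout), a date-range filter can be answered from path metadata alone, skipping files without opening them. A minimal sketch of the idea:

```python
from datetime import datetime

# Hypothetical hour-partitioned file listing, as the question proposes.
files = [
    "sensor=temp/hour=2020-01-01-22/part0.parquet",
    "sensor=temp/hour=2020-01-01-23/part0.parquet",
    "sensor=temp/hour=2020-01-02-00/part0.parquet",
    "sensor=temp/hour=2020-01-02-01/part0.parquet",
]

def hour_of(path):
    # Parse the hour partition value out of the path.
    token = path.split("hour=")[1].split("/")[0]
    return datetime.strptime(token, "%Y-%m-%d-%H")

lo = datetime(2020, 1, 1, 23)
hi = datetime(2020, 1, 2, 0)

# Prune on path metadata alone: no data file needs to be opened.
pruned = [f for f in files if lo <= hour_of(f) <= hi]
print(pruned)  # only the two files inside the range survive
```

This is the same mechanism a partition-aware engine applies before the scan starts; the files outside the range never contribute I/O.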

Yeah, ‘streaming union’ isn’t the best of terms, since I realize Dremio is not a streaming solution.

  • What I meant was: if both input datasets are sorted by timestamp and the output dataset is to be sorted by timestamp as well, the algorithm implementing the sort can be very space- and time-efficient. Basically, while scanning it can output records in the correct order right away, instead of having to scan the entire input datasets first. (This resembles operating on streams, in a way.)

  • Ah, so while creating a Parquet file you’d probably include in the metadata which column it is sorted on, so this knowledge can be used?

  • Has being able to define output record batches by anything other than record count/size come up? I imagine this could be a pretty useful feature in a number of use cases.
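Whether Dremio can expose this is for the Dremio folks to answer, but the splitting itself is cheap once the stream is ordered: batching by event-time hour is just a group-by on a key that only ever increases. A sketch with `itertools.groupby` (the event timestamps are invented for illustration):

```python
from itertools import groupby

# Merged events, already sorted by event time (epoch seconds).
events = [3600, 3700, 7200, 7300, 7400, 10800]

# Because the stream is sorted, grouping by the hour key yields one
# contiguous "record batch" per event-time hour, emitted lazily.
batches = [list(g) for _, g in groupby(events, key=lambda t: t // 3600)]
print(batches)
# [[3600, 3700], [7200, 7300, 7400], [10800]]
```

Note that this only works as a streaming operation because the input is sorted; on an unsorted stream `groupby` would produce fragmented groups.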

Many thanks,