Does Dremio support a ‘Streaming union’

For lack of a better description I’m looking for ‘streaming union’ functionality in Dremio.

Let’s say I have data from N sensors which I save on S3 in Parquet format. The data of all sensors is sorted by event time, and all sensors share the same schema. Each night new sensor data is loaded in batches for all sensors (append-only). I would probably have a directory structure on S3 with a VDS defined on top of this. (Either one VDS for all sensors with an extra partition on sensor type, or one VDS per sensor type; I’m not sure yet.)

Given a specific date-range filter, I would often like to ‘union all’ two or more sensor event streams into one, where the resulting stream maintains the same event-time order. So multiple streams come in already sorted by event time, and the output is the interleaved union of the events, also sorted by event time.

Can Dremio efficiently handle this case? Especially:

  • given that all N inputs and the desired output all share the same ordering, the ‘streaming union’ can be very efficient without needing to hold the entire inputs in memory. Does Dremio make use of that?
  • how does Dremio know that my raw data is already ordered by event time? Is that a trait I need to communicate explicitly?
  • if I were to create record batches (or rather the Parquet equivalent, row groups) per e.g. hour, could Dremio use this knowledge to effectively prune the inputs based on my date-range filter?
  • in an ideal world I would then stream out the resulting ordered union over Apache Arrow Flight, using record batches, where each event-time hour is one record batch. Is there any way to custom-define how Dremio splits its output into record batches like this?
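For reference, the operation the first bullet describes is a classic k-way merge of pre-sorted inputs. A minimal sketch in Python (using `heapq.merge` from the standard library, with generators standing in for lazy per-sensor scans; the sensor names and timestamps are made up for illustration):

```python
import heapq

# Two sensor streams, each already sorted by event time (epoch seconds).
# Generators stand in for lazy scans; nothing is fully materialized.
def sensor_a():
    yield from [(100, "a1"), (300, "a2"), (500, "a3")]

def sensor_b():
    yield from [(200, "b1"), (400, "b2")]

# heapq.merge keeps only one pending record per input in memory and
# emits the interleaved union in event-time order as it goes.
merged = list(heapq.merge(sensor_a(), sensor_b(), key=lambda r: r[0]))
print(merged)
# [(100, 'a1'), (200, 'b1'), (300, 'a2'), (400, 'b2'), (500, 'a3')]
```

The point of the sketch: because every input is already ordered, the merge only ever needs one record per input in memory, so it can start emitting output immediately.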

Many thanks,
Geert

@gebrits

Looks like ‘streaming union’ is a Spark feature. Dremio is a SQL engine on data lakes, but a lot of companies do use it as an ETL tool.

  • If you are asking about parallelism of the scan, it depends on 3 factors: #1 number of cores, #2 splits, #3 estimated row count
  • If your Parquet file is ordered by a certain field, Dremio will leverage that. You can also create a reflection and sort/partition it by a certain field

https://hello.dremio.com/wp-data-reflections-best-practice-and-overview.html

  • As you get new records on the lake (e.g. HDFS), Dremio’s background metadata refresh (the frequency can be set) will discover them
  • Yes, Dremio will prune partitions and push down filters into the scan so that it reads less data
  • I do not think you can define the record batch boundaries, but the record batch size can be set
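To make the partition-pruning point concrete: with a hypothetical hour-partitioned layout like `sensor=temp/hour=2020-01-01-23/part0.parquet` (the path scheme here is an assumption, not Dremio’s actual layout), a date-range filter can be answered from path metadata alone, skipping files without opening them. A minimal sketch of the idea:

```python
from datetime import datetime

# Hypothetical hour-partitioned file listing, as the question proposes.
files = [
    "sensor=temp/hour=2020-01-01-22/part0.parquet",
    "sensor=temp/hour=2020-01-01-23/part0.parquet",
    "sensor=temp/hour=2020-01-02-00/part0.parquet",
    "sensor=temp/hour=2020-01-02-01/part0.parquet",
]

def hour_of(path):
    # Parse the hour partition value out of the path.
    token = path.split("hour=")[1].split("/")[0]
    return datetime.strptime(token, "%Y-%m-%d-%H")

lo = datetime(2020, 1, 1, 23)
hi = datetime(2020, 1, 2, 0)

# Prune on path metadata alone: no data file needs to be opened.
pruned = [f for f in files if lo <= hour_of(f) <= hi]
print(pruned)  # only the two files inside the range survive
```

This is the same mechanism a partition-aware engine applies before the scan starts; the files outside the range never contribute I/O.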

Yeah, ‘streaming union’ isn’t the best of terms, since I realize Dremio is not a streaming solution.

  • What I meant was: if both input datasets are sorted by timestamp and the output dataset is to be sorted by timestamp as well, the algorithm implementing the sort can be very space- and time-efficient. Basically, while scanning it can output records in the correct order right away, instead of having to scan the entire input datasets first. (This resembles operating on streams, in a way.)

  • Ah, so while creating a Parquet file you’d probably include in the metadata which column it is sorted on, so this knowledge can be used?

  • Has being able to define output record batches by anything other than record count/size come up? I imagine this could be a pretty useful feature in a number of use cases.
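Whether Dremio can expose this is for the Dremio folks to answer, but the splitting itself is cheap once the stream is ordered: batching by event-time hour is just a group-by on a key that only ever increases. A sketch with `itertools.groupby` (the event timestamps are invented for illustration):

```python
from itertools import groupby

# Merged events, already sorted by event time (epoch seconds).
events = [3600, 3700, 7200, 7300, 7400, 10800]

# Because the stream is sorted, grouping by the hour key yields one
# contiguous "record batch" per event-time hour, emitted lazily.
batches = [list(g) for _, g in groupby(events, key=lambda t: t // 3600)]
print(batches)
# [[3600, 3700], [7200, 7300, 7400], [10800]]
```

Note that this only works as a streaming operation because the input is sorted; on an unsorted stream `groupby` would produce fragmented groups.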

Many thanks,