Is Dremio suitable for big streaming data?

Hi all

First of all, congratulations on a really great piece of infrastructure. The premise is great.

I was wondering if Dremio is meant/suitable for big STREAMING data, for example 1TB a day of streaming logs.

Would one (realistically) be able to operate on and query the data as it's streaming?

So:

  • are there streaming connectors?
  • would Dremio understand when data is appended to a file on HDFS?
  • how about if the data is, e.g., in Elasticsearch, grows continuously, and switches index name every day?

I am thinking that if the above is a use case supported by Dremio, then it would offer a kind of Splunk-style “schema last” capability: reformat the data using SQL operators, get some results slowly at first, and faster later once one reindexes (a.k.a. reflections).

However, to use Dremio in this way, shouldn't Dremio offer better support for streaming results? I get the impression that in many cases Dremio would simply read the entire original file before returning any results, rather than giving you “some” first. Or is it simply a matter of putting a LIMIT in the SQL statement, in which case Dremio would in fact return just a few rows?

Sorry if I am unclear; I am still investigating this.

thanks in advance.


Thanks for the feedback, and keep it coming!

Let me try to answer some of your questions:

  1. Can Dremio query streaming data?

Not yet, but we are working on a connector for Kafka, and the ability to UNION over streaming and data at rest. If you have specific requests in this area please let us know.

  2. Can Dremio work with data that is growing ~1TB per day? Would Dremio understand when data is appended to a file on HDFS?

Yes! The approach depends on the source, and this is actually easier on S3, HDFS, and other file systems. Dremio's reflections can be updated incrementally, provided the data is append-only. Dremio supports many file formats and can read compressed files.

For rolling windows of time, you can define the window in a virtual dataset (VDS), then periodically update the VDS to advance the window. You could also create a VDS with a reflection for each period of time, such as one per week, then create a “parent” VDS that performs a UNION over these. Separately, a script that calls the Dremio API would prune old VDSs, add new ones, and update the definition of the parent VDS.
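As a rough illustration of the scripted approach, here is a minimal Python sketch that generates per-week VDS names and the SQL for the parent UNION VDS. The naming scheme (`logs_week_YYYY_MM_DD`, keyed to each week's Monday) and the VDS/parent names are assumptions for illustration, not a Dremio convention; the generated SQL would then be submitted through Dremio's API.

```python
from datetime import date, timedelta


def weekly_vds_names(end: date, weeks: int, prefix: str = "logs_week_") -> list:
    """Return hypothetical per-week VDS names for a rolling window ending at `end`.

    Each name is keyed to the Monday of its week, e.g. 'logs_week_2017_09_18'
    (this naming scheme is an assumption for the sketch).
    """
    monday = end - timedelta(days=end.weekday())  # Monday of the final week
    return [
        prefix + (monday - timedelta(weeks=i)).strftime("%Y_%m_%d")
        for i in range(weeks - 1, -1, -1)  # oldest week first
    ]


def parent_vds_sql(parent: str, children: list) -> str:
    """Build a statement that (re)defines the parent VDS as a UNION ALL
    over the per-week VDSs."""
    union = "\nUNION ALL\n".join("SELECT * FROM " + c for c in children)
    return "CREATE OR REPLACE VDS " + parent + " AS\n" + union
```

A maintenance job could run this daily: recompute the window, drop reflections/VDSs that fell out of it, create the new week's VDS, and replace the parent definition with the generated statement.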

  3. How about if the data is, e.g., in Elasticsearch, grows continuously, and switches index name every day?

For Elasticsearch, you can create a VDS that queries multiple indexes. So, if you're creating one index per day, you could name each one my_index_2017-09-22 and so on, then query them all together with SELECT * FROM elastic.my_index* or other index patterns supported by Elasticsearch.
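To make the daily-index scheme concrete, here is a tiny Python sketch that derives the day's index name and the wildcard query covering all days. The base name `my_index` comes from the example above; everything else is illustrative.

```python
from datetime import date


def daily_index_name(d: date, base: str = "my_index") -> str:
    """Elasticsearch index name for one day, e.g. 'my_index_2017-09-22'."""
    return base + "_" + d.isoformat()


def wildcard_query(base: str = "my_index") -> str:
    """SQL that queries all daily indexes at once via an index pattern."""
    return "SELECT * FROM elastic." + base + "*"
```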

We recently added a tutorial to help explain how to query across multiple indexes in Elasticsearch: https://www.dremio.com/tutorials/elasticsearch-sql-query-multiple-indexes/

Hope that helps!


Thanks so much, appreciated.

Looking forward to the API documentation, to try some scripting or an add-on based on it.

Great news about the Kafka connector, too.

Hi Kelly,

Thought I’d just tack onto this thread as I am curious about the status of the Kafka Connector you alluded to. I recall Jacques mentioning it as well at one point.

Cheers,
Lew


Hi Kelly, what is the status of the Kafka connector or Kafka KSQL integration?

Hi @kelly, any progress on the Kafka connector? Thanks.

I am interested in hearing about any updates on the Kafka connector as well 🙂

@i-love-doufunao

Dremio Hub is probably a good place to start.

Thanks
@balaji.ramaswamy