First of all, congratulations on a really great piece of infrastructure. The premises are great.
I was wondering if Dremio is meant for / suitable for big STREAMING data, for example 1TB a day of streaming logs.
Would one (realistically) be able to operate on and query the data as it's streaming?
- are there streaming connectors?
- would Dremio understand when data is appended to a file on HDFS?
- how about if the data is e.g. in Elasticsearch and grows continuously and switches index name every day?
I am thinking that if the above use case is supported by Dremio, it would effectively provide a kind of "Splunk"-style schema-last capability: reformat the data using SQL operators, get some results slowly at first, and get them faster later, after one reindexes (a.k.a. reflections).
However, to be used this way, shouldn't Dremio offer better support for streaming results? I get the impression that in many cases Dremio would simply read the entire original file before returning any results, rather than giving you "some" first… (or is it just a matter of putting a LIMIT in the SQL statement, in which case Dremio would in fact return only a few rows?)
Sorry if I am unclear; I am still investigating this.
Thanks in advance.
Thanks for the feedback, and keep it coming!
Let me try to answer some of your questions:
- can Dremio query streaming data?
Not yet, but we are working on a connector for Kafka, and on the ability to UNION over streaming data and data at rest. If you have specific requests in this area, please let us know.
- can Dremio work with data that is growing ~1TB per day? would Dremio understand when data is appended to a file on HDFS?
Yes! The approach depends on the source, and this is actually easier on S3, HDFS, and other file systems. Dremio's reflections can be updated incrementally, provided the data is append-only. Dremio supports many file formats and can read compressed files.
For rolling windows of time, you can define the window in a virtual dataset (VDS), then periodically update the VDS to move the window. You could also create a VDS with a reflection for each period of time, such as one per week, then create a "parent" VDS that performs a UNION over these. Separately, you would prune off old VDSs, add new ones, and update the definition of the parent VDS through a script that calls the Dremio API.
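To illustrate the scripted approach, here is a minimal Python sketch that builds the UNION ALL definition a "parent" VDS could use over weekly child VDSs. The names here (`myspace`, `logs_week_*`) are hypothetical, and actually creating or updating the VDS would be a separate call to the Dremio REST API, which is not shown.

```python
from datetime import date, timedelta

def weekly_vds_names(end: date, weeks: int, prefix: str = "logs_week") -> list:
    """Generate one hypothetical child-VDS name per ISO week in the rolling window."""
    names = []
    for i in range(weeks):
        day = end - timedelta(weeks=i)
        year, week, _ = day.isocalendar()
        names.append(f"{prefix}_{year}_{week:02d}")
    return names

def parent_vds_sql(child_names: list) -> str:
    """Build the SQL definition for a parent VDS that UNIONs the weekly children."""
    selects = [f'SELECT * FROM myspace."{name}"' for name in child_names]
    return "\nUNION ALL\n".join(selects)

# A four-week rolling window ending on an arbitrary date:
sql = parent_vds_sql(weekly_vds_names(date(2017, 9, 22), weeks=4))
print(sql)
```

A maintenance script would regenerate this SQL on a schedule, drop the oldest child VDS, and push the new definition to Dremio.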
- how about if the data is e.g. in Elasticsearch and grows continuously and switches index name every day?
For Elasticsearch, you can create a VDS that queries multiple indexes. So, if you're creating one index per day, you could name each index my_index_2017-09-22 and so on, then query them all together with SELECT * FROM elastic.my_index* or other index wildcard patterns supported by Elasticsearch.
We recently added a tutorial to help explain how to query across multiple indexes in Elasticsearch: https://www.dremio.com/tutorials/elasticsearch-sql-query-multiple-indexes/
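To make the daily-index naming concrete, here is a small Python sketch that generates the dated index name for a given day and the corresponding wildcard query. The source name elastic and the base name my_index follow the example above; everything else is an assumption.

```python
from datetime import date

def daily_index_name(day: date, base: str = "my_index") -> str:
    """Name the day's Elasticsearch index, e.g. my_index_2017-09-22."""
    return f"{base}_{day.isoformat()}"

def wildcard_query(base: str = "my_index") -> str:
    """A single query that matches every daily index via a wildcard."""
    return f"SELECT * FROM elastic.{base}*"

print(daily_index_name(date(2017, 9, 22)))  # my_index_2017-09-22
print(wildcard_query())                     # SELECT * FROM elastic.my_index*
```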
Hope that helps!
Thanks so much, appreciated.
Looking forward to the API documentation so I can try some scripting, or build an add-on based on it.
Great news about the Kafka connector, too.
Thought I’d just tack onto this thread as I am curious about the status of the Kafka Connector you alluded to. I recall Jacques mentioning it as well at one point.
Hi @kelly, what is the status of the Kafka connector or Kafka KSQL integration?
Hi @kelly, any progress on the Kafka connector? Thanks.
I am interested in hearing about any updates on the Kafka connector as well.
Dremio Hub is probably a good place to start.
Hi @kelly, any progress on streaming SQL?
@koolay, to my knowledge, there has been little focus directed toward developing a Kafka connector.
At this point, I think an ambitious community member will have to take the initiative to implement something and submit it to Dremio Hub.
Hi, my name is Fabrice, and I maintain the dbt-dremio adapter. Materialize is a PostgreSQL-compatible streaming database. Here is my streaming question: I would like to connect Dremio to Materialize as a PostgreSQL source. If that succeeds, I would be able to use Dremio's federation capabilities to mix streaming and batch data. That way, I would not have to ingest and maintain reference and master data in Materialize. Has anybody already tried that? I am going to give the idea a try and share my feedback for those interested in mixing streaming and batch data in a simple way.