S3: streaming JSON files in a bucket as new documents/records


I created a datasource from an S3 bucket that contains JSON files (records or documents).
New JSON files continuously arrive in the bucket.
However, when I query the datasource with Dremio, the new records do not appear.
Is this pattern applicable with Dremio + S3?
Or should the datasource be MongoDB or something else?

Hi @mlb

How frequently are new files arriving in the bucket? What is the metadata refresh policy for your source? The default is 1 hour, meaning the metadata for the source is updated every hour. If a new file is created in the source within that hour, it will not be visible in Dremio until the next refresh.

So you need to set the metadata refresh interval according to your workload, i.e. how frequently new files are written and how frequently newly created files are queried from Dremio.
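To make the trade-off concrete, here is a minimal sketch of when a new file becomes visible under a fixed refresh schedule. It is a simplified model, not Dremio's actual scheduler: it assumes refreshes are instantaneous and run exactly every `interval` starting at `first_refresh`. The function name and all parameters are hypothetical.

```python
from datetime import datetime, timedelta


def is_file_visible(file_created, query_time, first_refresh, interval):
    """Simplified model: is a file picked up by a refresh before the query?"""
    # No refresh has run yet, so nothing new can be visible.
    if query_time < first_refresh:
        return False
    # Whole refresh intervals completed by the time of the query.
    completed = (query_time - first_refresh) // interval
    last_refresh = first_refresh + completed * interval
    # The file is visible only if a refresh ran at or after its creation.
    return last_refresh >= file_created


start = datetime(2020, 1, 1, 12, 0)
created = start + timedelta(minutes=10)   # file lands at 12:10
query = start + timedelta(minutes=30)     # query runs at 12:30

hourly = is_file_visible(created, query, start, timedelta(hours=1))
per_minute = is_file_visible(created, query, start, timedelta(minutes=1))
```

In this model the file created at 12:10 is missed by a 12:30 query under an hourly refresh, but seen under a one-minute refresh. A shorter interval lowers staleness at the cost of more refresh work.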


Thanks @Venugopal_Menda -
I figured this out and reduced it to 1 minute. I am considering MongoDB as a real-time alternative datasource.

Is it heavy if the metadata is refreshed thousands of times per second?


What is the reason for refreshing metadata 1000 times per second? If the entire source does not need to be refreshed, you can refresh only the datasets you need by using the command below.
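The command itself is not preserved in this thread. As an illustration only, Dremio's SQL reference documents a per-dataset refresh statement of the form `ALTER PDS <path> REFRESH METADATA`; whether that is the exact command meant here is an assumption. The sketch below only builds the statement text for one dataset. The helper name and the example path are hypothetical, and how the statement is submitted (UI, JDBC/ODBC, REST) is left out.

```python
def refresh_metadata_sql(path_parts):
    """Build a per-dataset metadata refresh statement (assumed syntax)."""
    if not path_parts:
        raise ValueError("dataset path must have at least one component")
    for part in path_parts:
        # Keep the sketch simple: refuse names that would need escaping.
        if not part or '"' in part:
            raise ValueError(f"unsupported path component: {part!r}")
    # Quote each component so names with dots or dashes stay intact.
    quoted = ".".join(f'"{part}"' for part in path_parts)
    return f"ALTER PDS {quoted} REFRESH METADATA"


# Hypothetical source, bucket, and folder names.
statement = refresh_metadata_sql(["s3source", "my-bucket", "events"])
```

Refreshing a single dataset this way touches only that dataset's metadata rather than the whole source, which is the point made above.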



Because the records of the PDS come from a Kafka stream, and we want to query them in real time.


We will be coming out with features that do not require so many metadata requests. They will arrive in a near-future release.


Thanks for your reply.
Is there a timeline for it?

Dremio is a great product with the best performance.
Unfortunately, Dremio cannot support streaming the way Delta Lake and Hudi do.
And streaming is becoming more and more important.


Expect V1 by EOY 2020