S3: streaming JSON files in a bucket as new documents / records

Hi,

I created a datasource from an S3 bucket that contains JSON files (records or documents).
New JSON files arrive in the bucket continuously.
However, when I query the datasource with Dremio, the new records do not appear.
Is this pattern applicable with Dremio + S3?
Or should the datasource be MongoDB or something else?

Hi @mlb

How frequently are new files arriving in the bucket? What is the metadata refresh policy for your source? The default is 1 hour, meaning the source's metadata is updated every hour; a file created in the source within that hour will not be visible in Dremio until the next refresh.

So you need to set the metadata refresh interval according to your workload, i.e. how frequently new files are written and how soon newly created files need to be queried from Dremio.

@Venugopal_Menda

Thanks @Venugopal_Menda.
I figured this out and reduced it to 1 minute. I am considering MongoDB as an alternative datasource for real-time queries.

Is it heavy if the metadata is refreshed thousands of times per second?

@koolay

What is the reason for refreshing metadata 1000 times per second? If the entire source does not need to be refreshed, you can refresh only the datasets you need by using the command below:

https://docs.dremio.com/sql-reference/sql-commands/datasets.html#managing-physical-datasets
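
For reference, a minimal sketch of what a per-dataset refresh statement could look like, built in Python. It assumes the `ALTER PDS <path> REFRESH METADATA` form described on the page linked above; the source, bucket, and folder names are hypothetical placeholders, and the helper functions are illustrative rather than part of any Dremio client.

```python
from typing import Sequence


def quote_identifier(part: str) -> str:
    """Wrap one path component in double quotes.

    Doubling embedded quotes is standard SQL escaping; it is assumed
    (not verified here) that Dremio accepts it as well.
    """
    return '"' + part.replace('"', '""') + '"'


def refresh_metadata_sql(path_parts: Sequence[str]) -> str:
    """Build a refresh statement for a single physical dataset (PDS).

    path_parts is the dataset path split into components,
    e.g. source name, bucket, folder.
    """
    if not path_parts:
        raise ValueError("dataset path must have at least one component")
    dataset = ".".join(quote_identifier(p) for p in path_parts)
    return f"ALTER PDS {dataset} REFRESH METADATA"


if __name__ == "__main__":
    # Hypothetical names: an S3 source "my_s3_source" with a folder of JSON files.
    print(refresh_metadata_sql(["my_s3_source", "my-bucket", "events"]))
```

The resulting statement can be run from the Dremio SQL editor or any SQL client connected to Dremio. Refreshing one dataset this way touches less metadata than refreshing the whole source, but it is still a polling approach, not streaming.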

@balaji.ramaswamy

Because the records in the PDS come from a Kafka stream, and we want to query them in real time.

@koolay

We are working on features that will not require so many metadata requests. They will arrive in a near-future release.

@balaji.ramaswamy

Thanks for your reply.
Is there a timeline for it?

Dremio is a great product with the best performance.
Unfortunately, Dremio cannot support streaming the way Delta Lake and Hudi do.
And streaming is becoming more and more important.

@koolay

Expect V1 by EOY 2020

Thanks
Bali