S3 : streaming JSON files in a bucket as new documents / records

mlb · June 5, 2019, 10:35am

Hi,

I created a datasource from an S3 bucket that contains JSON files (records or documents).
New JSON files are continuously arriving in the bucket.
However, when I query the datasource with Dremio, the new records do not appear.
Is this pattern applicable with Dremio + S3?
Or should the datasource be MongoDB or something else?

Venugopal_Menda · June 6, 2019, 5:03am

Hi @mlb

How frequently are new files arriving in the bucket? What is the metadata refresh policy for your source? The default is 1 hour, meaning the metadata for the source is updated every hour; if a new file is created in the source within that hour, it will not be visible in Dremio until the next refresh.

So you need to set the metadata refresh interval according to your workload, i.e. how frequently new files are written and how soon newly created files need to be queried from Dremio.
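To make the trade-off concrete, here is a minimal sketch of how long a new file can stay invisible under a given refresh interval. It is a simplified model, not Dremio's actual scheduler: it assumes refreshes fire at fixed multiples of the interval and finish instantly, and the function name `next_refresh_after` is hypothetical.

```python
def next_refresh_after(arrival_min, interval_min):
    """Minute of the first refresh strictly after a file arrives.

    Assumption (not Dremio's real scheduling): refreshes fire at
    0, interval, 2*interval, ... minutes and complete instantly.
    """
    if interval_min <= 0:
        raise ValueError("refresh interval must be positive")
    return (arrival_min // interval_min + 1) * interval_min


# A file landing 5 minutes after the last refresh:
for interval in (60, 1):
    wait = next_refresh_after(5, interval) - 5
    print(f"interval={interval} min -> file visible after {wait} min")
```

With the default 1-hour interval the file waits 55 minutes in this model; with a 1-minute interval it waits 1 minute, at the cost of 60 times as many refreshes.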

@Venugopal_Menda

mlb · June 6, 2019, 7:13am

Thanks @Venugopal_Menda.
I figured this out and reduced it to 1 minute. I am considering MongoDB as an alternative real-time datasource.

koolay · September 4, 2020, 2:23am

Is it heavy if the metadata is refreshed thousands of times per second?

balaji.ramaswamy · September 8, 2020, 6:17am

@koolay

What is the reason for refreshing metadata 1000 times per second? If the entire source does not need to be refreshed, you can refresh only the datasets you need by using the command described here:

https://docs.dremio.com/sql-reference/sql-commands/datasets.html#managing-physical-datasets
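As a rough illustration, the per-dataset refresh on that page is an `ALTER PDS <path> REFRESH METADATA` statement. The sketch below only builds the SQL text for one dataset; the helper name and the example path (`my_s3_source`, `events`) are hypothetical, and the double-quote escaping is an assumption based on standard SQL identifier quoting. How you submit the statement (UI, JDBC/ODBC, REST) is left out.

```python
def refresh_metadata_sql(path_parts):
    """Build an ALTER PDS ... REFRESH METADATA statement for one dataset.

    path_parts is the dataset path as a list, e.g. [source, folder, ...].
    Each component is wrapped in double quotes; embedded double quotes are
    doubled (assumed standard SQL escaping, check against your version).
    """
    if not path_parts:
        raise ValueError("dataset path must have at least one component")
    quoted = ".".join('"' + part.replace('"', '""') + '"' for part in path_parts)
    return f"ALTER PDS {quoted} REFRESH METADATA"


# Hypothetical source and folder names, for illustration only.
print(refresh_metadata_sql(["my_s3_source", "events"]))
```

Refreshing a single dataset this way avoids re-listing every folder in the source, which is the point of the suggestion above.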

koolay · September 8, 2020, 8:15am

@balaji.ramaswamy

Because the records of the PDS come from a Kafka stream, and we want to query them in real time.

balaji.ramaswamy · September 26, 2020, 7:18am

@koolay

We will be coming out with features that do not require so many metadata requests. They will arrive in a near-future release.

koolay · September 29, 2020, 6:45am

@balaji.ramaswamy

Thanks for your reply.
Is there a timeline for it?

Dremio is a great product with the best performance.
Unfortunately, Dremio cannot support streaming the way Delta Lake and Hudi do.
And streaming is becoming more and more important.

balaji.ramaswamy · October 1, 2020, 3:13am

@koolay

Expect V1 by EOY 2020

Thanks
Bali
