How to repeat data curation steps automatically in a cron job for different input files?

We have two CSV files in S3 that need to be imported and joined, which is straightforward to achieve in the UI.

The problem is that we get new data (two new CSV files) every day in S3, and we need to repeat the same curation steps (import, join, etc.) in an automated way, without manually doing it all again through the UI.

Is this possible? Is there a way to save these data curation steps from the UI and repeat them later with different file names?

@KrishnaPG If the new CSV files are coming into the same folder, then once the files are in S3 you can run “ALTER PDS … REFRESH METADATA” and Dremio will pick up the new files. If they go to a different folder, you should be able to promote that folder to a PDS using an API call.
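For example, assuming an S3 source named “s3” with a promoted folder “daily_csv” (both names illustrative, not from this thread), the refresh would look like:

```sql
-- Re-scan the promoted folder so Dremio picks up newly arrived CSV files
ALTER PDS s3."daily_csv" REFRESH METADATA
```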

Thank you @balaji.ramaswamy. Currently we join two CSV files to produce a merged dataset (let’s call it the result dataset).

When new input files come in daily (two new CSV files each day), how do we append the new merged results to the existing result dataset?

In other words, how do we add “incremental” data to an existing merged result dataset? In the documentation we do not see anything like an “INSERT INTO …” kind of command in the SQL section.

Is it possible to add new results to an existing result dataset (which is itself the result of a few join operations on older datasets) without losing the old results/data?


Create a folder on S3 called “final”
Say on day 1 it has 2 CSV files
Promote final via Dremio and add the right formatting options

It will turn into a purple icon

On day 2, 2 more CSV files are added under the folder “final”

Wait about 2 hours (for the scheduled metadata refresh) and you should see the results of all 4 files when you click the purple icon. Or, if you want to see the result at once, run the command below:

ALTER PDS &lt;S3-source-name&gt;."final" REFRESH METADATA
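If you do not want to wait for the scheduled refresh, that same statement can be driven from a daily cron job, for example through Dremio’s REST SQL endpoint. A minimal sketch (the coordinator URL, source/folder names, and token handling are assumptions, not from this thread):

```python
import json
import urllib.request

DREMIO_URL = "http://localhost:9047"  # assumption: default Dremio coordinator port


def refresh_sql(source: str, folder: str) -> str:
    """Build the metadata-refresh statement for a promoted folder (PDS)."""
    return f'ALTER PDS {source}."{folder}" REFRESH METADATA'


def submit_sql(sql: str, token: str) -> None:
    """Submit a SQL job through Dremio's REST API (POST /api/v3/sql)."""
    req = urllib.request.Request(
        f"{DREMIO_URL}/api/v3/sql",
        data=json.dumps({"sql": sql}).encode(),
        headers={
            "Authorization": f"Bearer {token}",  # token from POST /apiv2/login
            "Content-Type": "application/json",
        },
    )
    urllib.request.urlopen(req)


# In the cron job, after obtaining a token:
#   submit_sql(refresh_sql("s3", "final"), token)
```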

Thank you @balaji.ramaswamy. The part about promoting the folder and refreshing its contents is clear.

However, I am facing the issue below. Kindly review:

SELECT * FROM (...) file1 INNER JOIN (...) file2 ON file1.col1 = file2.col1

I am merging two CSV files on some condition to generate results (let’s call this the “output” dataset). The join query looks like the one shown above.

Now, tomorrow I will receive two more CSV files (which I will promote) and will generate another joined “output” dataset, which needs to be appended to today’s output.

I am not clear on the best practice for:

  1. How/where should I save today’s output dataset (the result of joining the CSV files), so that it can grow day by day?
  2. Tomorrow, when I get a new “output” (from the join of the two new CSV files), how do I “append” it to today’s existing output?

Programmatically I can run the JOIN SQL queries every day on the CSV files (as per your suggestions above). But I am not clear about saving the generated output of the join: since it is computed data, it needs to be retained somewhere and appended to every day.

@KrishnaPG You would need to create a virtual dataset (VDS) from the JOIN query and save it under a space/folder. Once you do this, you just refresh the metadata for your PDS; Dremio will learn about the new files, and the VDS will use them too, since a VDS is just like a database view.
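A sketch of what that could look like, assuming a space named “curated”, an S3 source “s3”, and promoted folders “file1” and “file2” (all names illustrative, not from this thread):

```sql
-- Save the join as a VDS; like a view, it is re-evaluated on each query,
-- so files picked up by REFRESH METADATA flow into the result automatically
CREATE VDS curated."daily_output" AS
SELECT f1.*, f2.col2
FROM s3."file1" AS f1
INNER JOIN s3."file2" AS f2
  ON f1.col1 = f2.col1
```

Since the VDS is just a saved query, nothing needs to be explicitly “appended”: each day’s new files enlarge the underlying PDSs, and the join output grows with them.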