How to repeat the data curation steps in automated way in Cron job for different input files?

KrishnaPG · May 23, 2021, 12:03pm

We have two csv files in S3 that need to be imported and joined, which is straightforward to achieve in the UI.

The problem is, we get new data (two new csv files) every day onto the S3 and we need to repeat the same curation steps (import, join etc.) in automated way, without manually having to do it all again through UI?

Is it possible? Is there a way to save these data curation steps from UI and repeat them later with different file names?

balaji.ramaswamy · May 24, 2021, 6:50am

@KrishnaPG If the new CSV files are coming in to the same folder then once the CSV files are n S3, you can run “ALTER PDS REFRESH METADATA” and Dremio should know about the new files, if they go to a different folder then you should be able to promote to a PDS using an API call Dremio

http://docs.dremio.com/sql-reference/sql-commands/datasets/

KrishnaPG · May 25, 2021, 4:45am

Thank you @balaji.ramaswamy . Currently we join two CSV files to produce a merged dataset (lets call it result dataset).

When new input files come-in daily (two new CSV files each day), how to append these new merged result to the existing result dataset?

In other words, how to add “incremental” data to an existing merged result dataset? In the documentation we do not see anything like “INSERT INTO…” kind of command in the SQL section.

Is it possible to add new results to an existing result dataset (which is already a result of few join operations on old datasets) without losing old results / data?

balaji.ramaswamy · May 25, 2021, 6:44am

@KrishnaPG

Create a folder on S3 called “final”
Say on day 1 it has 2 CSV files
Promote final via Dremio and add the right formatting options

It will turn into a purple icon

On day 2, 2 more CSV files are added under the folder “final”

Wait for 2 hours and you should see the results of all 4 files when you click the purple icon or if you want to see the result at once run the below command

ALTER PDS ,S3-source-name>."final" refresh metadata

http://docs.dremio.com/sql-reference/sql-commands/datasets/

KrishnaPG · May 25, 2021, 7:20am

Thank you @balaji.ramaswamy The part of promoting and updating the folder content is clear.

However, I am facing the below issue. Kindly review the below:

SELECT * FROM (...) file1 INNER JOIN (...) file2 ON file1.col1 = file2.col1

I am merging two csv files on some condition to generate some results (lets call it the “output” dataset). The join query is something as shown above.

Now, tomorrow I will receive two more CSV files (which I promote) and I will be generating another joined “output” dataset tomorrow, which needs to be appended to the today’s output.

I am not clear on what is the best practice on:

How / where should I save the today’s output dataset (the result of the join of the csv files), so that it can grow day by day
Tomorrow, when I get new “output” (from the join of the two new csv files), how do I “append” that new output to the today’s existing output?

Programatically I can run the JOIN SQL queries every day on the CSV files (as per your above suggestions). But I am not clear about saving the generated output of the join operation of the CSV files. Since it is computed data that need to be retained somewhere and need to be appended everyday.

balaji.ramaswamy · May 26, 2021, 5:55am

@KrishnaPG You would need to create a Virtual dataset on the JOIN query and save it under a space/folder. Once you do this and you just refresh metadata for your PDS, Dremio will learn about the new files and the VDS will also use the new files as VDS is just like a database view

http://docs.dremio.com/working-with-datasets/virtual-datasets/

Topic		Replies	Views
New Onboarding Questions	2	1025	January 10, 2020
Create dataset format programmatically	3	1446	August 1, 2023
How to automatically transform a directory with json files into a virtual dataset	5	50	August 26, 2024
Dynamically Change datasource to my dremio query	2	1172	September 24, 2021
Failure mode for Dremio metadata refresh for asynchronously mutating data sets	4	1572	January 26, 2022

How to repeat the data curation steps in automated way in Cron job for different input files?

Related topics