Trace record origin in CSV data imports

I’m very new to Dremio, so accept my apologies if I have missed something obvious!

I’m experimenting with Dremio on a 20-year healthcare research dataset that I’ve managed via ETL and the traditional BI/OLAP approach. The raw data is in CSV files, and I’ve had great success by simply pointing Dremio to the folder containing them.

I have an issue though.

The source directory will contain three or four files for each year from about 1995 to the present. The problem is that the data itself does not have a field for the year. So once Dremio combines the data in the folder, I might see records like:

“Admissions”, 363
“Admissions”, 456
“Admissions”, 256

Where each record is from a different year.

In my ETL approach, I typically tagged records with the filename so that I always knew the origin of each record (essentially a composite key). The filename also contains the year, so I can create a year field from it during ETL. It would be nice if there were a way for Dremio to automatically identify the origin of a record by tracking the filename, and optionally to include that as a field in the view. Is there a Dremio way to do anything like this? I could always modify the data to add a year before it goes into the source folder, but that seems like the old way of thinking.

It seems like the origin would be important for any data loaded through a source folder, particularly if it is data that has slowly changing dimensions.

You can use directory names in your dataset. For example, your data could be:

/my-data/2018/11/my-files.csv

If you apply the formatting to the my-data directory, it applies to all files in all sub-directories. The directory levels then appear as columns named dir0 (2018), dir1 (11), and so on, which you can query like any other column.
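As a rough sketch of what that looks like in practice (the source name "my-source", the dataset path, and the columns measure and amount are just placeholders for your setup):

```sql
-- Assumed layout: /my-data/<year>/<month>/*.csv, formatted as a dataset on my-data.
-- dir0 and dir1 are the directory-level columns Dremio exposes for partitioned datasets.
SELECT
  CAST(dir0 AS INTEGER) AS "year",   -- first directory level, e.g. 2018
  CAST(dir1 AS INTEGER) AS "month",  -- second directory level, e.g. 11
  measure,
  amount
FROM "my-source"."my-data"
WHERE dir0 = '2018'
```

You could also save a query like this as a view so the year and month columns are always there for downstream users.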


That does it! Thanks.

Also found the documentation for it here:
https://docs.dremio.com/data-sources/files-and-directories.html#partitioned-datasets
