Trace record origin in CSV data imports

I’m very new to Dremio, so accept my apologies if I have missed something obvious!

I’m experimenting with Dremio on a 20-year healthcare research dataset that I’ve managed via ETL and the traditional BI/OLAP approach. The raw data is in CSV files, and I’ve had great success by simply pointing Dremio to the folder containing them.

I have an issue though.

The source directory will contain three or four files for each year from about 1995 to the present. The problem is that the data itself does not have a field for the year. So once Dremio combines the data in the folder, I might see records like:

“Admissions”, 363
“Admissions”, 456
“Admissions”, 256

Where each record is from a different year.

In my ETL approach, I typically tagged records with the filename so that I always knew the origin of each record (essentially a composite key). The filename also contains the year, so I can create a year field from it during ETL. It would be nice if there were a way for Dremio to automatically identify the origin of a record by tracking the filename, and optionally to include that as a field in the view. Is there a Dremio way to do anything like this? I could always modify the data to add a year before it goes into the source folder, but that seems like the old way of thinking.

It seems like the origin would be important for any data loaded through a source folder, particularly if it is data that has slowly changing dimensions.

You can use directory names in your dataset. For example, your data could be:

/my-data/2018/11/my-files.csv

If you apply the formatting to the my-data directory, it applies to all files in all sub-directories. The directory levels then appear as columns named dir0 (2018), dir1 (11), and so on, which you can query like any other column.
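As a rough sketch of what that looks like in practice (the source name "my-source", the dataset path, and the columns measure and amount are just placeholders for your setup):

```sql
-- Assumed layout: /my-data/<year>/<month>/*.csv, formatted as a dataset on my-data.
-- dir0 and dir1 are the directory-level columns Dremio exposes for partitioned datasets.
SELECT
  CAST(dir0 AS INTEGER) AS "year",   -- first directory level, e.g. 2018
  CAST(dir1 AS INTEGER) AS "month",  -- second directory level, e.g. 11
  measure,
  amount
FROM "my-source"."my-data"
WHERE dir0 = '2018'
```

You could also save a query like this as a view so the year and month columns are always there for downstream users.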


That does it! Thanks.

Also found the documentation for it here:
https://docs.dremio.com/data-sources/files-and-directories.html#partitioned-datasets
