Partitioned Datasets

I have an S3 bucket with nested folders. Each level contains either folders or files, and the files at a given level do not necessarily share the same structure. At the 3rd level there are thousands of json files with mostly the same structure.
Let's say I am only interested in the json file at the 3rd level whose filename is "file_3_10.json". This file is in the folder "folder_10", which is inside the parent folder "run", which is in the top-level folder "data":
data / run / folder_10 / file_3_10.json

What should the SQL statement be?
And what is the “FROM” source? Do I need to make each folder a dataset in order to query it? Or can I just use the file system structure without having to first turn each folder into a dataset?

I think each query against a batch file-based data source is effectively a union per folder. If you can identify which subfolders contain the same structure, you can loop through those folders in a UNION statement. If that isn't repeatable across multiple physical dataset folders in the same query, you could also generate a batch of SELECT queries in a scripted manner using the Dremio API (for example, with Python) and then union them all together.
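As a minimal sketch of the scripted approach: the helper below builds Dremio-style dotted dataset paths (each segment double-quoted) and joins per-folder SELECTs with UNION ALL. The source name `s3` and the sibling folder names other than `folder_10` are assumptions for illustration.

```python
def dataset_path(source, *parts):
    """Build a Dremio dataset path like s3."data"."run"."folder_10"."file_3_10.json"."""
    return ".".join([source] + [f'"{p}"' for p in parts])

def build_union_query(paths):
    """Join one SELECT per dataset path with UNION ALL."""
    return "\nUNION ALL\n".join(f"SELECT * FROM {p}" for p in paths)

# Hypothetical sibling folders that share the same JSON structure.
folders = ["folder_10", "folder_11", "folder_12"]
paths = [dataset_path("s3", "data", "run", f, "file_3_10.json") for f in folders]
sql = build_union_query(paths)
print(sql)
```

The generated statement could then be submitted through Dremio's SQL API; the point is only that the per-folder SELECTs are mechanical to produce once you know which folders share a schema.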


You can promote a file or folder at any level using the POST catalog/{id} REST call.
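As a sketch of that call: Dremio's REST API promotes a file or folder to a physical dataset via POST /api/v3/catalog/{id}, where the entity id is first retrieved (for example with GET /api/v3/catalog/by-path/...). The coordinator address, the entity id shown, and the path segments below are assumptions for illustration.

```python
import urllib.parse

DREMIO = "http://localhost:9047"  # assumed coordinator address

def promotion_request(entity_id, path_parts, fmt_type="JSON"):
    """Build the URL and JSON body for POST /api/v3/catalog/{id},
    which promotes a file or folder to a physical dataset."""
    url = f"{DREMIO}/api/v3/catalog/{urllib.parse.quote(entity_id, safe='')}"
    body = {
        "entityType": "dataset",
        "id": entity_id,
        "path": path_parts,
        "type": "PHYSICAL_DATASET",
        "format": {"type": fmt_type},
    }
    return url, body

# Hypothetical id, as would be returned by GET /api/v3/catalog/by-path.
url, body = promotion_request(
    "dremio:/s3/data/run/folder_10",
    ["s3", "data", "run", "folder_10"],
)
```

You would then send `body` as JSON to `url` with your auth token (e.g. via `requests.post`); once promoted, the folder is queryable like any other dataset.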