Partitioned Datasets

ayaueto · August 17, 2020, 5:18pm

I have an S3 bucket with nested folders. Each level has either folders or files. The files in each level do not necessarily share the same structure. In the 3rd level, there are thousands of json files with mostly the same structure.
Let’s say I am only interested in the json file in the 3rd level where the filename is “file_3_10.json”. This file is found in the folder with folder name “folder_10”. This folder is located in the parent folder “run”. This “run” folder is in the top folder “data”.
data / run / [foldername=“folder_10”] / [filename=“file_3_10.json”]

What should the SQL statement be?
And what is the “FROM” source? Do I need to make each folder a dataset in order to query it? Or can I just use the file system structure without having to first turn each folder into a dataset?
Thanks,

datocrats-org · August 19, 2020, 4:40pm

I think a query over a batch of file-based data sources is a union of per-folder queries. If you can identify which subfolders contain files with the same structure, you could loop through those folders and combine them in a UNION statement. If a single query cannot cover multiple physical dataset folders, you could instead generate a set of SELECT queries with a script, following the Dremio API example for Python, and then union them all together.
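
The scripted approach described above could look roughly like the sketch below. It only builds the SQL text and does not call Dremio. The source name `my_s3`, the bucket name `my_bucket`, and the second folder `folder_11` are hypothetical placeholders, and `UNION ALL` is used here on the assumption that duplicate rows should be kept.

```python
def quote_ident(part: str) -> str:
    """Double-quote one path component, escaping embedded double quotes."""
    return '"' + part.replace('"', '""') + '"'


def build_union_query(source, folders, columns="*"):
    """Return one SELECT per folder path, joined with UNION ALL."""
    if not folders:
        raise ValueError("need at least one folder")
    selects = []
    for folder in folders:
        # Dotted path: source, then each folder level below it.
        path = ".".join(quote_ident(p) for p in [source, *folder])
        selects.append(f"SELECT {columns} FROM {path}")
    return "\nUNION ALL\n".join(selects)


# Hypothetical source, bucket, and folder names, for illustration only.
query = build_union_query(
    "my_s3",
    [
        ["my_bucket", "data", "run", "folder_10"],
        ["my_bucket", "data", "run", "folder_11"],
    ],
)
print(query)
```

The resulting string can then be submitted through whichever SQL interface you already use. Each folder in the list has to be queryable as a dataset for the generated statement to run.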

balaji.ramaswamy · August 23, 2020, 6:28am

@ayaueto

You can promote a folder or file to a dataset at any level using the POST Catalog {id} REST call:

http://docs.dremio.com/rest-api/catalog/post-catalog-id.html

Thanks
Bali
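
As a rough sketch of the promotion call mentioned above, the snippet below only assembles the URL and JSON body and sends nothing. The body fields (`entityType`, `id`, `path`, `type`, `format`) and the `dremio:/...` URL-encoded id are assumptions about the request shape and should be checked against the linked documentation page. The host `http://localhost:9047`, the source `my_s3`, and the bucket `my_bucket` are hypothetical.

```python
import json
from urllib.parse import quote


def build_promotion_request(base_url, path, format_type="JSON"):
    """Build (url, body) for promoting a folder or file to a dataset.

    Assumed request shape; verify field names against the Dremio docs.
    """
    # Assumption: the id is the URL-encoded form of "dremio:/<source>/<path...>".
    path_id = quote("dremio:/" + "/".join(path), safe="")
    url = f"{base_url.rstrip('/')}/api/v3/catalog/{path_id}"
    body = {
        "entityType": "dataset",
        "id": path_id,
        "path": path,
        "type": "PHYSICAL_DATASET",
        "format": {"type": format_type},
    }
    return url, body


# Hypothetical host, source, and bucket names, for illustration only.
url, body = build_promotion_request(
    "http://localhost:9047",
    ["my_s3", "my_bucket", "data", "run", "folder_10"],
)
print(url)
print(json.dumps(body, indent=2))
```

The pair would then be sent as an authenticated POST with whatever HTTP client you prefer. Passing the path of `file_3_10.json` instead of the folder path would target just that one file.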
