How to add an Amazon S3 data source via REST API?

I have CSV files in a directory of an S3 bucket. I would like to use all of the files as a single table in Dremio, I think this is possible as long as each file has the same header/columns as the others.

Do I need to first add an Amazon S3 data source using the web UI, as shown in the docs, or can I add one as a source using the Catalog API? (I’d prefer the latter.) The REST API documentation doesn’t provide a clear example of how to do this (or I just didn’t understand it), and I have been unable to find the “New Amazon S3 Source” configuration screen shown in the documentation, perhaps because I’m not logged in as an administrator?

For example, let’s say I have a dataset split over two CSV files in an S3 bucket named examplebucket within a directory named datadir:

  s3://examplebucket/datadir/part_0.csv
  s3://examplebucket/datadir/part_1.csv

Do I somehow set the S3 bucket/path s3://examplebucket/datadir as a data source and then promote each of the files contained therein (part_0.csv and part_1.csv) as a Dataset? If so then is that sufficient to allow all the files to be used as a single table?

Thanks in advance for any suggestions.


Try using “rootPath”. Also, if part_0.csv and part_1.csv come from the same ETL job and have the same schema, they should go under a folder with a relevant name, and you should promote that folder. If not, promote them as separate datasets.



You can use POST catalog/ to create the S3 source in Dremio via the API. For example (using cURL):

curl --request POST \
  --url 'http://localhost:9047/api/v3/catalog/' \
  --header 'authorization: _dremio{authorization token}' \
  --header 'content-type: application/json' \
  --data '{
  "entityType": "source",
  "config": {
    "accessKey": "your S3 access key here",
    "accessSecret": "your S3 access secret here",
    "secure": false,
    "allowCreateDrop": true,
    "rootPath": "/",
    "credentialType": "ACCESS_KEY",
    "enableAsync": true,
    "compatibilityMode": false,
    "isCachingEnabled": true,
    "maxCacheSpacePct": 100,
    "requesterPays": false,
    "enableFileStatusCheck": true
  },
  "type": "S3",
  "name": "testing-S3",
  "metadataPolicy": {
    "authTTLMs": 86400000,
    "namesRefreshMs": 3600000,
    "datasetRefreshAfterMs": 3600000,
    "datasetExpireAfterMs": 10800000,
    "datasetUpdateMode": "PREFETCH_QUERIED",
    "deleteUnavailableDatasets": true,
    "autoPromoteDatasets": false
  },
  "accelerationGracePeriodMs": 10800000,
  "accelerationRefreshPeriodMs": 3600000,
  "accelerationNeverExpire": false,
  "accelerationNeverRefresh": false,
  "allowCrossSourceSelection": false,
  "accessControlList": {},
  "permissions": [],
  "checkTableAuthorizer": true
}'

Note that the rootPath here is set to / so you will see all the buckets in this S3 account that the credentials have access to. Then, as @balaji.ramaswamy noted, assuming that part_0.csv and part_1.csv have the same schema, you can promote (format) the datadir folder to a physical dataset which will contain records from both of the files.
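If you would rather script this than use raw cURL, here is a minimal Python sketch using only the standard library. It logs in to obtain the token used in the "_dremio{token}" Authorization header and posts a trimmed-down version of the source definition above. The host URL, user name, password, and the /apiv2/login endpoint are assumptions based on common Dremio setups; check them against your own deployment before relying on this.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9047"  # placeholder: your Dremio coordinator URL


def login(username: str, password: str) -> str:
    """Log in via Dremio's v2 login endpoint (an assumption; verify for your
    version) and return the token used in the Authorization header."""
    req = urllib.request.Request(
        f"{BASE_URL}/apiv2/login",
        data=json.dumps({"userName": username, "password": password}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["token"]


def s3_source_payload(name: str, access_key: str, access_secret: str) -> dict:
    """Build a minimal S3 source definition, mirroring the cURL example
    in this thread (only the essential config fields are included here)."""
    return {
        "entityType": "source",
        "type": "S3",
        "name": name,
        "config": {
            "accessKey": access_key,
            "accessSecret": access_secret,
            "credentialType": "ACCESS_KEY",
            "rootPath": "/",
            "secure": False,
        },
    }


def create_source(token: str, payload: dict) -> dict:
    """POST the source definition to the Catalog API and return the response."""
    req = urllib.request.Request(
        f"{BASE_URL}/api/v3/catalog",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"_dremio{token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Usage would be along the lines of create_source(login("admin", "password"), s3_source_payload("testing-S3", key, secret)), run by a user with admin rights.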


Thanks @ben and @balaji.ramaswamy, very helpful answers.

I have tried the suggested approach for adding the S3 data source (POST with the catalog endpoint), and I was unsuccessful due to a permissions issue. Maybe this is restricted to admin users only?

Once I have this worked out and s3://mybucket has been added as an Amazon S3 data source entity, how do I then promote an entire folder of CSV files in that bucket as a single physical dataset in Dremio? (The CSV files will all have the same schema and will be named appropriately for the dataset, as advised above.)


  • Yes, adding a source is only available to admins
  • To promote a dataset, you can use the UI or the API
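As a sketch of the API route for promotion (not verified against your Dremio version, so treat the field names as assumptions): you first look up the folder's id with GET /api/v3/catalog/by-path/{source}/{path}, then POST a dataset body to /api/v3/catalog/{url-encoded id} whose "format" block describes the CSV layout. The helper below only builds that body; the id and path values shown are hypothetical placeholders following the names used in this thread.

```python
def promote_folder_payload(folder_id: str, path: list) -> dict:
    """Body for POST /api/v3/catalog/{url-encoded folder_id} that promotes a
    folder of same-schema CSV files to one physical dataset. Field names are
    based on the Dremio Catalog API docs and may differ between versions."""
    return {
        "entityType": "dataset",
        "id": folder_id,
        "path": path,
        "type": "PHYSICAL_DATASET",
        "format": {
            "type": "Text",         # CSV files use Dremio's "Text" format
            "fieldDelimiter": ",",
            "extractHeader": True,  # treat the first line of each file as the header
        },
    }


# Hypothetical usage: the real id comes from GET /api/v3/catalog/by-path/...
body = promote_folder_payload(
    "dremio:/testing-S3/mybucket/datadir",   # placeholder id
    ["testing-S3", "mybucket", "datadir"],
)
```

After a successful POST, the folder should appear as a single table containing the rows from every CSV file it holds.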
