How to add an Amazon S3 data source via REST API?

I have CSV files in a directory of an S3 bucket. I would like to use all of the files as a single table in Dremio, I think this is possible as long as each file has the same header/columns as the others.

Do I need to first add an Amazon S3 data source using the web UI, as shown in the docs, or can I add one as a source using the Catalog API? (I’d prefer the latter.) The REST API documentation doesn’t provide a clear example of how to do this (or I just didn’t understand it), and I have been unable to find the “New Amazon S3 Source” configuration screen shown in the documentation, perhaps because I’m not logged in as an administrator?

For example, let’s say I have a dataset split over two CSV files in an S3 bucket named examplebucket within a directory named datadir:

  s3://examplebucket/datadir/part_0.csv
  s3://examplebucket/datadir/part_1.csv

Do I somehow set the S3 bucket/path s3://examplebucket/datadir as a data source and then promote each of the files contained therein (part_0.csv and part_1.csv) as a Dataset? If so then is that sufficient to allow all the files to be used as a single table?

Thanks in advance for any suggestions.


Try using “rootPath”. Also, if part_0.csv and part_1.csv come from the same ETL job and have the same schema, they should go under a folder with a relevant name, and you should promote that folder. If not, promote them as separate datasets.



You can use POST catalog/ to create the S3 source in Dremio via the API. For example (using cURL):

curl --request POST \
  --url 'http://localhost:9047/api/v3/catalog/' \
  --header 'authorization: _dremio{authorization token}' \
  --header 'content-type: application/json' \
  --data '{
  "entityType": "source",
  "config": {
    "accessKey": "your S3 access key here",
    "accessSecret": "your S3 access secret here",
    "secure": false,
    "allowCreateDrop": true,
    "rootPath": "/",
    "credentialType": "ACCESS_KEY",
    "enableAsync": true,
    "compatibilityMode": false,
    "isCachingEnabled": true,
    "maxCacheSpacePct": 100,
    "requesterPays": false,
    "enableFileStatusCheck": true
  },
  "type": "S3",
  "name": "testing-S3",
  "metadataPolicy": {
    "authTTLMs": 86400000,
    "namesRefreshMs": 3600000,
    "datasetRefreshAfterMs": 3600000,
    "datasetExpireAfterMs": 10800000,
    "datasetUpdateMode": "PREFETCH_QUERIED",
    "deleteUnavailableDatasets": true,
    "autoPromoteDatasets": false
  },
  "accelerationGracePeriodMs": 10800000,
  "accelerationRefreshPeriodMs": 3600000,
  "accelerationNeverExpire": false,
  "accelerationNeverRefresh": false,
  "allowCrossSourceSelection": false,
  "accessControlList": {},
  "permissions": [],
  "checkTableAuthorizer": true
}'

Note that the rootPath here is set to / so you will see all the buckets in this S3 account that the credentials have access to. Then, as @balaji.ramaswamy noted, assuming that part_0.csv and part_1.csv have the same schema, you can promote (format) the datadir folder to a physical dataset which will contain records from both of the files.
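If you would rather script this than use raw cURL, here is a minimal Python sketch using only the standard library. It logs in to obtain the token used in the "_dremio{token}" Authorization header and posts a trimmed-down version of the source definition above. The host URL, user name, password, and the /apiv2/login endpoint are assumptions based on common Dremio setups; check them against your own deployment before relying on this.

```python
import json
import urllib.request

BASE_URL = "http://localhost:9047"  # placeholder: your Dremio coordinator URL


def login(username: str, password: str) -> str:
    """Log in via Dremio's v2 login endpoint (an assumption; verify for your
    version) and return the token used in the Authorization header."""
    req = urllib.request.Request(
        f"{BASE_URL}/apiv2/login",
        data=json.dumps({"userName": username, "password": password}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["token"]


def s3_source_payload(name: str, access_key: str, access_secret: str) -> dict:
    """Build a minimal S3 source definition, mirroring the cURL example
    in this thread (only the essential config fields are included here)."""
    return {
        "entityType": "source",
        "type": "S3",
        "name": name,
        "config": {
            "accessKey": access_key,
            "accessSecret": access_secret,
            "credentialType": "ACCESS_KEY",
            "rootPath": "/",
            "secure": False,
        },
    }


def create_source(token: str, payload: dict) -> dict:
    """POST the source definition to the Catalog API and return the response."""
    req = urllib.request.Request(
        f"{BASE_URL}/api/v3/catalog",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"_dremio{token}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

Usage would be along the lines of create_source(login("admin", "password"), s3_source_payload("testing-S3", key, secret)), run by a user with admin rights.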


Thanks @ben and @balaji.ramaswamy, very helpful answers.

I have tried the suggested approach for adding the S3 data source (POST with the catalog endpoint), and I was unsuccessful due to a permissions issue. Maybe this is restricted to admin users only?

Once I have this worked out and s3://mybucket has been added as an Amazon S3 data source entity, how do I then promote an entire folder of CSV files in that bucket as a single physical dataset in Dremio? (The CSV files will all have the same schema and will be named appropriately for the dataset, as advised above.)


  • Yes, adding a source is only available to admins
  • To promote a dataset, you can use the UI or the API
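As a sketch of the API route for promotion (not verified against your Dremio version, so treat the field names as assumptions): you first look up the folder's id with GET /api/v3/catalog/by-path/{source}/{path}, then POST a dataset body to /api/v3/catalog/{url-encoded id} whose "format" block describes the CSV layout. The helper below only builds that body; the id and path values shown are hypothetical placeholders following the names used in this thread.

```python
def promote_folder_payload(folder_id: str, path: list) -> dict:
    """Body for POST /api/v3/catalog/{url-encoded folder_id} that promotes a
    folder of same-schema CSV files to one physical dataset. Field names are
    based on the Dremio Catalog API docs and may differ between versions."""
    return {
        "entityType": "dataset",
        "id": folder_id,
        "path": path,
        "type": "PHYSICAL_DATASET",
        "format": {
            "type": "Text",         # CSV files use Dremio's "Text" format
            "fieldDelimiter": ",",
            "extractHeader": True,  # treat the first line of each file as the header
        },
    }


# Hypothetical usage: the real id comes from GET /api/v3/catalog/by-path/...
body = promote_folder_payload(
    "dremio:/testing-S3/mybucket/datadir",   # placeholder id
    ["testing-S3", "mybucket", "datadir"],
)
```

After a successful POST, the folder should appear as a single table containing the rows from every CSV file it holds.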
