I have data on S3 partitioned in a custom format that is not Hive-compatible. Is there a way to create a table out of several locations on S3?
Examples of locations:
s3://user/name/data-repo/user-repo/10/from_20180101_to_20180201/part-000…parquet
s3://user/name/data-repo/user-repo/10/from_20190101_to_20190501/part-000…parquet
The partition format is custom. Can you please provide an example of how querying that data will work? How can it filter out partitions in this case?
s3://user/name/data-repo/user-repo/<source>/<compaction_period>/part-000…parquet
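To make the question concrete: would something like the sketch below work? This assumes the user-repo folder is promoted as a single dataset and that the directory levels are exposed as implicit dir0/dir1 columns (the dataset name users is just an assumption here):

SELECT user_id, event_id, date_created
FROM users
WHERE dir0 = '10'                          -- first path level: source
  AND dir1 = 'from_20180101_to_20180201'   -- second path level: compaction period
  AND event_id = 100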
The queries we have need to filter data by different sources within specific time periods. For example:
SELECT user_id, event_id, date_created
FROM users
WHERE source=10 AND date_created BETWEEN 1538123952000 AND 1546331952000 AND event_id=100
If I promote some sources to tables, then I will probably be able to distinguish between sources by encoding them in the table name, like users_10. As for compaction periods, I'm not sure how that will work.
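As a sketch of that option, assuming each source folder is promoted separately (so users_10 maps to s3://user/name/data-repo/user-repo/10/) and the compaction period then becomes the first directory level, dir0:

SELECT user_id, event_id, date_created
FROM users_10                                              -- one table per source, here source 10
WHERE dir0 = 'from_20190101_to_20190501'                   -- period folder covering the target range
  AND date_created BETWEEN 1546300800000 AND 1548979200000 -- 2019-01-01 .. 2019-02-01, epoch millis
  AND event_id = 100

The awkward part is that the client would have to work out which compaction-period folders overlap a requested date range before issuing the query.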
The data is partitioned by source and by periods of the date_created column, called "compaction_period" (no compaction-period column is present in the data).
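One workaround I can imagine, as a minimal sketch: reconstruct the period bounds from the folder name inside a virtual dataset, assuming dir0/dir1 pseudo-columns and standard string/date functions are available (the space and view names are hypothetical, and CREATE VDS may be CREATE VIEW depending on the Dremio version):

CREATE VDS space1.users_with_periods AS
SELECT
  user_id,
  event_id,
  date_created,
  CAST(dir0 AS INT)                        AS source,        -- e.g. 10
  TO_DATE(SUBSTR(dir1, 6, 8), 'YYYYMMDD')  AS period_start,  -- 'from_>20180101<_to_...'
  TO_DATE(SUBSTR(dir1, 18, 8), 'YYYYMMDD') AS period_end     -- '..._to_>20180201<'
FROM users

Filters on source and the period columns would then at least be expressible in one place.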
Sorry, I do not understand how it can prune partitions on data that is not partitioned in the Hive partition format. Unless there is a way to define the partition format at the level of additional metadata, as Iceberg does, I do not see a way for Dremio to leverage the fact that the data is partitioned by source and date_created periods.
In other words, the problem here is that the Hive partition format for such data would look like:
s3://user/name/data-repo/user-repo/source=10/date_created=20180101/part-000…parquet
whereas the data we have follows this layout:
s3://user/name/data-repo/user-repo/10/from_20180101_to_20180201/part-000…parquet
Please note that the paths in our data contain neither column names nor equals signs.
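For comparison, a minimal sketch of what a Hive-compatible definition of the same data would look like (Hive DDL; the column and partition names are assumptions, and the partition key is renamed dt because it cannot collide with the existing date_created data column):

CREATE EXTERNAL TABLE users (
  user_id      BIGINT,
  event_id     INT,
  date_created BIGINT
)
PARTITIONED BY (source INT, dt STRING)   -- matches paths like .../source=10/dt=20180101/...
STORED AS PARQUET
LOCATION 's3://user/name/data-repo/user-repo/';

MSCK REPAIR TABLE users;                 -- discover the key=value partition directories

SELECT user_id, event_id, date_created
FROM users
WHERE source = 10
  AND dt BETWEEN '20180928' AND '20190101';  -- partition filter: only matching directories are read

With our layout, none of this path-to-column mapping is available to the engine.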
Do you have a way to programmatically promote an S3 folder to a table? For now, let's focus on option #1.