I have data on S3 partitioned in a custom format that is not Hive-compatible. Is there a way to create a table out of several locations on S3?
Examples of locations:
s3://user/name/data-repo/user-repo/10/from_20180101_to_20180201/part-000…parquet
s3://user/name/data-repo/user-repo/10/from_20190101_to_20190501/part-000…parquet
The partition format is custom. Can you please provide an example of how querying that data will work? How can it filter out partitions in this case?
s3://user/name/data-repo/user-repo/<source>/<compaction_period>/part-000…parquet
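To make the question concrete: would something like the sketch below work? This assumes the user-repo folder is promoted as a single dataset and that the directory levels are exposed as implicit dir0/dir1 columns (the dataset name users is just an assumption here):

SELECT user_id, event_id, date_created
FROM users
WHERE dir0 = '10'                          -- first path level: source
  AND dir1 = 'from_20180101_to_20180201'   -- second path level: compaction period
  AND event_id = 100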
The queries we have need to filter data by different sources within specific time periods. For example:
SELECT user_id, event_id, date_created
FROM users
WHERE source=10 AND date_created BETWEEN 1538123952000 AND 1546331952000 AND event_id=100
If I promote some sources to tables, then I will probably be able to distinguish between sources by encoding them in the table name, like users_10. As for compaction periods, I'm not sure how that will work.
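As a sketch of that option, assuming each source folder is promoted separately (so users_10 maps to s3://user/name/data-repo/user-repo/10/) and the compaction period then becomes the first directory level, dir0:

SELECT user_id, event_id, date_created
FROM users_10                                              -- one table per source, here source 10
WHERE dir0 = 'from_20190101_to_20190501'                   -- period folder covering the target range
  AND date_created BETWEEN 1546300800000 AND 1548979200000 -- 2019-01-01 .. 2019-02-01, epoch millis
  AND event_id = 100

The awkward part is that the client would have to work out which compaction-period folders overlap a requested date range before issuing the query.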
The data is partitioned by source and by periods of the date_created column, called "compaction_period" (no compaction-period column is present in the data).
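One workaround I can imagine, as a minimal sketch: reconstruct the period bounds from the folder name inside a virtual dataset, assuming dir0/dir1 pseudo-columns and standard string/date functions are available (the space and view names are hypothetical, and CREATE VDS may be CREATE VIEW depending on the Dremio version):

CREATE VDS space1.users_with_periods AS
SELECT
  user_id,
  event_id,
  date_created,
  CAST(dir0 AS INT)                        AS source,        -- e.g. 10
  TO_DATE(SUBSTR(dir1, 6, 8), 'YYYYMMDD')  AS period_start,  -- 'from_>20180101<_to_...'
  TO_DATE(SUBSTR(dir1, 18, 8), 'YYYYMMDD') AS period_end     -- '..._to_>20180201<'
FROM users

Filters on source and the period columns would then at least be expressible in one place.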
Sorry, I do not understand how it can prune partitions on data that is not partitioned in the Hive partition format. Unless there is a way to define the partition format at the level of additional metadata, as Iceberg does, I do not see a way for Dremio to leverage the fact that the data is partitioned by source and date_created periods.
In other words, the problem here is that the Hive partition format for such data would look like:
s3://user/name/data-repo/user-repo/source=10/date_created=20180101/part-000…parquet
whereas the data we have follows this layout:
s3://user/name/data-repo/user-repo/10/from_20180101_to_20180201/part-000…parquet
Please note that the paths in our data contain neither column names nor equals signs.
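For comparison, a minimal sketch of what a Hive-compatible definition of the same data would look like (Hive DDL; the column and partition names are assumptions, and the partition key is renamed dt because it cannot collide with the existing date_created data column):

CREATE EXTERNAL TABLE users (
  user_id      BIGINT,
  event_id     INT,
  date_created BIGINT
)
PARTITIONED BY (source INT, dt STRING)   -- matches paths like .../source=10/dt=20180101/...
STORED AS PARQUET
LOCATION 's3://user/name/data-repo/user-repo/';

MSCK REPAIR TABLE users;                 -- discover the key=value partition directories

SELECT user_id, event_id, date_created
FROM users
WHERE source = 10
  AND dt BETWEEN '20180928' AND '20190101';  -- partition filter: only matching directories are read

With our layout, none of this path-to-column mapping is available to the engine.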
Do you have a way to programmatically promote an S3 folder to a table? For now, let's focus on option #1.