I have a lot of JSON files (gzipped) inside a folder structure with YEAR/MONTH/DAY/HOUR on AWS S3.
Examples of file paths:
And I created a date variable to make my searches easier, like so:
SELECT TO_DATE(CONCAT(dir0,'-',dir1,'-',dir2), 'YYYY-MM-DD') created_at_day_partition, *
And, in reflection, I market the
created_at_day_partition column to partition data.
Is the technique I used a good or bad approach to take advantage of searches on my folder structure?
Or shoud I use the
dir0, dir1, dir2 columns explicitidly on my queries instead of searching over
@almirb Do your queries typically query on a particular date? Would you be able to send the refresh reflection and query that runds on the VDS profiles?
If your queries are accelerated by the reflection, it does not matter how files are stored originally, they would only be scanned during reflection refresh.
If your queries explicitly use
created_at_day_partition column, then partitioning reflection by this column may improve performance, but not necessarily. Under the hood reflection is nothing else other than an Iceberg table, and even without explicit partitioning Dremio could be pretty efficient by breaking the data into smaller files and using metadata to prune unneeded files during query execution. It worth trying various partitioning strategies with real queries and seeing performance impact.