How can I optimize my searches using my JSON folder structure?

almirb · November 24, 2023, 9:30pm

Hi!
I have a lot of JSON files (gzipped) inside a folder structure with YEAR/MONTH/DAY/HOUR on AWS S3.

Examples of file paths:

2023/08/01/00_00_to_00_59.gz
2023/08/01/01_00_to_01_59.gz
2023/08/01/02_00_to_02_59.gz

And I created a date variable to make my searches easier, like so:

SELECT TO_DATE(CONCAT(dir0,'-',dir1,'-',dir2), 'YYYY-MM-DD') created_at_day_partition, *
FROM analytics

And, in reflection, I market the created_at_day_partition column to partition data.

Is the technique I used a good or bad approach to take advantage of searches on my folder structure?
Or shoud I use the dir0, dir1, dir2 columns explicitidly on my queries instead of searching over created_at_day_partition column?

Thanks!

balaji.ramaswamy · November 28, 2023, 7:06am

@almirb Do your queries typically query on a particular date? Would you be able to send the refresh reflection and query that runds on the VDS profiles?

igor.lobanov · November 29, 2023, 4:45pm

If your queries are accelerated by the reflection, it does not matter how files are stored originally, they would only be scanned during reflection refresh.

If your queries explicitly use created_at_day_partition column, then partitioning reflection by this column may improve performance, but not necessarily. Under the hood reflection is nothing else other than an Iceberg table, and even without explicit partitioning Dremio could be pretty efficient by breaking the data into smaller files and using metadata to prune unneeded files during query execution. It worth trying various partitioning strategies with real queries and seeing performance impact.

Topic		Replies	Views
Large Reflection creation, speed and performance	4	2238	April 16, 2019
Cannot use partitions filters using VDS	4	973	September 29, 2021
S3 folders with extreme amount of files at the top level (not partitioned parquet-style) Dremio University	1	1776	November 3, 2020
Partitioned Datasets	2	1504	August 23, 2020
Hit a system limit..... how do I work around it?	4	1452	October 5, 2020

How can I optimize my searches using my JSON folder structure?

Related topics