Hi. i found a bug with aws s3 based dataset filtering by dirN partions starting from dir2 and more.
The bug is that dir2 and deeper work only with = (equals) operator while IN clause and combining using OR: (dir2=1 or dir2=2) is ignored and query return if dir2 was not filtered at all.
My example dataset to reproduce the issue has following s3 folder stucture (3 level):
`d
01 -> 1 -> 1 -> 6 records (1 file)
-> 2 -> 6 records (1 file)
-> 3 -> 6 records (1 file)
2 -> 1 -> 6 records (1 file)
-> 2 -> 6 records (1 file)
-> 3 -> 6 records (1 file)
3 -> 1 -> 6 records (1 file)
-> 2 -> 6 records (1 file)
-> 3 -> 6 records (1 file)`
Queries:
select count() from delete_after_test where dir0=‘01’ and dir1 = ‘1’*
Output: 18 (Expected 18)
select count() from delete_after_test where dir0=‘01’ and dir1 IN (‘1’, ‘2’)*
Output: 36 (Expected 36)
so far so good.
lets play with dir2 level:
select count() from delete_after_test where dir0=‘01’ and dir1 = ‘1’ and dir2 = ‘1’*
Output: 6(Expected 6)
select count() from delete_after_test where dir0=‘01’ and dir1 = ‘1’ and dir2 IN (‘1’, ‘2’)*
Output: 18(Expected 12)
Here is the bug…dir2 filter is ignored if IN is used or any other clause/expression based on dir2.
Here is execution plan:
LogicalAggregate(group=[{0}], EXPR$1=[COUNT()], EXPR$2=[SUM($1)]) LogicalProject(sessionid=[$3], state=[$4]) LogicalFilter(condition=[AND(=($0, '01'), =($1, '1'), OR(=($2, '1'), =($2, '2')))]) ScanCrel(table=[dev_data_lake_raw_darwin.delete_after_test], columns=[`dir0`, `dir1`, `dir2`, `sessionid`, `state`, `data`], splits=[9])
P.S: now to overcome non working IN i am using union all with = predicates in each.
Build 4.0.1-201909191652190301-211720e
Edition Community Edition
Attached last query screenshot.