Aws s3 dataset query bug (not filtering dir2..n (dir0, dir2 work fine)

vladislav-stolyarov · March 6, 2020, 12:38pm

Hi. i found a bug with aws s3 based dataset filtering by dirN partions starting from dir2 and more.
The bug is that dir2 and deeper work only with = (equals) operator while IN clause and combining using OR: (dir2=1 or dir2=2) is ignored and query return if dir2 was not filtered at all.
My example dataset to reproduce the issue has following s3 folder stucture (3 level):
`d

    01 -> 1 -> 1 ->  6 records (1 file)
             -> 2 -> 6 records (1 file)
             -> 3 -> 6 records (1 file)
          2 -> 1 -> 6 records (1 file)
             -> 2 -> 6 records (1 file)
             -> 3 -> 6 records (1 file)
          3 -> 1 -> 6 records (1 file)
             -> 2 -> 6 records (1 file)
             -> 3 -> 6 records (1 file)`

Queries:
select count() from delete_after_test where dir0=‘01’ and dir1 = ‘1’*
Output: 18 (Expected 18)
select count() from delete_after_test where dir0=‘01’ and dir1 IN (‘1’, ‘2’)*
Output: 36 (Expected 36)
so far so good.

lets play with dir2 level:
select count() from delete_after_test where dir0=‘01’ and dir1 = ‘1’ and dir2 = ‘1’*
Output: 6(Expected 6)
select count() from delete_after_test where dir0=‘01’ and dir1 = ‘1’ and dir2 IN (‘1’, ‘2’)*
Output: 18(Expected 12)
Here is the bug…dir2 filter is ignored if IN is used or any other clause/expression based on dir2.
Here is execution plan:

 LogicalAggregate(group=[{0}], EXPR$1=[COUNT()], EXPR$2=[SUM($1)])
      LogicalProject(sessionid=[$3], state=[$4])
        LogicalFilter(condition=[AND(=($0, '01'), =($1, '1'), OR(=($2, '1'), =($2, '2')))])
          ScanCrel(table=[dev_data_lake_raw_darwin.delete_after_test], columns=[`dir0`, `dir1`, `dir2`, `sessionid`, `state`, `data`], splits=[9])

P.S: now to overcome non working IN i am using union all with = predicates in each.

Build 4.0.1-201909191652190301-211720e
Edition Community Edition

Attached last query screenshot.

Topic		Replies	Views
S3 data dir0 is null in 2/3rds of the records	1	1053	October 8, 2018
Couldn’t do query on a physical dataset with special character in the table name	5	1318	March 19, 2021
Complex Row Level Filter is not getting expected results as create function is failed	0	742	September 16, 2022
Partitioned Datasets	2	1497	August 23, 2020
Wildcard queries on other sources besides Elastic	2	1366	March 11, 2019

Aws s3 dataset query bug (not filtering dir2..n (dir0, dir2 work fine)

Related topics