Hello,
We are evaluating Dremio Enterprise (Using it via AWS Marketplace).
And we are facing some strange behavior.
We have a PDS that have a “change log of events” like this:
ID, STATUS, DATE, dir0, dir1, dir2
1, status1, 2021-09-15, year=2021, month=09, day=15
1, status2, 2021-09-16, year=2021, month=09, day=16
1, status3, 2021-09-20, year=2021, month=09, day=20
2, status1, 2021-09-13, year=2021, month=09, day=13
2, status2, 2021-09-14, year=2021, month=09, day=14
…
This PDS is partitioned using year, month and day.
But we only want the latest row for each ID based on the date column, then we created an VDS using this query:
SELECT ID, STATUS, DATE, dir0, dir1, dir2 FROM
(select *,
row_number() over (partition by ID order by DATE desc) as line_num
from RAW_CHANGELOG_TABLE) nested_0 where line_num = 1
The problem is, when we try to query this VDS using some partition filters, it seems that those filter are not used by the “PDS” from the source.
We always get memory exceeded even if we query something small like:
SELECT ID,STATUS,DATE FROM VDS_DATASET WHERE dir0='year=2021' and dir1='month=09' limit 10;
It seems that every time the query is doing a full scan at the PDS level.
There is some way to solve this problem?
Thanks!