We have a Physical Dataset (PDS) that are Parquet files that may contain duplicate rows. We have defined a Virtual Dataset (VDS) that references only this PDS and performs (essentially) a SELECT DISTINCT * FROM PDS. We expected that the Raw Reflection built from this VDS would have filtered out the duplicate data.
The behavior we see is that queries against the VDS need to scan the entire dataset, even for simple queries like SELECT * FROM VDS LIMIT 10. I believe that this is due to the fact that the query needs to process the DISTINCT filter every time, but if the Raw Reflection already had filtered out the duplicate data then it wouldn’t need to perform an entire table scan for each query.
Is my theory correct that the DISTINCT is not considered when building the Raw Reflection?
see @balaji.ramaswamysir we have initial load (historical data) and incremental load (latest data), and in that we are having shreyash bhai as a duplicate record i have try distinct and also try to find way to create ID as PK in dremio, but that option is also not there ! please guide us what we do for this situation
@mevadadhruv Either rewrite your SQL and do not do DISTINCT * as that will give both rows as last name is different. If last names have to be same name and data is incorrect then you can fix data, still it will give both rows as “UPDATEDAT” is different. The other option is to only select DISTINCT ID, LASTNAME, CREATEDAT
i have solved this by running nested query this:-
dt1.UPDATEDAT = (
FROM inventory.“dremio_test1”.“duplicate_test” dt2
WHERE dt2.ID = dt1.ID