Reflection doesn't seem to honor DISTINCT in Virtual Dataset

phillip · March 2, 2022, 5:12am

Hello,

We have a Physical Dataset (PDS) that are Parquet files that may contain duplicate rows. We have defined a Virtual Dataset (VDS) that references only this PDS and performs (essentially) a SELECT DISTINCT * FROM PDS. We expected that the Raw Reflection built from this VDS would have filtered out the duplicate data.

The behavior we see is that queries against the VDS need to scan the entire dataset, even for simple queries like SELECT * FROM VDS LIMIT 10. I believe that this is due to the fact that the query needs to process the DISTINCT filter every time, but if the Raw Reflection already had filtered out the duplicate data then it wouldn’t need to perform an entire table scan for each query.

Is my theory correct that the DISTINCT is not considered when building the Raw Reflection?

balaji.ramaswamy · March 8, 2022, 10:06pm

@phillip I kind of remember something like this, would you be able to pass on the profile for me to validate?

mevadadhruv · September 11, 2023, 12:25pm

so do we have any solutions for this or not cause i having same?

balaji.ramaswamy · September 11, 2023, 8:29pm

@mevadadhruv Can you please send me the profile with and without DISTINCT that generates same set of records?

mevadadhruv · September 12, 2023, 9:37am

see @balaji.ramaswamysir we have initial load (historical data) and incremental load (latest data), and in that we are having shreyash bhai as a duplicate record i have try distinct and also try to find way to create ID as PK in dremio, but that option is also not there ! please guide us what we do for this situation

balaji.ramaswamy · September 14, 2023, 4:03pm

@mevadadhruv Dremio does not have Primary key implementation, DISTINCT is still getting both rows as all columns need to be same, in your case LASTNAME and UPDATEDAT are different

mevadadhruv · September 15, 2023, 9:59am

@balaji.ramaswamy so please suggest us a solution for this becuase we are stuck in this .

balaji.ramaswamy · September 20, 2023, 5:21pm

@mevadadhruv Either rewrite your SQL and do not do DISTINCT * as that will give both rows as last name is different. If last names have to be same name and data is incorrect then you can fix data, still it will give both rows as “UPDATEDAT” is different. The other option is to only select DISTINCT ID, LASTNAME, CREATEDAT

mevadadhruv · October 9, 2023, 11:59am

@balaji.ramaswamy

still i having this kind of data please help us!

balaji.ramaswamy · October 10, 2023, 3:56am

@mevadadhruv

Only your ID and FIRSTNAME column values for the first 2 rows are common, if you see the CREATEDAT and UPDATEDAT are different so that is why DISTINCT is bringing back both rows

SELECT DISTINCT ID, FIRSTNAME should give only one of them

mevadadhruv · October 11, 2023, 12:04pm

@balaji.ramaswamy sorry to disturb you many time but, it’s showing the same thing please help on this

mevadadhruv · October 12, 2023, 9:10am

i have solved this by running nested query this:-
SELECT
dt1.ID,
dt1.FIRSTNAME,
dt1.LASTNAME,
dt1.CREATEDAT,
dt1.UPDATEDAT
FROM
inventory.“dremio_test1”.“duplicate_test” dt1
WHERE
dt1.UPDATEDAT = (
SELECT MAX(dt2.UPDATEDAT)
FROM inventory.“dremio_test1”.“duplicate_test” dt2
WHERE dt2.ID = dt1.ID
);

mevadadhruv · October 12, 2023, 9:11am

thanks for support but please support on arrow flight driver part
In dremio to Qlik Sense, “Arrow Flight SQL Driver” - Dremio

Topic		Replies	Views
Duplicated results when I create a RAW reflection over a VDS	4	1144	December 13, 2018
Reflection on VD using subset of PD data	2	1051	August 8, 2019
Raw reflection not used	13	3084	October 10, 2019
Reflections not used	7	737	June 6, 2023
Virtual Dataset shows stale data	15	3673	September 24, 2018

Reflection doesn't seem to honor DISTINCT in Virtual Dataset

Related topics