UNION not removing duplicates


#1

Hi Everyone

Read that Dremio doesn’t support CDC process so I was trying a work around using unions and this post from @can by overwriting and sourcing tables in the $scratch space

But when I union my historical data with file daily file, duplicates are being retained
Is there a way i can remove them ?

Thanks and Regards
vyz


#2

Hi @vyz101

Is there a chance the values that are not getting suppressed may have trailing or leading white spaces ?and Dremio will treat them as different values

Thanks
@balaji.ramaswamy


#3

Hi @balaji.ramaswamy

I didnt have trailing leading or trailing spaces here is a sample of the data with which i had issue with

ATMID ZIP
2000 60647
2001 60648
2002 60649
and they are sourced from csv files But later today working with some tables from Mongo DB I didnt face the issue either
_id Item Qty
XFxAMqgjEYk4AZ5z Watch 12
XFxAUagjEYk4AZ50 Car 23
Is it the unstructuredness of the csv not allowing the dremio to identify the row key ?

#4

we see this alot and basically have to wrap most of these union queries in a subselect with a distinct


#5

I had tried distinct but that didn’t help
Here the query I ran
“SELECT DISTINCT ATMID,ZIP FROM
(Select * FROM ATMDATA
UNION
Select * FROM ATMDATA_CDC)
ORDER BY ATMID ASC”


#6

@vyz101

You are doing a select * from, are you saying every column in the row is an exact duplicate of the other row. Only then union will suppress, even if you have one column that has a different value, it will come back

Thanks
@balaji.ramaswamy