We’re currently migrating from SQL Server to Dremio Lakehouse. Dremio doesn’t support auto-increment primary key columns, so we need a workaround.
To keep it simple, suppose we have data like this:
MerchantKey  MerchantId  Type
1            1           TypeA
2            1           TypeB
3            2           TypeB
The MerchantId+Type combination is unique, and MerchantKey is a unique auto-increment column assigned per MerchantId+Type combination.
Now I need to replace the MerchantKey column with a column containing a unique value derived from the MerchantId+Type value.
Of course, we could just concatenate MerchantId and Type into a new column, like
“1_TypeA”
“1_TypeB”
“2_TypeB”
which is also unique, but a string key is not optimal; we need an int/long value instead.
I did some research and found that Dremio has a built-in HASH function.
But I wonder:
- Does it output duplicate values for different inputs?
- Does it output different values for the same input when rerun?
We have about 10 million values.
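For a sense of scale, here is a rough birthday-bound estimate of how likely at least one collision is among 10 million distinct inputs. It assumes an idealized, uniformly distributed hash; the 32-bit and 64-bit output widths are assumptions for illustration only, and nothing here describes how Dremio's HASH actually behaves.

```python
import math

def expected_colliding_pairs(n_keys: int, bits: int) -> float:
    """Expected number of colliding pairs for n_keys uniform hashes of width `bits`."""
    return n_keys * (n_keys - 1) / (2.0 * 2.0 ** bits)

def collision_probability(n_keys: int, bits: int) -> float:
    """Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * 2^bits))."""
    # expm1 keeps precision when the probability is tiny.
    return -math.expm1(-expected_colliding_pairs(n_keys, bits))

n = 10_000_000  # roughly the number of MerchantId+Type values
for bits in (32, 64):  # assumed output widths, not confirmed for Dremio
    print(bits, collision_probability(n, bits), expected_colliding_pairs(n, bits))
```

Under that idealized assumption, a 32-bit output would almost certainly contain collisions at this scale, while a 64-bit output would collide with a probability on the order of a few in a million. A low probability is still not a uniqueness guarantee.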
For example, could the cases below happen?
hash(“1_TypeA”) = 100
hash(“1_TypeB”) also = 100? => same hash value for different inputs
hash(“2_TypeB”) = 150 in today’s run
hash(“2_TypeB”) = 200 in tomorrow’s run => different hash value on each run for the same input
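The two concerns above (collisions and run-to-run stability) can be checked empirically for any candidate key function. Below is a minimal Python sketch of such a check. Note that `stable_key` is a hypothetical stand-in built on `hashlib.blake2b` truncated to 64 bits; it is not Dremio's HASH, and `find_collisions` is likewise an illustrative helper.

```python
import hashlib
from collections import defaultdict

def stable_key(merchant_id: int, type_name: str) -> int:
    """Hypothetical stand-in: signed 64-bit key derived from MerchantId+Type."""
    raw = f"{merchant_id}_{type_name}".encode("utf-8")
    # hashlib digests depend only on the input bytes, so reruns give the same value.
    digest = hashlib.blake2b(raw, digest_size=8).digest()
    return int.from_bytes(digest, byteorder="big", signed=True)

def find_collisions(rows):
    """Return {key: inputs} for every key produced by more than one distinct input."""
    seen = defaultdict(set)
    for merchant_id, type_name in rows:
        seen[stable_key(merchant_id, type_name)].add((merchant_id, type_name))
    return {k: v for k, v in seen.items() if len(v) > 1}

rows = [(1, "TypeA"), (1, "TypeB"), (2, "TypeB")]
keys = [stable_key(m, t) for m, t in rows]
print(keys)
print(find_collisions(rows))  # an empty dict means no duplicates in this sample
```

The same kind of duplicate check could be run over the full set of 10 million MerchantId+Type values before relying on any hash-derived key, whichever function produces it.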
The documentation says:
“HASH is a proprietary function that can accept different input expressions of arbitrary Dremio supported data types and returns a signed value. It is not a cryptographic hash function and should not be used as such.”
This makes me worry.
There is no source code or similar available, so I can’t check exactly how the HASH function in Dremio works.