Bad data in TPC-DS catalog_sales.cs_sold_date_sk

I’m trying to run some TPC-DS queries and the date surrogate key in catalog_sales.cs_sold_date_sk contains the wrong data:

select 'catalog_sales.cs_sold_date_sk-min', min(cs_sold_date_sk) from catalog_sales
union
select 'catalog_sales.cs_sold_date_sk-max', max(cs_sold_date_sk) from catalog_sales
union
select 'date_dim.d_date_sk-min', min(d_date_sk) from date_dim
union
select 'date_dim.d_date_sk-max', max(d_date_sk) from date_dim
order by 1

Returns:
|catalog_sales.cs_sold_date_sk-max|86399|
|catalog_sales.cs_sold_date_sk-min|0|
|date_dim.d_date_sk-max|2488070|
|date_dim.d_date_sk-min|2415022|

The expected values are:
|catalog_sales.cs_sold_date_sk-max|2452654|
|catalog_sales.cs_sold_date_sk-min|2450815|
|date_dim.d_date_sk-max|2488070|
|date_dim.d_date_sk-min|2415022|

Any suggestions? Thanks

I assume you are querying the data in the “S3 Sample Source”, is that right? If so, yes: that data does not conform to the TPC-DS specification and should not be used for benchmarking.
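
If you want to confirm how far off the column is in your own copy, a rough check is to count the catalog_sales rows whose cs_sold_date_sk has no matching d_date_sk in date_dim. This is only a sketch using the table and column names from the question; the output column names (total_rows, null_keys, orphan_keys) are made up for illustration. As an aside, 0 to 86399 is exactly the range of seconds in a day, so the column may be holding time-of-day keys rather than date keys, but that is a guess and not something confirmed here.

```sql
-- Count sold-date keys in catalog_sales that do not resolve to a date_dim row.
-- Output column names are illustrative, not part of TPC-DS.
select
  count(*) as total_rows,
  -- rows with no sold-date key at all
  sum(case when cs.cs_sold_date_sk is null then 1 else 0 end) as null_keys,
  -- rows with a key that has no match in date_dim
  sum(case when cs.cs_sold_date_sk is not null
            and d.d_date_sk is null then 1 else 0 end) as orphan_keys
from catalog_sales cs
left join date_dim d
  on cs.cs_sold_date_sk = d.d_date_sk
```

With the min/max values reported above (0 to 86399 against a date_dim range starting at 2415022), every non-null key should show up as an orphan.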

Hey Steven! Yes, that’s right, it’s the “S3 Sample Source”. Thanks for responding.