I have some parquet file stored on S3. Each parquet is a table on Dremio, for legacy reasons. Some of them ( for an update on the data) has a new Column. I need to query the new Column if it exist otherwise return a default value ( not NULL).
I have created something the following query ( based on stackoverflow example) but it seems it don’t work:
SELECT
CASE WHEN EXISTS (
SELECT COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 'test_table' AND COLUMN_NAME = 'Column'
)
THEN (
SELECT COLUMN_NAME
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 'test_table' AND COLUMN_NAME = 'Column'
)
ELSE 'Default' END AS Column
FROM test."test_table"
The solution provide the excution of two queries programmatically. The first check if columkn exist for the table in the INFORMATION_SCHEMA
and the secon add the column to the SELECT
if the column exist otherwise add it with the default value.
Is there a way to do it in Dremio without making two query programatically?
@stefano.castoldi Once Dremio does schema learning, it should automatically learn the new column, fill in values in Parquet files it exists and leave the rest as null, just to test this, what if you run the below query
select * from <application_table_name> where <column_name> is not null
If this returns the right set of rows which means the rows where the column exists then we can write a CASE statement. If this query returns all rows including the ones the column does not exists then check the value of this column when it foes not exists and use that logic
@balaji.ramaswamy I don’t understand what you mean. Maybe my question was a little bit confusing.
I have two parquet file created at different time :
-
first has column1, column2, column3 when I create dremio mapping(table1) i have a table with column1, column2, column3
-
second has column1, column2, column3, column4 when I create dremio mapping (table2) i have a table with column1, column2, column3,column4
If I make your query on table1 I get an error that the column dose not exist so the query is not executed.
select * from table1 where column4 is not null
if I make the query on the second table it works, because the table2 has the column.
Can you provide me an example?
Hi @stefano.castoldi If columns 1,2,3 are same on both Parquet files, can you consider promoting them under one folder as one PDS so that dataset can have 4 columns with columns 1,2,3 having all rows while column 4 will only populate from the 2 parquet file
I know that it’s the solution, but for legacy reason I can’t do that right now. The only way to solve the problem is to it programmatically through the query of INFORMATION_SCHEMA
.