Hello
I’ve been playing with Dremio over the last couple of weeks, and, while the basic stuff works very well, I start banging on a wall when it comes to doing exactly what I want.
Let me give you first a bit of context: I have a very normalized dataset composed of about 50 tables (stored as parquet files generated using pyarrow). In the end my objective is to expose the data to a somewhat not very technical audience, having summarized the data into of about 5 denormalized views on which the end user will only have to make basic actions (filter, group by, sum) to create dashboards (I intend to use PowerBI to expose the data).
Here is what I’ve tried so far:
1/ PowerBI can’t query a dataset unless it has schema. Tables do have a schema in Dremio, but can’t be queried using PowerBI because of the addition of the field $dremio$update$, which exists in the metadata but not in the source parquet files
2/ Hence PowerBI must connect to views. These views do not have meta stored in “INFORMATION_SCHEMA”.“COLUMNS” unless they are created using the web gui
3/ I can’t really use the gui to generate the views as my input schema is somewhat dynamic (some columns can be added or removed in an upstream process)
4/ So I thought I could interact with Dremio programmatically from Python to reflect these changes. I’ve managed so far to:
- refresh Dremio metadata (sending through ODBC connection to Dremio ‘ALTER TABLE schema.table REFRESH METADATA’ sql queries
- delete views: ‘DROP VIEW IF EXISTS source.view’
- create views: ‘CREATE OR REPLACE VIEW source.view AS my_custom_sql’ (by the way the REPLACE doesn’t work, which is why I drop the view beforehand)
However this approach, while making the view accessible in Dremio gui, does not generate the metadata in “INFORMATION_SCHEMA”.“COLUMNS”. I have noticed that I need to edit the query and execute it to have the metadata created.
This frequently generates an error (An Unexpected Error Occurred. If the problem persists, please contact support. Session ID: b962ff52-b501-43d7-aa51-5d3dd617f595)
If I slightly change the query (e.g. adding a space anywhere in the SQL editor) and run it again, it works, and generate the metadata if if save the query.
I am wondering if I have a mean to do that from within my Python code, and to avoid going through the gui to refresh these metadata. I have considered updating the “INFORMATION_SCHEMA”.“COLUMNS” table, but it feels very wrong to do that.
Is there anything you can do to help me solve this issue (either find a way to create these metadata, or maybe propose a completely different approach if I am on a wrong path?)
Thanks in advance for your help
Best regards,
Romain