Generating views metadata programmatically?

Hello

I’ve been playing with Dremio over the last couple of weeks, and, while the basic stuff works very well, I start banging on a wall when it comes to doing exactly what I want.

Let me give you first a bit of context: I have a very normalized dataset composed of about 50 tables (stored as parquet files generated using pyarrow). In the end my objective is to expose the data to a somewhat not very technical audience, having summarized the data into of about 5 denormalized views on which the end user will only have to make basic actions (filter, group by, sum) to create dashboards (I intend to use PowerBI to expose the data).

Here is what I’ve tried so far:
1/ PowerBI can’t query a dataset unless it has schema. Tables do have a schema in Dremio, but can’t be queried using PowerBI because of the addition of the field $dremio$update$, which exists in the metadata but not in the source parquet files
2/ Hence PowerBI must connect to views. These views do not have meta stored in “INFORMATION_SCHEMA”.“COLUMNS” unless they are created using the web gui
3/ I can’t really use the gui to generate the views as my input schema is somewhat dynamic (some columns can be added or removed in an upstream process)
4/ So I thought I could interact with Dremio programmatically from Python to reflect these changes. I’ve managed so far to:

  • refresh Dremio metadata (sending through ODBC connection to Dremio ‘ALTER TABLE schema.table REFRESH METADATA’ sql queries
  • delete views: ‘DROP VIEW IF EXISTS source.view’
  • create views: ‘CREATE OR REPLACE VIEW source.view AS my_custom_sql’ (by the way the REPLACE doesn’t work, which is why I drop the view beforehand)
    However this approach, while making the view accessible in Dremio gui, does not generate the metadata in “INFORMATION_SCHEMA”.“COLUMNS”. I have noticed that I need to edit the query and execute it to have the metadata created.
    This frequently generates an error (An Unexpected Error Occurred. If the problem persists, please contact support. Session ID: b962ff52-b501-43d7-aa51-5d3dd617f595)
    If I slightly change the query (e.g. adding a space anywhere in the SQL editor) and run it again, it works, and generate the metadata if if save the query.

I am wondering if I have a mean to do that from within my Python code, and to avoid going through the gui to refresh these metadata. I have considered updating the “INFORMATION_SCHEMA”.“COLUMNS” table, but it feels very wrong to do that.

Is there anything you can do to help me solve this issue (either find a way to create these metadata, or maybe propose a completely different approach if I am on a wrong path?)

Thanks in advance for your help

Best regards,
Romain

Hi Romain,

Someone else may be able to jump in with regards to instructions for best interacting with Dremio programmatically, however I just wanted to let you know that some of your issues using Power BI with dremio are known. The problems relating to the $dremio$update$ addition and the requirement for the query to be edited / executed before metadata will be generated have been identified and fixed for the next release. Hopefully that will make your use case a lot more streamlined.

Hi Romain - have you had a look at the REST API?

https://docs.dremio.com/rest-api/

Thanks for your answers.
Josh: glad to hear that, I’ll check that out as soon as you release your next version
Kelly : I have now, but as far as I understand one can only interact with the sources using the REST API, not with the spaces, neither with the views they contain. Or am I missing something ?

I’ve been trying to understand how the api used by Dremio web app works (apiv2), but I couldn’t get where the newVersion used in the different post requests (e.g. http://localhost:9047/apiv2/datasets/new_untitled_sql?newVersion=0007097579972195 or http://localhost:9047/apiv2/dataset/GMD.AC_CLASS/version/00000161e1171ceb/transformAndPreview?newVersion=0005553268243220&limit=150) comes from. Any idea on how I could retrieve this value ?
Thanks
Romain

Hi Romain,

These API calls are used internally for dataset management and as such this new version value is generated by Dremio. Other than use the REST API, are you able to use the ODBC driver from your python application to run the queries you want?

Hi Josh
I finally found my way out of this apiv2 (even though using it is probably a bad idea, I do it at my own risk). Now it works well end to end, I manage to have the views with their metadata up to date in PowerBI even after a source schema change.
Now I’m trying to find how to do the best use of accelerations, more on that on a separate post.

Hi Romain,

We are targeting our next major release to have a documented SQL REST API endpoint as well as an endpoint for reflection management.

The SQL endpoint will allow submitting SQL queries, monitoring the queries and retrieving the results when they complete.