Non-SQL Transformations

Now that I have Dremio set up, I am trying to integrate my workflows. There is one capability I don’t see and I’m wondering if I’m missing something.

I have data that needs a lot of transformation. Doing it in SQL would be too complex, so I use Python scripts for the purpose. It's important to me to be able to track data history and lineage. Can I do that with Dremio?

It seems to me that to replicate my workflow with Dremio, I need to register my original data as a source, connect to Dremio from Python and run my scripts, then load the transformed data into Dremio as a new dataset. That gets the data into Dremio, but someone opening it can't tell where it came from or how it was transformed; that has to be documented outside Dremio, with some naming convention and a reference to the script in version control. Does that sound right?

Relatedly, it seems Dremio requires users to set up their own Python environments. I have previously worked with Civis Platform, which provides containers for users to run scripts against their data, and which tracks every script executed. Am I right in thinking Dremio doesn't have something like that?

Going through the documentation and some Dremio courses, it seems to me that this is still an open question even after six years.

So, for anyone in the following situation:

  • Dremio is supposed to manage your data governance
  • You have some complex intermediate transformations that are not possible in SQL
  • You want to provide this transformation to multiple consumers

What is the Dremio approach to do this?

@Nikolai-Hlubek

  • Data Governance in Dremio is via RBAC, which is an EE feature. Is your concern that it is not available on CE?
  • Which specific transformations are you not able to express in SQL?
    • "You want to provide this transformation to multiple consumers" - I did not understand the exact ask here; kindly explain and I will try to answer.

Thanks
Bali

Dear Bali

Thanks in advance for your help. I’m currently evaluating Dremio for enterprise adoption at my company. We’re seriously considering a purchase, but I’m trying to clarify a few of our use cases first. So for the record, I understand that RBAC is a paid feature and have no concerns about that.

Here’s the use case I’m trying to solve:

  • We use dbt to load, flatten and clean the data. :check_box_with_check:
  • Next, I want to apply a clustering algorithm (e.g., k-means) on several columns and compute the distance from the cluster centroids. This results in a new column added to the original dataset—same number of rows, just enriched with the clustering output.

Since k-means isn’t natively supported in SQL, I plan to use Python to retrieve the data via Arrow Flight, perform the transformation, and generate the new column.
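To make the plan concrete, here is a minimal sketch of the enrichment step. The Arrow Flight endpoint, credentials, and table name are placeholders, and I've inlined a toy k-means (with NumPy) so the step runs standalone; in practice I'd use scikit-learn:

```python
import numpy as np

# Hypothetical fetch over Arrow Flight (endpoint, credentials, and query are
# placeholders), following the usual pyarrow.flight client pattern:
#
#   from pyarrow import flight
#   client = flight.FlightClient("grpc+tcp://dremio-host:32010")
#   token = client.authenticate_basic_token("user", "password")
#   options = flight.FlightCallOptions(headers=[token])
#   info = client.get_flight_info(
#       flight.FlightDescriptor.for_command('SELECT * FROM lake."events"'),
#       options)
#   table = client.do_get(info.endpoints[0].ticket, options).read_all()
#   features = np.column_stack([table["x"], table["y"]])

# Stand-in feature matrix so the enrichment step below is runnable:
rng = np.random.default_rng(0)
features = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

def kmeans_centroid_distance(X, k=2, iters=50, seed=0):
    """Toy k-means; returns each row's distance to its nearest centroid."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Recompute centroids, keeping the old one if a cluster empties out.
        centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return d.min(axis=1)

# New column: one distance per input row, same row count as the source.
dist = kmeans_centroid_distance(features)
assert len(dist) == len(features)
```

The output is exactly the "enriched" column I described: same number of rows, one extra value per row to attach back to the original dataset.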

My main question is:
What is the recommended Dremio-native (or Dremio-compatible) way to write the transformed data—including the new column—back into the lakehouse? Ideally, I’d like to preserve schema and table structure where possible.

Would love to hear how others handle this, especially when integrating Python-based transformations into a Dremio pipeline.

Thanks!