Best practices for querying Dremio from an Airflow DAG

Dear Community,

In my company we have a data lake/lakehouse implemented with Dremio (25.2.8).

In my team we would like to query data via Dremio and use this data in an Airflow DAG.

Currently the only working option is to connect via a Microsoft SQL Server where Dremio is configured as a linked server. This is not optimal for more than one reason.

My question now is: what is the best practice for configuring a connection from within Airflow to query Dremio?

After searching I found the article Orchestration of Dremio with Airflow and CRON Jobs | Dremio, which recommends the dremio-simple-query package. But I am unsure, because it seems to be outdated, and the title of the article is “Orchestration of Dremio” (which is not what we want; we just want to get data via Dremio)…
Is there another recommended way, or a predefined Airflow hook (I could not find any yet)?

Thank you in advance!

In the end, Airflow is just an orchestration tool, not an ETL tool. So my advice is not to query and transform SQL data directly in Airflow (even though Airflow has a hook feature to query data from different kinds of databases).

In my company, we use dbt-dremio to transform data, then use Airflow purely for scheduling the dbt job.
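A minimal sketch of that setup (the project path, dag_id and schedule are just placeholders, and it assumes dbt with the dbt-dremio adapter is installed on the Airflow worker; imports are the Airflow 2-style ones):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Airflow only triggers dbt here; dbt-dremio runs the SQL transformations
# inside Dremio itself.
with DAG(
    dag_id="dbt_dremio_daily",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt/my_dremio_project && dbt run --profiles-dir .",
    )
```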

If you just want to query data, why not use the Dremio UI, DBeaver, or a SQL query tool? Using Airflow to query data sounds weird.

@trainrunner What kind of queries are these? If they are aggregated queries, you could probably run them from Tableau or Power BI.

Thank you for your answer @quangbilly79. Indeed, I forgot to mention that we not only want to query the data but also process it. So yes, you are right, an ETL tool would probably be more appropriate. I thought that for “small” ETL jobs Airflow is sufficient and you don’t need the extra layer of an ETL tool.

@balaji.ramaswamy Different kinds of queries. For some DAGs we just have simple (unaggregated) SELECT queries and use Python to further process the data; for others we have aggregated queries. We wanted to use Airflow for automation, since it provides a convenient way of scheduling/orchestrating.

Maybe just wrap everything in Python code (which uses JDBC/pyodbc to query data from Dremio and handles small transformations), then schedule it in Airflow with PythonOperator. What is wrong with that method? I’m pretty sure you can pass in parameters and even store the query’s results for the next step in Airflow using PythonOperator. Also, by coding in Python, you can do basically everything. Something like the sketch below.
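A minimal sketch of what I mean (the DSN, credentials and dataset name are placeholders; it assumes the Dremio ODBC / Arrow Flight SQL ODBC driver is configured on the worker, and uses Airflow 2-style imports):

```python
from datetime import datetime

import pyodbc
from airflow import DAG
from airflow.operators.python import PythonOperator


def query_dremio(**context):
    # "DSN=Dremio" must point to the Dremio ODBC driver configured on the worker;
    # in practice the credentials should come from an Airflow Connection or Variable.
    conn = pyodbc.connect("DSN=Dremio;UID=my_user;PWD=my_password", autocommit=True)
    cursor = conn.cursor()
    cursor.execute('SELECT * FROM my_space."my_dataset" LIMIT 100')
    rows = cursor.fetchall()
    conn.close()
    # Do small transformations in plain Python here; the return value is
    # pushed to XCom for downstream tasks.
    return len(rows)


with DAG(
    dag_id="dremio_query_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="query_dremio", python_callable=query_dremio)
```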

Thank you for the suggestion! Yes, that should be a valid approach. I was just curious whether someone has experience with the dremio-simple-query package, which uses an Arrow Flight connection instead of ODBC. It is mentioned in this article from Dremio (Orchestration of Dremio with Airflow and CRON Jobs | Dremio), but the repo looks a bit outdated. I would expect that by using Arrow Flight we would get better performance when loading larger datasets from Dremio…

You can just check the source code of the library.

It’s quite simple, and I think the flow is roughly like this:

Python → PyArrow FlightClient → Dremio Flight Server (gRPC) → Arrow Table

which I think means it uses PyArrow, as you want, right?
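Stripped down to the essentials, that flow looks roughly like this with plain PyArrow (host, port, credentials and dataset name are placeholders; 32010 is Dremio’s default Flight port):

```python
from pyarrow import flight

# Connect to Dremio's Arrow Flight endpoint (gRPC with TLS in this sketch).
client = flight.FlightClient("grpc+tls://dremio-host:32010")

# Basic auth returns a bearer-token header that must be attached to every call.
bearer = client.authenticate_basic_token("my_user", "my_password")
options = flight.FlightCallOptions(headers=[bearer])

# Ask Dremio to plan the query, then fetch the result stream as an Arrow Table.
descriptor = flight.FlightDescriptor.for_command(
    'SELECT * FROM my_space."my_dataset" LIMIT 100'
)
info = client.get_flight_info(descriptor, options)
table = client.do_get(info.endpoints[0].ticket, options).read_all()

print(table.num_rows)
```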

Update: we ended up writing our own Airflow plugin & hook, similar to the code from the dremio-simple-query package that @quangbilly79 mentioned.

Advantages: abstraction of connection details (like obtaining the auth token) and safe credential access via an Airflow Connection. A stripped-down sketch of the idea is below.
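For anyone interested, this is roughly what such a hook can look like (class and connection names are illustrative, not our exact code; it assumes pyarrow is installed on the worker and uses the Airflow 2-style BaseHook import):

```python
from airflow.hooks.base import BaseHook
from pyarrow import flight


class DremioFlightHook(BaseHook):
    """Queries Dremio over Arrow Flight, reading credentials from an Airflow Connection."""

    def __init__(self, dremio_conn_id: str = "dremio_default"):
        super().__init__()
        self.dremio_conn_id = dremio_conn_id

    def _get_client(self):
        # Host, port, login and password come from the Airflow Connection,
        # so no credentials need to live in DAG code.
        conn = self.get_connection(self.dremio_conn_id)
        client = flight.FlightClient(f"grpc+tls://{conn.host}:{conn.port or 32010}")
        bearer = client.authenticate_basic_token(conn.login, conn.password)
        return client, flight.FlightCallOptions(headers=[bearer])

    def run_query(self, sql: str):
        """Run a SQL statement and return the result as a PyArrow Table."""
        client, options = self._get_client()
        info = client.get_flight_info(flight.FlightDescriptor.for_command(sql), options)
        return client.do_get(info.endpoints[0].ticket, options).read_all()
```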

We placed the customHook.py file under a new folder /plugins/hooks and added the snippets described in Plugins — Airflow 3.1.2 Documentation to register it as a module in Airflow.
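With the plugins folder on the Python path, using the hook from a DAG then looks roughly like this (module, DAG and task names are illustrative; it assumes an __init__.py in the hooks folder and Airflow 2-style operator imports):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
# Airflow adds the plugins folder to sys.path, so the file under
# plugins/hooks/customHook.py can be imported like a normal module.
from hooks.customHook import DremioFlightHook


def load_from_dremio():
    hook = DremioFlightHook(dremio_conn_id="dremio_default")
    table = hook.run_query('SELECT * FROM my_space."my_dataset" LIMIT 100')
    return table.num_rows


with DAG(
    dag_id="dremio_hook_example",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="load_from_dremio", python_callable=load_from_dremio)
```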

Thanks again for the discussion and your support!
