Spark with Dremio?

Is it possible to use Spark in conjunction with Dremio? My team and I are not especially adept with SQL, but we're quite familiar with Python/Spark/Databricks. We'd like to leverage as much Spark-Dremio integration as possible, within reason. The task is a basic ETL pipeline: import dataset(s), join and merge with other data sources, and export the final table(s). We'd like to do that middle step with Spark/Databricks rather than with SQL queries in order to 1) get better parallelism/performance and 2) use PySpark rather than SQL.

Thanks in advance for any suggestions, and please correct me where I’m wrong on any assumptions or if there’s a “best practice” way of doing this kind of thing. I’ve not found any good tutorials or articles on it but maybe I haven’t looked in the right spots.

If I understand correctly after watching the video here, I will need to use Arrow Flight to facilitate connectivity between the Dremio cluster and the Spark cluster. Is there example code that shows how this can be done?

@monocongo

Do you want to use Dremio to write Parquet files onto the lake? How about using CTAS?

http://docs.dremio.com/sql-reference/sql-commands/tables.html
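
For example, a CTAS can be issued from Python, e.g. through the Dremio ODBC driver with pyodbc. This is just a rough sketch of one possible route, not the only way to run it; the DSN name, credentials, and table paths are placeholders you'd replace with your own:

```python
# Sketch: issuing a CTAS against Dremio from Python via ODBC.
# Assumes the Dremio ODBC driver is installed and configured as a DSN
# (hypothetically named "Dremio" here); source/space names and table
# paths below are placeholders.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=your_user;PWD=your_password", autocommit=True)
cursor = conn.cursor()

# CTAS writes the query result out as a table in the target source/space,
# per the SQL command docs linked above.
cursor.execute("""
    CREATE TABLE mylake.output.final_table AS
    SELECT a.id, a.amount, b.region
    FROM   mylake.raw.orders a
    JOIN   mylake.raw.customers b ON a.customer_id = b.id
""")

cursor.close()
conn.close()
```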

Thanks for the suggestion, @balaji.ramaswamy

My question is about how to use datasets that live in Dremio within a Databricks (or vanilla Spark) ETL application/process. Our team has a new project that will need to process data that is available in Dremio. We'd like to continue using Python/Spark/Databricks since we're familiar with it, but in Dremio it seems that you can only interact with the data using SQL. In itself this isn't a deal breaker. However, we're unfamiliar with Dremio, and it's not yet obvious to us how to implement unit tests, continuous integration, and version control in an environment such as Dremio. For this reason we assume it may be easier to treat the datasets available in Dremio as just another data source for Spark dataframes. We're not doing BI or one-off queries, which look to be Dremio's forte; instead we're doing non-interactive batch ETL jobs, and these seem better suited to Spark/Databricks.

Using Spark in conjunction with Dremio looks to be a new frontier with little in the way of documentation. Arrow experts may be able to cook up a custom solution, but to us it appears to be wizardry.

Am I mistaken in assuming that it's tricky to set up unit tests and CI/CD for an SQL-based system such as Dremio? We have all of this and more set up for our Python/Spark projects, so we're reluctant to adopt a different approach if we can stick with what we have and just use Dremio as another data source for our Spark jobs. But I don't yet know enough about Dremio to determine whether this will work at all, and, if it can, how hard it would be to stand up.

This blog post describes using Spark and Dremio in conjunction via JDBC. Is this the only way forward without writing some sort of custom Arrow Flight connector? Does the JDBC connection act as a bottleneck, eliminating some of the benefits of Arrow/Dremio?
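
For concreteness, my understanding is that the JDBC approach from that post amounts to something like this in PySpark. This is only a sketch: the host, dataset path, and credentials are placeholders, and it assumes the Dremio JDBC driver jar is on the Spark classpath:

```python
# Sketch of reading a Dremio dataset into Spark over JDBC.
# Host, port, dataset path, and credentials are placeholders; the URL form
# and driver class below are what the legacy Dremio JDBC driver uses.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dremio-jdbc-read").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:dremio:direct=dremio-coordinator:31010")
    .option("driver", "com.dremio.jdbc.Driver")
    .option("dbtable", '"mylake"."raw"."orders"')  # any table or view visible in Dremio
    .option("user", "your_user")
    .option("password", "your_password")
    .load()
)

df.printSchema()
```

From what I can tell, unless you supply Spark's JDBC partitioning options (partitionColumn/lowerBound/upperBound/numPartitions), the read goes through a single connection, which is exactly the kind of bottleneck I'm worried about.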

You can read data over Flight into a DataFrame. PyArrow should be available in the Databricks environment; if not, you can install it during deployment. Here are Python and Java examples: GitHub - dremio-hub/arrow-flight-client-examples
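
A minimal sketch of what that looks like with PyArrow, along the lines of the dremio-hub examples. The hostname, credentials, and dataset path are placeholders; 32010 is Dremio's default Flight port:

```python
# Sketch: reading a Dremio dataset over Arrow Flight with PyArrow.
from pyarrow import flight

client = flight.FlightClient("grpc+tcp://dremio-coordinator:32010")

# Basic auth returns a bearer-token header to pass on subsequent calls.
token = client.authenticate_basic_token("your_user", "your_password")
options = flight.FlightCallOptions(headers=[token])

query = 'SELECT * FROM "mylake"."raw"."orders"'
info = client.get_flight_info(flight.FlightDescriptor.for_command(query), options)
reader = client.do_get(info.endpoints[0].ticket, options)

arrow_table = reader.read_all()        # pyarrow.Table
pandas_df = arrow_table.to_pandas()    # hand off to pandas ...

# ... and, inside Databricks, to a Spark DataFrame:
# spark_df = spark.createDataFrame(pandas_df)
```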

You can also look at this project: GitHub - rymurr/flight-spark-source


Another one - GitHub - qwshen/spark-flight-connector: A Spark Connector that reads data from / writes data to Flight end-points with Arrow-Flight and Flight-SQL