Embed Spark into Dremio

Hi,
I would suggest embedding Spark alongside Dremio on the different nodes in order to build a great data integration platform à la Databricks. The latter has developed the DBIO layer, which is quite similar to Dremio's abstraction layer, but Dremio goes further by offering tools like data prep, data curation, and so on. It would be even more interesting if deploying on Kubernetes became possible, enabling de facto dynamic scalability for the compute and thereby improving performance without having to statically deploy a large number of nodes, which is especially useful in a cloud context.
To summarize :slight_smile:

  • Dremio
  • Spark
  • Zeppelin/Jupyter
  • Kubernetes/Docker
  • Superset
    = the killer platform

Another useful feature: the ability to generate Spark code from the data transformations applied to virtual datasets, as Dataiku does.
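As a purely hypothetical illustration (this feature does not exist in Dremio today, and all names are made up), a virtual dataset defined as a filter plus a renamed column might translate to generated PySpark code along these lines:

```python
# Hypothetical output of a "virtual dataset -> Spark code" generator.
# The source path, column names, and filter are illustrative placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("generated-vds").getOrCreate()

# Virtual dataset: SELECT cust_id AS customer_id, amount
#                  FROM sales WHERE amount > 100
raw = spark.read.parquet("/data/sales")           # physical dataset
vds = (raw
       .filter(F.col("amount") > 100)             # WHERE clause
       .withColumnRenamed("cust_id", "customer_id")
       .select("customer_id", "amount"))          # projection
vds.show()
```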

The latest one:
Microsoft, among others, has integrated machine learning into SQL Server; see for example https://docs.microsoft.com/en-us/sql/advanced-analytics/r/sqldev-train-and-save-a-model-using-t-sql?view=sql-server-2017.
It would be nice to use similar techniques in the Dremio SQL editor to avoid having to use, for instance, Spark or H2O outside Dremio. Could Spark perhaps even be used inside a SQL query in the SQL editor?
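For context, here is a sketch of the outside-Dremio round trip that in-engine ML would remove (assuming the Dremio ODBC driver is installed; the driver name, host, credentials, and dataset are placeholders):

```python
# Today's workflow: pull data out of Dremio, train elsewhere.
import pyodbc
import pandas as pd
from sklearn.linear_model import LogisticRegression

# The driver name varies by platform/version; shown here as a placeholder.
conn = pyodbc.connect(
    "DRIVER={Dremio Connector};HOST=localhost;PORT=31010;"
    "UID=user;PWD=secret", autocommit=True)

# Fetch a curated virtual dataset out of Dremio into pandas...
df = pd.read_sql('SELECT * FROM "space"."customers"', conn)

# ...then train with an external library (here scikit-learn).
X, y = df.drop(columns=["churned"]), df["churned"]
model = LogisticRegression().fit(X, y)
```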
Xavier

In our future roadmap, we are already considering other execution engines such as Spark. However, it may be worth mentioning that one major disadvantage of Spark, and one advantage of what we're doing today with Arrow, is the extremely low latency we strive for. Because of this, our current execution can be much more performant than Spark.
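To make the columnar point concrete, here is a minimal pyarrow sketch of the Arrow in-memory format our execution is built around (illustrative only; the engine itself is not written in Python):

```python
import pyarrow as pa

# Each column lives in its own contiguous buffer, which is what makes
# vectorized, cache-friendly, low-latency execution possible.
batch = pa.RecordBatch.from_arrays(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])],
    ["id", "label"])

print(batch.schema)
print(batch.column(0))   # the whole "id" column, no row-by-row access
```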

We currently have out-of-the-box support for Jupyter notebooks.

Kubernetes/Docker are both technically feasible today. We have plans to automate this in our UI in the future, but nothing is preventing you from doing it now. Also, IIRC, there are actually some in the community deploying it today :slight_smile:

What would be the advantages of Spark, apart from distributed machine learning?
Should Spark use Apache Arrow (beyond the Python integration) to accelerate its processing?

Some time ago I saw a project that uses Arrow for columnar processing; maybe it could be used for the Dremio engine in the future?

Spark use cases are fundamentally different. An example/advantage of using Spark is when you are running predictive models, leveraging Spark MLlib.
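For instance, a minimal PySpark sketch of that predictive-model use case (the data and column names are made up):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-demo").getOrCreate()

# Toy training data: label plus two numeric features.
df = spark.createDataFrame(
    [(0.0, 1.2, 0.7), (1.0, 3.4, 2.1), (0.0, 0.4, 0.3), (1.0, 2.9, 1.8)],
    ["label", "f1", "f2"])

# Assemble feature columns into a vector and fit a model.
features = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
model = LogisticRegression().fit(features.transform(df))
print(model.coefficients)
```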

The top project that uses Arrow for columnar processing is probably Dremio :wink:

What you say is interesting, because it implies that the initial scope of Spark is being reduced to ML (maybe DL in the future, even though there are plenty of DL frameworks today) and graph processing. How do you see the future of big data processing, then?

One way to think of Spark+Dremio is using Spark as a client and Dremio as the data access platform underneath. Currently the integration is not ideal, but we hope to soon make an Arrow RPC available which would allow for a very nice, efficient, and fast integration between the two systems.
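Today that client-style pattern typically goes over JDBC, roughly like this (a sketch: it assumes the Dremio JDBC driver jar is available, and the host, credentials, and dataset name are placeholders):

```python
from pyspark.sql import SparkSession

# The Dremio JDBC driver jar must be on Spark's classpath.
spark = (SparkSession.builder
         .appName("spark-on-dremio")
         .config("spark.jars", "/path/to/dremio-jdbc-driver.jar")
         .getOrCreate())

# Spark as a client, Dremio as the data access layer underneath.
df = (spark.read.format("jdbc")
      .option("url", "jdbc:dremio:direct=localhost:31010")
      .option("driver", "com.dremio.jdbc.Driver")
      .option("dbtable", '"space"."my_virtual_dataset"')
      .option("user", "user")
      .option("password", "secret")
      .load())

df.show()
```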

In terms of SQL engines, we want people to have options and to use the one that makes the most sense for their use case. Currently queries are executed in our own Arrow-based engine, called Sabot. However, there may be reasons to use other engines like Presto or Impala. Dremio's vertically integrated query engine, and its ability to generate query substitutions for different representations of data, are compatible with other SQL engines.

We aren’t there yet but this is something we are working on.

Thanks, Kelly, for these explanations.

What I don't understand is that Spark also has an in-memory, distributed way of storing data… In what way is Arrow better (apart from serving as a data hub for access from different languages)? If Arrow is intrinsically better, could we imagine Spark using it as a replacement for its own system?

Yes, Spark could use Arrow more broadly, and is already using Arrow in some areas, such as PySpark, as you mentioned before. If you look at the Plasma project, this is happening in Ray. I think that over time more libraries will standardize on Arrow, as we are seeing in Python and eventually with R.
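For example, the PySpark/Arrow integration mentioned above is an opt-in config flag (the key shown is the Spark 2.3/2.4 name; newer releases use spark.sql.execution.arrow.pyspark.enabled):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("arrow-demo").getOrCreate()

# With the flag on, toPandas() transfers data in Arrow columnar
# batches instead of serializing row by row.
spark.conf.set("spark.sql.execution.arrow.enabled", "true")

df = spark.range(1_000_000)
pdf = df.toPandas()   # Arrow-accelerated conversion to pandas
print(pdf.shape)
```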

But these things are non-trivial to change. :slight_smile:

Thanks,
I'll take a look…
Regards

More on tools for Docker and Kubernetes here.