I’m reading through your Dremio documentation and thinking about whether we can leverage Dremio to improve the performance of our SQL-on-Spark workload. I have a few questions:
How does Dremio achieve high performance, and even zero-copy, when moving data over the network?
If we have Parquet files on HDFS generated by Spark and queried with Spark SQL, how can we migrate to Dremio? What does the data flow look like when querying through Dremio: HDFS -> Arrow -> Dremio engine?
Arrow supports moving data from Spark to other systems; how is that implemented? Do we need to inject code into Spark?
You can query the Parquet files directly using Dremio.
Dremio has a vectorized Parquet reader which should give you good performance when accessing the physical data directly. Parquet is read straight into Arrow buffers for in-memory execution. The Parquet files can live on S3, HDFS, ADLS, or even NAS.
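To make the Parquet-to-Arrow step concrete, here is a minimal sketch using the standalone pyarrow library (the file path is made up, and this is plain pyarrow rather than Dremio’s own vectorized reader):

```python
import pyarrow.parquet as pq

# Decode a Parquet file straight into Arrow columnar buffers; this is the
# same shape of data the engine then executes against in memory.
# The path is a placeholder -- it could equally be an S3/HDFS/NAS location
# for which pyarrow has a filesystem implementation.
table = pq.read_table("/data/warehouse/events/part-00000.parquet")

print(table.schema)     # column names and Arrow types
print(table.num_rows)   # rows now held in Arrow memory
```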
You may also find that Dremio can further improve the performance of certain query patterns through reflections. These can be aggregated, filtered, and/or sorted representations of your Parquet data. If you have Dremio create and manage these, it will also store them as Parquet. If you create and manage them externally via Spark or some other process, you can register them as external reflections. Either way, Dremio’s query planner will consider all the options and choose the lowest-cost plan. This is similar to materialized views in databases like Teradata.
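For example, if you wanted to build an aggregated representation externally with Spark and later register it as an external reflection, it might look roughly like this (paths and column names are made up for illustration):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("build-external-reflection").getOrCreate()

# Raw Parquet data produced by your existing Spark pipeline
# (the path and columns are placeholders).
events = spark.read.parquet("hdfs:///warehouse/events")

# Pre-aggregate it the way an aggregation reflection would, then persist the
# result as Parquet so it can be registered with Dremio as an external reflection.
daily = (events
         .groupBy("event_date", "country")
         .agg(F.count("*").alias("event_count"),
              F.sum("revenue").alias("total_revenue")))

daily.write.mode("overwrite").parquet("hdfs:///warehouse/events_daily_agg")
```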
Thanks a lot! That’s much clearer for me. So Dremio uses Arrow as its in-memory store and Parquet as the persistence format, which can live on S3, HDFS, etc., right? And a reflection is like a materialized view. Within the Dremio query engine itself, the computation happens directly on the Arrow in-memory data, while for an external reflection it first loads the data from the external source into Arrow, am I right?
Meanwhile, if I want to leverage Arrow as the storage layer for Spark SQL, my understanding is that it doesn’t make sense, right? Because every time I do some computation I would have to read the Arrow off-heap data into the Spark JVM and then write it back, which is time- and CPU-consuming, right?
You’re right - Dremio reads any source into Arrow buffers for processing, whether that’s a data reflection or a supported source like Elastic or even Excel. All processing is in Arrow, even all the way to an ODBC client.
Another thing to consider is that Arrow doesn’t use Snappy or other CPU-intensive compression, so it consumes more space than formats like Parquet. The reason is that for in-memory processing Arrow aims to be very CPU-efficient, so it uses techniques like dictionary encoding to save space without spending much CPU.
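As a rough illustration of that tradeoff (the column contents here are made up), dictionary-encoding a repetitive Arrow column shrinks it cheaply in memory, while Snappy is something you apply when writing the data out as Parquet:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# A repetitive string column (values are placeholders).
cities = pa.array(["NYC", "SFO", "LON"] * 500_000)

# In memory, Arrow can dictionary-encode: cheap for the CPU to scan,
# yet much smaller than the plain string representation.
encoded = cities.dictionary_encode()
print("plain Arrow size:   ", cities.nbytes)
print("dictionary-encoded: ", encoded.nbytes)

# On disk, Parquet layers Snappy on top for persistence -- smaller still,
# but it costs CPU to decompress at read time.
pq.write_table(pa.table({"city": cities}), "/tmp/cities.parquet",
               compression="snappy")
```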
The way Dremio balances this for Data Reflections is to store them on disk as Parquet and to read them into Arrow buffers for processing. This is why the vectorized Parquet reader is so useful as it makes this process very, very fast.
Later this year you’ll start to see better integration between Spark and Dremio.
What do you mean by “better integration between Spark and Dremio”? Are you going to change Spark’s computation layer to work on Arrow memory directly?
We are working on an RPC layer for Arrow so that distributed Spark processes can access distributed Arrow buffers directly without going through an ODBC/JDBC layer.
This will also allow Spark-based processes to operate on the Arrow buffers directly without performing any serialization/deserialization. It will be a big improvement in terms of performance and efficiency.
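To give a feel for what this looks like on the client side, here is a minimal sketch using today’s pyarrow Flight API, which provides this kind of Arrow-native RPC (the host, port, and ticket contents are assumptions, and authentication is omitted):

```python
import pyarrow.flight as flight

# Connect to an Arrow-native RPC endpoint (host and port are placeholders).
client = flight.FlightClient("grpc://dremio-coordinator:32010")

# Ask the server for a dataset; results stream back as Arrow record batches,
# so there is no ODBC/JDBC-style row-by-row serialization on either side.
# Ticket contents are server-specific; a raw SQL string is just a placeholder.
reader = client.do_get(flight.Ticket(b"SELECT * FROM demo.events"))
table = reader.read_all()   # already an Arrow Table in memory

print(table.num_rows, table.schema)
```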
Hi @shai.bet, I’d suggest Parquet on cloud data lake storage (e.g., Amazon S3). Ideally, you’d have Dremio provisioned through the same cloud provider as your data, so if your data is in Amazon S3 it’s best to have Dremio provisioned on EC2, close to your data (as opposed to something like Azure VMs).
I’ve found that the best way to work with Parquet files on local storage is to use Dremio’s CTAS feature over any data source (cloud, HDFS, NAS, etc.).
CTAS will create Parquet files, but presumably using some internal logic that is better optimized for Dremio.
I hope you can elaborate on the actions Dremio takes when using CTAS, please.
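For context, the kind of CTAS I run looks roughly like this, issued over ODBC from Python (the DSN, source, table, and column names are all placeholders):

```python
import pyodbc

# Connect through the Dremio ODBC driver (the DSN name is a placeholder).
conn = pyodbc.connect("DSN=Dremio", autocommit=True)
cur = conn.cursor()

# CTAS: Dremio runs the SELECT and materializes the result as Parquet files
# in the target source. All names below are placeholders.
cur.execute("""
    CREATE TABLE hdfs_source.curated.events_parquet AS
    SELECT event_date, country, revenue
    FROM hdfs_source.raw."events.csv"
""")
```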