BI tools performance, especially on Windows platform

Hi,

Thanks for a very interesting product.

I have a question: can I somehow get the same performance through a Power BI or Qlik integration as I get with their internal columnar storage engines?

My setup is Dremio on Windows.
I am using the sample SF incidents dataset (which is very small, about 31 MB uncompressed).
Both reflections (raw and cube) are enabled. Sorting and partitioning are not enabled; I could not find the documentation for these options.

I tested the performance in Power BI (both import and DirectQuery modes) and in Qlik Sense (direct).
Performance is much slower for direct connections than it is for loaded data.
The import model in Power BI works fast (because the data is loaded into the internal columnar storage of SSAS), but the refresh time through ODBC is very slow. I think a MySQL table with the imported dataset would be faster, without any optimizations.

When I test the Apache Arrow implementation for Python/Pandas (called Feather), the read and write speeds are incredible. Feather writes dataframes at approximately 250 MB/s and reads at approximately 550 MB/s on my PC with average specs. Basically, I can use the SSD as a RAM extension for Pandas, especially if I switch to one of these newer SSDs that support 3200 MB/s.
For example, I can read/write the SF incidents file to/from disk in a fraction of a second using Feather.
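For reference, here is roughly how I time those reads/writes (a minimal sketch, assuming pandas and pyarrow are installed; the file names are just examples):

```python
import time
import pandas as pd

# Load the source CSV once (path is an example)
df = pd.read_csv("sf_incidents.csv")

# Time a Feather write
start = time.perf_counter()
df.to_feather("sf_incidents.feather")
print(f"write: {time.perf_counter() - start:.3f}s")

# Time a Feather read
start = time.perf_counter()
df2 = pd.read_feather("sf_incidents.feather")
print(f"read: {time.perf_counter() - start:.3f}s")
```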

The Qlik format, called QVD, also works great for reads/writes (it is similar to Feather, but compressed, and thus a little slower). I can load a CSV into Qlik, export it to a compressed QVD, and after that refreshes are almost instant.

I understand that Dremio for Windows is not intended for production. But assuming I have powerful Linux machine(s) and a 10 Gbit network to a Windows machine with Tableau, Power BI, Qlik Sense, or Pandas, will I ever achieve the speed of Feather reads/writes?

In other words, there are three great file formats/technologies (Qlik QVD, Arrow/Feather, and the internal columnar storage in BI tools) that work at huge speed on my machine.
Is there any way/setup/approach to achieve the same speed with a Dremio-BI integration?

Thanks for your opinions and suggestions!


Hi Slava.

The short answer is yes, you should be able to get similar performance, but for much, much larger datasets, and without having to build extracts or load the data into the in-memory engine. The way Qlik currently works, there isn't an easy way to do direct queries, but that may change.

You say you have reflections enabled on the SFIncidents dataset. When you look at the query profiles, are your queries being accelerated? You should see sub-second query responses on this dataset when reflections are used.

Dremio’s query execution engine is based on Apache Arrow. In fact, Arrow buffers are streamed from Dremio to the ODBC driver on the client.
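If you want to measure the ODBC path in isolation (outside of Power BI or Qlik), here's a rough sketch using pyodbc; the DSN name and dataset path are placeholders for your setup:

```python
import time
import pyodbc

# "Dremio" is a placeholder DSN; point it at your Dremio ODBC data source
conn = pyodbc.connect("DSN=Dremio", autocommit=True)
cursor = conn.cursor()

start = time.perf_counter()
# Dataset path is a placeholder; use your own space/dataset name
rows = cursor.execute('SELECT * FROM "SFIncidents"').fetchall()
print(f"{len(rows)} rows in {time.perf_counter() - start:.2f}s")
```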

I’m about to publish a tutorial on data reflections and will post that here once it is published.

Here’s the new tutorial: Getting Started With Data Reflections. Let me know if you have any feedback. 🙂

There are more advanced areas we want to cover on this topic, and that's a work in progress.

Kelly

Thanks.
I will definitely go through this tutorial and will provide detailed feedback later.
(I hope I can post Dropbox or YouTube links here, so I can share some videos and screenshots if needed.)

I’ve played with the Wes McKinney examples (Pandas<->Feather, Pandas<->Parquet, Pandas<->Arrow in-memory) and the performance is stunning. If 50% of this can be achieved in a BI<->ODBC<->Dremio reflections setup (without clustering), it would be great.
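For anyone curious, the pattern in those examples is roughly this (a minimal sketch, assuming pyarrow is installed; file names are examples):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_csv("sf_incidents.csv")  # example source file

# Pandas <-> Parquet round-trip on disk
pq.write_table(pa.Table.from_pandas(df), "sf_incidents.parquet")
df_parquet = pq.read_table("sf_incidents.parquet").to_pandas()

# Pandas <-> Arrow in-memory conversion (no disk involved)
table = pa.Table.from_pandas(df)
df_arrow = table.to_pandas()
```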