BI tools performance, especially on Windows platform

Hi,

Thanks for a very interesting product.

I have a question: can I somehow get the same performance through a Power BI or Qlik integration as I get with their internal columnar storage engines?

My setup is Dremio on Windows.
I am using the sample SF incidents dataset (which is very small, about 31 MB uncompressed).
Both reflections (raw and cube) are enabled. Sorting and partitioning are not enabled; I could not find the documentation for these options.

I tested the performance in Power BI (both import and DirectQuery modes) and in Qlik Sense (direct).
Performance is much slower for direct connections than it is for loaded data.
The import model in Power BI works fast (because the data is loaded into the internal columnar storage of SSAS), but the refresh time through ODBC is very slow. I think a MySQL table with the imported dataset would be faster, without any optimizations.

When I test the Apache Arrow implementation for Python/Pandas (called Feather), the read and write speeds are incredible. Feather writes dataframes at approximately 250 MB/s and reads at approximately 550 MB/s on my PC with average specs. Basically, I can use the SSD as a RAM extension for Pandas, especially if I switch to one of these newer SSDs that support 3200 MB/s.
For example, I can read/write the SF incidents file to/from disk in a fraction of a second using Feather.
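For reference, here is roughly how I time those reads/writes (a minimal sketch, assuming pandas and pyarrow are installed; the file names are just examples):

```python
import time
import pandas as pd

# Load the source CSV once (path is an example)
df = pd.read_csv("sf_incidents.csv")

# Time a Feather write
start = time.perf_counter()
df.to_feather("sf_incidents.feather")
print(f"write: {time.perf_counter() - start:.3f}s")

# Time a Feather read
start = time.perf_counter()
df2 = pd.read_feather("sf_incidents.feather")
print(f"read: {time.perf_counter() - start:.3f}s")
```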

The Qlik format, called QVD, also works great for reads/writes (it is similar to Feather, but compressed, and thus a little slower). I can load a CSV into Qlik, export it to a compressed QVD, and after that refreshes are almost instant.

I understand that Dremio for Windows is not intended for production. But assuming I have powerful Linux machine(s) and a 10 Gbit network to a Windows machine with Tableau, Power BI, Qlik Sense, or Pandas, will I ever achieve the speed of Feather reads/writes?

In other words, there are three great file formats/technologies (Qlik QVD, Arrow/Feather, and the internal columnar storage in BI tools) that work at huge speed on my machine.
Is there any way/setup/approach to achieve the same speed with a Dremio-BI integration?

Thanks for your opinions and suggestions!


Hi Slava.

The short answer is yes, you should be able to get similar performance, but for much, much larger datasets, and without having to build extracts or load the data into the in-memory engine. The way Qlik currently works, there isn't an easy way to do direct queries, but that may change.

You say you have reflections enabled on the SFIncidents dataset. When you look at the query profiles, are your queries being accelerated? You should see sub-second query responses on this dataset when reflections are used.

Dremio’s query execution engine is based on Apache Arrow. In fact, Arrow buffers are streamed from Dremio to the ODBC driver on the client.
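If you want to measure the ODBC path in isolation (outside of Power BI or Qlik), here's a rough sketch using pyodbc; the DSN name and dataset path are placeholders for your setup:

```python
import time
import pyodbc

# "Dremio" is a placeholder DSN; point it at your Dremio ODBC data source
conn = pyodbc.connect("DSN=Dremio", autocommit=True)
cursor = conn.cursor()

start = time.perf_counter()
# Dataset path is a placeholder; use your own space/dataset name
rows = cursor.execute('SELECT * FROM "SFIncidents"').fetchall()
print(f"{len(rows)} rows in {time.perf_counter() - start:.2f}s")
```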

I’m about to publish a tutorial on data reflections and will post that here once it is published.

Here’s the new tutorial: Getting Started With Data Reflections. Let me know if you have any feedback. 🙂

There are more advanced areas we want to cover on this topic, and that's a work in progress.

Kelly

Thanks.
I will definitely go through this tutorial and will provide detailed feedback later.
(I hope I can post Dropbox or YouTube links here, so I can share some videos and screenshots if needed.)

I’ve played with the Wes McKinney examples (Pandas<->Feather, Pandas<->Parquet, Pandas<->Arrow in-memory) and the performance is stunning. If 50% of this can be achieved in a BI<->ODBC<->Dremio reflections setup (without clustering), it would be great.
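For anyone curious, the pattern in those examples is roughly this (a minimal sketch, assuming pyarrow is installed; file names are examples):

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_csv("sf_incidents.csv")  # example source file

# Pandas <-> Parquet round-trip on disk
pq.write_table(pa.Table.from_pandas(df), "sf_incidents.parquet")
df_parquet = pq.read_table("sf_incidents.parquet").to_pandas()

# Pandas <-> Arrow in-memory conversion (no disk involved)
table = pa.Table.from_pandas(df)
df_arrow = table.to_pandas()
```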