Non-Trivial Reflection Storage Costs

yalmasri · August 16, 2022, 7:03pm

Hi,

I downloaded the NYC Taxi Trip record data (Jan 2010 - April 2022) to experiment with some performance aspects of Dremio.

After I uploaded it to a folder within a MinIO bucket, “samples/NYC-taxi-trip”, the dataset looked like this:

No. of files: 136
Min file size: 4.2 MiB
Max file size: 195.5 MiB
Total files size: 16.7 GiB
Total No. of records: 1.3B
No. of columns: 19
File format: Parquet

I have a single node Dremio installed on a guest VM with the following specs:

CPU: 16 v-core
RAM: 16 GB
Disk: 50 GB (SSD)

Also, I have a standalone MinIO server installed on another guest VM with the following specs:

CPU: 2 v-core
RAM: 2 GB
Disk: 100 GB (SSD)

Both installations are default, except for Dremio where I changed this:

DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=16384

Following are 2 screenshots, one during promotion of the taxi dataset folder, and another after promotion:

And here’s the job details for running the above query:

When I submitted the default raw reflection creation job, the Dremio machine crunched continuously for over an hour and a half, with CPU utilization at 70% - 100% the whole time:

And here’s the created reflection:

Last but not least, here’s the profile for the accelerator creation:
105f3fd1-a5e4-4e9d-8c7d-5fd06c9c1291.zip (24.1 KB)

I understand that the performance is improved (though not to the extent I wished, but let me play with this later):

The shock came when I checked the footprint of this reflection, which was spread across 2 places: one on the Dremio machine itself:

And the other on the MinIO machine (dremio22 is the Dremio bucket defined in the dist variable of dremio.conf):

I have a few questions here:

Is over 95 minutes of reflection creation time normal for the hardware specs I used?
Can anyone think of any infrastructure-level tricks to substantially reduce this time without investing in more hardware?
What is this 18.9 GB of data on the Dremio machine? Shouldn’t the reflection data be stored in the MinIO bucket where it’s configured?
The way I see it, the overall footprint of the reflection was 18.9 GB on the Dremio machine and 39.6 GB on the MinIO machine, 58.5 GB in total, all spawned from only 16.7 GB of actual data, or more than 3.5 times as much! In other words, for every $1 paid for data storage, an extra $3.50 will be paid for raw reflection storage alone! Is that a hidden cost of Dremio that no one is taking note of?
Other than reflections, is there anything tangible Dremio does to accelerate queries, compared to Spark or Trino?
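
For reference, the arithmetic behind the storage question can be reproduced with a minimal sketch like the one below. The sizes are the ones quoted in this post; the function name `footprint_ratio` is made up for illustration, and GiB and GB are treated as the same unit, as the post does.

```python
def footprint_ratio(source_gb, *copies_gb):
    """Return (total size of all copies, total divided by the source size)."""
    total = sum(copies_gb)
    return total, total / source_gb


# Sizes quoted in the post: 16.7 source, 18.9 on the Dremio VM, 39.6 in MinIO.
total_gb, ratio = footprint_ratio(16.7, 18.9, 39.6)
print(f"total footprint: {total_gb:.1f} GB, {ratio:.2f}x the source")

# The same comparison counting only what landed in the MinIO bucket.
minio_gb, minio_ratio = footprint_ratio(16.7, 39.6)
print(f"MinIO only: {minio_gb:.1f} GB, {minio_ratio:.2f}x the source")
```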

balaji.ramaswamy · August 16, 2022, 10:31pm

@yalmasri

The bulk of the time, ~40 minutes, is spent in PARQUET_WRITER, and each thread (6 in total) is writing ~220 million records
All the compute is happening on a single node, so there was CPU contention for that whole period
The 18.9 GB of files on local disk is the local cache, so the next time you run the query it will read from local disk instead of MinIO
There will be multiple materializations on disk. If you go to your MinIO accelerator folder and look under b28ddf70-2652-4f47-a770-b85dcf4a9d3c, how many subfolders do you see with a 16 digit UUID? Under each of them you will see Parquet files
If your original data is also Parquet on MinIO, and your reflection is a select of all rows and all columns from that dataset, stored in MinIO again, there is no big advantage. Instead, consider creating an aggregation reflection, assuming your dashboards will be reports that have dimensions and measures. Even if you are creating a raw reflection, does the dashboard need all the columns and all the rows? Single node compute is another bottleneck
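
The subfolder check above could be scripted roughly as follows. This is only a sketch: it assumes the reflection folder has been copied or mounted to a local directory where the objects appear as plain files, and the function name `summarize_materializations` and the path in the usage comment are hypothetical.

```python
from pathlib import Path


def summarize_materializations(reflection_dir):
    """Map each subfolder name to (Parquet file count, total bytes)."""
    summary = {}
    for sub in sorted(Path(reflection_dir).iterdir()):
        if not sub.is_dir():
            continue  # only subfolders are treated as materializations
        files = [p for p in sub.rglob("*.parquet") if p.is_file()]
        summary[sub.name] = (len(files), sum(p.stat().st_size for p in files))
    return summary


# Usage (hypothetical local copy of the reflection folder):
# for name, (count, size) in summarize_materializations("/mnt/accelerator/<reflection-id>").items():
#     print(name, count, f"{size / 1e9:.1f} GB")
```

More than one subfolder in the result would mean several materializations are being kept side by side, which adds to the bucket's footprint.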

yalmasri · August 17, 2022, 11:58am

Thanks Balaji, I think the picture is clear now.

I have 8 virtual cores on my machine backed by 4 physical ones.

Threads #2 & #3 above are processing with almost no wait time, completely utilizing 2 physical cores, while the other 4 threads are contending for the remaining 2 cores with almost equal wait-to-process time. And because this is CPU-intensive, I’m not benefiting from virtualization.

If you agree with me, this also tells me that MinIO was not a bottleneck at any point in time, and that deploying MinIO in distributed mode would not have helped. Even running Dremio in a clustered deployment (using the same specs) wouldn’t have changed the result either (please confirm this understanding).

Here’s the answer (there are 150 parquet files in there)

Yes, I think the performance increase I’ve seen is because of internal caching. Parquet is a very optimized format and not much can be done to squeeze it further (I tried gzipping all of the 16.7 GB of Parquet files and couldn’t save more than a few additional MB of space).
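
The gzip result is what one would expect if the files were already written with a compression codec, which Parquet supports per column chunk. A small sketch of the effect, using made-up numeric rows as stand-in data (not the taxi data) and a second gzip pass as a stand-in for gzipping an already compressed file:

```python
import gzip
import random

# Stand-in data: 200,000 rows of two random integers each (hypothetical, seeded).
rng = random.Random(0)
raw = "".join(
    f"{rng.randint(0, 9999)},{rng.randint(0, 9999)}\n" for _ in range(200_000)
).encode()

once = gzip.compress(raw)    # first pass removes most of the redundancy
twice = gzip.compress(once)  # second pass has almost nothing left to remove

print(f"raw:   {len(raw):>9} bytes")
print(f"once:  {len(once):>9} bytes")
print(f"twice: {len(twice):>9} bytes")
```

The first pass shrinks the data noticeably, while the second pass leaves the size essentially unchanged.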

No, the dashboard would not need every row and column, but as an anchor reflection we do need to include everything.

balaji.ramaswamy · September 30, 2022, 10:08am

@yalmasri Sorry about the silence, do you still have performance issues?

yalmasri · September 30, 2022, 2:17pm

No, I’m content that my machine was doing what it should to create that gigantic reflection. Thanks for caring…
