Non-Trivial Reflection Storage Costs

Hi,

I downloaded the NYC Taxi Trip record data (Jan 2010 - April 2022) to experiment with some performance aspects of Dremio.

After uploading it to a folder within a MinIO bucket, “samples/NYC-taxi-trip”, the dataset looked like this:

No. of files: 136
Min file size: 4.2 MiB
Max file size: 195.5 MiB
Total files size: 16.7 GiB
Total No. of records: 1.3B
No. of columns: 19
File format: Parquet

I have a single-node Dremio installed on a guest VM with the following specs:

CPU: 16 v-core
RAM: 16 GB
Disk: 50 GB (SSD)

I also have a standalone MinIO server installed on another guest VM with the following specs:

CPU: 2 v-core
RAM: 2 GB
Disk: 100 GB (SSD)

Both installations are default, except for Dremio where I changed this:

DREMIO_MAX_DIRECT_MEMORY_SIZE_MB=16384

Following are 2 screenshots, one during promotion of the taxi dataset folder, and another after promotion:


And here are the job details for running the above query:

When I submitted the default raw reflection creation job, the Dremio machine crunched continuously for over an hour and a half, with CPU utilization between 70% and 100% the whole time:

And here’s the created reflection:

Last but not least, here’s the profile for the accelerator creation:
105f3fd1-a5e4-4e9d-8c7d-5fd06c9c1291.zip (24.1 KB)

I understand that the performance is improved (though not to the extent I wished, but let me play with this later):

The shock came when I checked the footprint of this reflection, which was spread across 2 places. One on the Dremio machine itself:

And the other on the MinIO machine (dremio22 is the Dremio bucket defined by the dist path in dremio.conf):

I have a few questions here:

  1. Is over 95 minutes of reflection creation time normal for the hardware specs I used?
  2. Can anyone think of any infrastructure-level tricks to substantially reduce this time without investing in more hardware?
  3. What is this 18.9 GB of data on the Dremio machine? Shouldn’t the reflection data be stored in the MinIO bucket where it’s configured?
  4. The way I see it, the overall footprint of the reflection is 18.9 GB on the Dremio machine plus 39.6 GB on the MinIO machine, 58.5 GB in total, all spawned from only 16.7 GB of actual data: more than 3.5 times the original! In other words, for every $1 paid for data storage, another $3.5 is paid just for raw reflection storage! Is that a hidden cost of Dremio that no one is taking note of?
  5. Other than reflections, is there anything tangible Dremio does to accelerate queries, compared to Spark or Trino?
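For reference, the arithmetic behind question 4 can be reproduced with a minimal sketch. It uses only the figures quoted above and treats GB and GiB as interchangeable, as the post does, so the ratio is approximate.

```python
# Figures quoted in the post (units as reported there; GB vs GiB is not reconciled).
source_size = 16.7      # original Parquet files in MinIO
local_size = 18.9       # reflection-related data found on the Dremio machine
minio_size = 39.6       # reflection data found in the dremio22 bucket

total_size = local_size + minio_size
ratio = total_size / source_size

print(f"total reflection footprint: {total_size:.1f}")
print(f"footprint relative to source: {ratio:.2f}x")
```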

@yalmasri

  • The bulk of the time, ~40 minutes, is spent in PARQUET_WRITER, and each thread (6 in total) writes ~220 million records
  • All the compute happens on a single node, so there was CPU contention for the same amount of time
  • The 18.9 GB of files on local disk is the local cache, so the next time you run the query it will read from local disk instead of MinIO
  • There will be multiple materializations on disk. If you go to your MinIO accelerator folder and look under b28ddf70-2652-4f47-a770-b85dcf4a9d3c, how many subfolders do you see with a 16-digit UUID? Under each of them you will see Parquet files
  • If your original data is already Parquet on MinIO, and your reflection selects all rows and all columns from that dataset and is again stored in MinIO, there is no big advantage. Instead, consider creating an aggregation reflection, assuming your dashboards will be reports with dimensions and measures. Even if you create a raw reflection, does the dashboard need all the columns and all the rows? Single-node compute is another bottleneck
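One way to answer the subfolder question is to group an object listing by the first folder under the reflection ID. The sketch below is only an illustration: it assumes you already have `(key, size_in_bytes)` pairs from whatever S3 client you use, and the key layout and the folder names `mat-a` and `mat-b` are hypothetical placeholders, not real Dremio paths.

```python
from collections import defaultdict


def summarize_materializations(objects, reflection_prefix):
    """Group (key, size_bytes) pairs by the first folder under reflection_prefix."""
    prefix = reflection_prefix.rstrip("/") + "/"
    summary = defaultdict(lambda: {"files": 0, "bytes": 0})
    for key, size in objects:
        if not key.startswith(prefix):
            continue  # object belongs to another reflection or folder
        rest = key[len(prefix):]
        if "/" not in rest:
            continue  # object sits directly under the reflection folder
        materialization = rest.split("/", 1)[0]
        summary[materialization]["files"] += 1
        summary[materialization]["bytes"] += size
    return dict(summary)


# Hypothetical listing; real keys and sizes would come from the dremio22 bucket.
reflection = "accelerator/b28ddf70-2652-4f47-a770-b85dcf4a9d3c"
listing = [
    (f"{reflection}/mat-a/0_0_0.parquet", 100),
    (f"{reflection}/mat-a/0_0_1.parquet", 150),
    (f"{reflection}/mat-b/0_0_0.parquet", 200),
    ("accelerator/another-reflection/mat-c/0_0_0.parquet", 999),
]

result = summarize_materializations(listing, reflection)
for name, stats in sorted(result.items()):
    print(name, stats["files"], "files,", stats["bytes"], "bytes")
```

If more than one materialization folder shows up, their combined size would account for a bucket footprint larger than a single copy of the data.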

Thanks Balaji, I think the picture is clear now.

image

I have 8 virtual cores on my machine backed by 4 physical ones.

Threads #2 & #3 above are almost processing without wait time - completely utilizing 2 physical cores, while the other 4 threads are contending on the remaining 2 cores with almost equal wait-to-process time. And because this is CPU-intensive, I’m not benefiting from virtualization.

If you agree with me, this also tells me that MinIO was not a bottleneck at any point, and that deploying MinIO in distributed mode would not have helped. Even running Dremio in a clustered deployment (using the same specs) wouldn’t have changed the result either (please confirm this understanding).

Here’s the answer (there are 150 parquet files in there)

Yes, I think the performance increase I’ve seen is because of internal caching. Parquet is a very optimized format and not much can be done to squeeze it further (I tried to gzip all of the 16.7 GB of Parquet files and couldn’t save more than a few additional MB of space).
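The gzip observation can be illustrated with a small sketch. It assumes that random bytes are a fair stand-in for already-compressed Parquet pages, which is a simplification; `gzip_savings` is a helper name made up for this example.

```python
import gzip
import os


def gzip_savings(data: bytes) -> float:
    """Return the fraction of bytes saved by gzip (negative if the output grows)."""
    if not data:
        return 0.0
    compressed = gzip.compress(data)
    return 1 - len(compressed) / len(data)


# Stand-in for data that is already compressed (high entropy).
already_compressed = os.urandom(1_000_000)
# Stand-in for raw, repetitive rows that have not been compressed yet.
repetitive = b"2010-01-01,1,2.5\n" * 50_000

print(f"already compressed: {gzip_savings(already_compressed):.2%} saved")
print(f"repetitive text:    {gzip_savings(repetitive):.2%} saved")
```

Gzip finds almost nothing to remove from high-entropy input, which is consistent with saving only a few MB on the Parquet files.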

No, the dashboard would not need every row and column, but as an anchor reflection we do need to include everything.

@yalmasri Sorry about the silence, do you still have performance issues?