I downloaded the NYC Taxi Trip record data (Jan 2010 - April 2022) to experiment with some performance aspects of Dremio.
After uploading it to a folder in a MinIO bucket (“samples/NYC-taxi-trip”), the dataset looked like this:
No. of files: 136
Min file size: 4.2 MiB
Max file size: 195.5 MiB
Total file size: 16.7 GiB
Total No. of records: 1.3B
No. of columns: 19
File format: Parquet
I have a single-node Dremio installation on a guest VM with the following specs:
CPU: 16 v-core
RAM: 16 GB
Disk: 50 GB (SSD)
I also have a standalone MinIO server installed on another guest VM with the following specs:
CPU: 2 v-core
RAM: 2 GB
Disk: 100 GB (SSD)
Both installations are default, except for Dremio where I changed this:
Following are 2 screenshots, one during promotion of the taxi dataset folder, and another after promotion:
And here are the job details for running the above query:
When I submitted the default raw reflection creation job, the Dremio machine crunched on it continuously for over an hour and a half, with CPU utilization between 70% and 100% the whole time:
And here’s the created reflection:
Last but not least, here’s the profile for the accelerator creation:
105f3fd1-a5e4-4e9d-8c7d-5fd06c9c1291.zip (24.1 KB)
Performance did improve (though not as much as I had hoped; I will look into that later):
The shock came when I checked the footprint of this reflection, which was spread across 2 places: one on the Dremio machine itself:
And the other on the MinIO machine (dremio22 is the Dremio bucket defined in the dist variable in dremio.conf):
I have a few questions here:
- Is over 95 minutes of reflection creation time normal for the hardware specs I used?
- Can anyone think of any infrastructure-level tricks to substantially reduce this time without investing in more hardware?
- What is this 18.9 GB of data on the Dremio machine? Shouldn’t the reflection data be stored in the MinIO bucket where it’s configured?
- The way I see it, the overall footprint of the reflection was 18.9 GB on the Dremio machine plus 39.6 GB on the MinIO machine, or 58.5 GB in total, all spawned from only 16.7 GB of actual data. That is about 3.5 times the source size! In other words, for every $1 paid for data storage, another $3.50 would be paid just for raw reflection storage! Is that a hidden cost of Dremio that no one is taking note of?
- Other than reflections, is there anything tangible Dremio does to accelerate queries, compared to Spark or Trino?
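For anyone who wants to check the footprint arithmetic in the fourth question, here is a minimal sketch using only the figures quoted in this post. The variable names are just illustrative, and it assumes the GiB and GB figures can be compared directly, which is only approximately true.

```python
# Figures as reported in the post. The post mixes GiB and GB;
# this sketch treats them as the same unit (an assumption).
source_size = 16.7        # original Parquet files in samples/NYC-taxi-trip
local_footprint = 18.9    # data found on the Dremio machine
minio_footprint = 39.6    # data found in the dremio22 bucket on MinIO

# Total space attributed to the raw reflection, and its ratio to the source.
total_footprint = local_footprint + minio_footprint
ratio = total_footprint / source_size

print(f"Total reflection footprint: {total_footprint:.1f} GB")
print(f"Footprint relative to source data: {ratio:.2f}x")
```

Whether all of the data on both machines really belongs to the finished reflection (rather than, say, temporary or spill files) is exactly what the third question is asking, so the ratio should be read as an upper estimate.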