Hello,
I am using Dremio version 24.3.2
I am trying to create a raw reflection from a view that performs some calculations and joins across several other raw tables.
In total, the underlying tables use about 160 MB of Blob storage space across roughly 3,000 files.
My data source is Azure Blob Storage.
Whenever I run the view query, it takes around 30-60 seconds.
Now I want to enable a reflection on this view, so that retrieving the data takes less than 5 seconds every time I run it.
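For reference, I enable the reflection through the UI; the SQL equivalent would look roughly like this (a sketch, where the space, view, reflection, and column names are placeholders for my actual ones):

ALTER DATASET myspace.my_view
CREATE RAW REFLECTION my_view_raw
USING DISPLAY (col_a, col_b, col_c);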
The problem is that whenever I try to create a reflection, the job runs for many hours and never finishes. For example, I have attempted the reflection creation twice: once it ran for 66 hours and once for 17 hours, and neither attempt succeeded.
I am using a single node with 4 cores and 32 GB of RAM.
I have tried creating a raw reflection for one of the tables used in the view, and it succeeded, but only after 1 hour.
Is there a limitation with views that I should be aware of?
I have tried every solution that I could find on this forum, but without success.
Any help would be greatly appreciated.
Thanks
Here is the profile of the last query I ran:
d1b83bd4-b4f0-4595-aeac-64b790b7d65d.zip (72.8 KB)
@rrosenzvaig On 24.x this path should have used Iceberg metadata, but it did not, because the settings below have been disabled (they default to true):
{
  "kind" : "BOOLEAN",
  "type" : "SYSTEM",
  "name" : "dremio.execution.support_unlimited_splits",
  "bool_val" : false
}, {
  "kind" : "BOOLEAN",
  "type" : "SYSTEM",
  "name" : "dremio.iceberg.enabled",
  "bool_val" : false
},
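To turn them back on, the support keys can be set from the SQL console with something like the following (a sketch using the standard support-key syntax; as noted below, do not flip these on without planning the metadata migration):

ALTER SYSTEM SET "dremio.execution.support_unlimited_splits" = true;
ALTER SYSTEM SET "dremio.iceberg.enabled" = true;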
In the old path, metadata refresh runs on the coordinator and every refresh is a full refresh; in the new flow, the work is shared by all the executors just like a query, and the metadata refresh is incremental.
I would be interested to see your query profile without the reflection, as that would also have used the old code path.
In this profile, PARQUET_WRITER is taking all the time, but I see no metrics. I need to see the profile with unlimited splits turned on.
As for the steps: it is not just a matter of turning on the above keys. Your metadata will start going to your dist storage, and you would have to recompute the metadata for every PDS by doing a forget and refresh. In fact, these keys will be dropped in the next version.
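For each PDS, the forget-and-refresh cycle would look roughly like this (a sketch; the table path is a placeholder, and depending on your auto-promotion settings you may need to re-promote the dataset between the two commands):

ALTER TABLE "azure-source".folder.my_table FORGET METADATA;
ALTER TABLE "azure-source".folder.my_table REFRESH METADATA;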
Thanks
Bali
Thanks for the response.
I am testing the configs on a test server. I did a reset of the entire environment, and I am going to try the reflection again.
As soon as I can, I will share the query plan as well.
Is there anything else I should try?
@balaji.ramaswamy I have the new query profile after resetting all the settings.
940e8fe8-af5a-4901-b83f-8ab08540adc0.zip (55.0 KB)
All the time is spent in PARQUET_WRITER and the record processing rate is extremely slow. If you look, all the time is spent on process time, so this is not IO. It looks like 2af9eea3408a is a 4 core box, and that could be the issue, as the PARQUET_WRITER is writing 7 million records per thread. Still, the record processing rate seems slow.
When you run this reflection, after about 30 minutes, is it possible to capture a thread dump about once every second for, say, 5 minutes?
#!/bin/bash
# Capture one thread dump per second for 5 minutes (300 samples).
# Replace <dremio-pid> with the process ID of the Dremio server.
for i in $(seq -w 1 300)
do
  jstack -l <dremio-pid> > ThreadDump$i.txt
  sleep 1
done
chmod +x jstack.sh
./jstack.sh
tar -cvf jstack.tar *.txt
gzip jstack.tar
And send the jstack.tar.gz file
Ok, thanks a lot for the suggestion. I will run this test and share the results as soon as possible.
@balaji.ramaswamy Here is the file
jstack.tar.zip (4.2 MB)
@rrosenzvaig Sorry, my bad. Please also send me the profile of the job that was running when you took the thread dumps, as I need to trace the job ID in the thread dump.
Sure. It is either 1a0f5213-d05d-3d69-c0e5-d4a018bdd800 or 1a0e6326-7327-17ba-72b1-9fc4893f2100
@rrosenzvaig I do not see either Job ID in the ThreadDump files. Can you please send me the profile of the reflection refresh that never finished, from the time this thread dump was taken?