Reflections get stuck for many hours without failing

I set up 2 reflections for 2 specific use cases that we will soon have in production. The reflections seem to work fine initially, but after about a day (or more than 8 hours) they hang, running for many hours or even days without either completing or failing.

We are using Dremio 19.1.0-202111160130570172-0ee00450 AWS Edition.

Our coordinator is an m5.2xlarge machine, and for now we have only 2 executors, also m5.2xlarge.

Here is our dremio.conf:

services.executor.enabled: false
debug.dist.caching.enabled: true
paths.local: "/var/lib/dremio"
# paths.results: "pdfs://"${paths.local}"/data/results"
paths.results: "dremioS3:///our_bucket/results"

registration.publish-host: "private_ip"
services.coordinator.master.embedded-zookeeper.enabled: false
zookeeper: "private_ip"
paths.accelerator = "dremioS3:///our_bucket/accelerator"
paths.uploads = "dremioS3:///our_bucket/uploads"
paths.downloads = "dremioS3:///our_bucket/downloads"
paths.scratch = "dremioS3:///our_bucket/scratch"
paths.metadata = "dremioS3:///our_bucket/metadata"
provisioning.ec2.efs.mountTargetIpAddress = "private_ip"
provisioning.coordinator.enableAutoBackups = "true"

In addition, we enabled dremio.execution.support_unlimited_splits, dremio.iceberg.enabled and store.accurate.partition_stats.

The reflections are set up on datasets from the AWS Glue catalog, with an hourly refresh and a 2-hour expiration.
Any hints on how to debug what is going on? As it stands, we cannot really leverage reflections at all, except by disabling and re-enabling the reflection (manually or via the API). That is a pity, as we believe they would be perfect for our use case.

@nicor88 Can we please get profiles for those reflection refresh jobs?

@balaji.ramaswamy Here are the reflection profiles. This week we enabled more reflections (6 in total), and all of them got stuck.

This morning I forcefully stopped the executors that I created for reflections (for now 1 EC2 machine, m5.2xlarge). After that, the stuck reflection jobs were stopped, and a new refresh ran and completed successfully.
Specifically, the profile that I’m attaching ran for 25 hours without throwing an error. I also set up a max running time for the reflection queues, but it didn’t help.

@balaji.ramaswamy Did you manage to have a look at the above, to see if you spotted anything weird?
As it is right now, all the reflections keep hanging indefinitely until I kill the executor they run on. For the first few hours they seem to work fine, but after a while they just hang. Is there anything to tweak? As far as I understand, the result is stored in S3 using Iceberg.

I found something more about the above:

Query cancelled by Workload Manager. Query runtime limit of 2700.00 seconds exceeded

Even though the query is cancelled by the Workload Manager, it does not seem to be properly marked as failed, so it keeps hanging and no new reflection refreshes are started.

Any idea on how to make the setup better?

@nicor88 If you look at the job profile, all phases are COMPLETED except phase 0. I see only a single node executing the query. Are you able to send us the log from that executor for the time when this issue happened? How long is this reflection expected to take to complete?

@balaji.ramaswamy I set up only one node for the reflections, as we are dealing with small datasets.
I observed this behaviour for reflections on top of both S3 and Redshift.
The refresh time is on the order of seconds for Redshift, and between 1 and 5 minutes for S3.

Attached is the profile of a reflection that is still running (>4h) and generally takes seconds.
Which logs shall I send? audit.log is empty on the executor. Attached you can find the executor logs from server.log.

Please let me know if you have any workaround to fix this. I was, for example, thinking about systematically deleting and recreating a reflection if it runs for longer than a given time, or systematically killing the executors if a reflection gets stuck.
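The "runs for longer than a given time" check mentioned above could be sketched roughly as follows. This is only an illustration under assumptions: the job records and their `id`, `state` and `started_at` fields are hypothetical, and how the jobs are listed and how a reflection is then disabled and re-enabled is deliberately left out, since the thread does not specify it.

```python
from datetime import datetime, timedelta, timezone


def find_stuck_refreshes(jobs, max_runtime, now=None):
    """Return the ids of refresh jobs that have been RUNNING longer than max_runtime.

    `jobs` is a list of dicts with hypothetical keys:
      id         - job identifier
      state      - job state as a string, e.g. "RUNNING" or "COMPLETED"
      started_at - timezone-aware datetime of the job start
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return [
        job["id"]
        for job in jobs
        if job["state"] == "RUNNING" and now - job["started_at"] > max_runtime
    ]
```

With refreshes that normally take seconds to a few minutes, a threshold such as `timedelta(minutes=45)` would leave a wide margin before a job is treated as stuck.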

Any input here?
I have tried many things, and my final conclusion is that I cannot use reflections in a production environment, as they do not seem ready: they keep getting stuck, running forever. Upgrading Dremio to the latest version didn’t help either. I was wondering whether this is due to the Community Edition, or whether it is the same story in the Enterprise Edition.

@nicor88 There is nothing interesting in the executor log, as the work is all pushed down to Redshift. Would you know how many records are in the table "ods_production"."product"?

We have no more than 10k records in it at the moment. I suspect it has something to do with the executors. The CPU usage of the java process, as shown by top, was 600% on one executor. But we are using m5d.2xlarge, and so far I have only enabled the reflection for the table you mentioned. I don’t want to use very big nodes yet, given that we have such small data.
The only approach I have in mind is to create a housekeeping job that kills the reflection executor from time to time; only then do those reflections seem to work.

@balaji.ramaswamy I think that I isolated the issue.

When a reflection gets stuck, if I look at the processes running on the executors, I see the CPU consumption of java processes owned by the dremio user at well over 100%, e.g. 600% up to 1000%.

This could cause Dremio reflections to get stuck and pile up in the reflection queue.
The only workaround I have so far is to periodically kill the reflection executors, but this is really bad practice. Is there any tuning we could do to prevent this?
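To spot this condition without watching top by hand, the check could be scripted. The following is a minimal sketch, assuming the input is the text printed by `top -b -n 1` with the default column layout and a `.` decimal separator; the function name, the `dremio` user default and the 100% threshold are illustrative choices, not anything Dremio provides.

```python
def busy_java_processes(top_output, user="dremio", threshold=100.0):
    """Return (pid, cpu_percent) for java processes of `user` above `threshold`.

    `top_output` is the raw text of one batch-mode top iteration.
    """
    lines = top_output.splitlines()
    # The process table starts right after the "PID USER ..." header line.
    header_idx = next(
        (i for i, line in enumerate(lines) if line.split()[:2] == ["PID", "USER"]),
        None,
    )
    if header_idx is None:
        return []
    columns = lines[header_idx].split()
    busy = []
    for line in lines[header_idx + 1:]:
        # Limit the split so that a command containing spaces stays in one field.
        fields = line.split(None, len(columns) - 1)
        if len(fields) < len(columns):
            continue
        row = dict(zip(columns, fields))
        if row["USER"] != user or row["COMMAND"].strip() != "java":
            continue
        cpu = float(row["%CPU"])
        if cpu > threshold:
            busy.append((int(row["PID"]), cpu))
    return busy
```

The returned PIDs are the ones worth inspecting further before resorting to killing the executor.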

@nicor88 I am not sure why the java process CPU goes high, as the work is pushed down to Redshift. Any chance you can take a jstack when this happens and send me the jstack output along with the profile? A jstack every second, about 60 times, would help. It has to be taken on the executor.
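Taking a jstack every second for about 60 iterations can be scripted. Below is a minimal sketch, assuming the JDK `jstack` tool is on the PATH of the executor and that the PID of the executor JVM is already known; the function names and the output file naming are illustrative only.

```python
import subprocess
import time
from pathlib import Path


def jstack_command(pid):
    """Build the command line for one thread dump of the given JVM pid."""
    return ["jstack", str(pid)]


def sample_jstack(pid, out_dir, count=60, interval_s=1.0,
                  run=subprocess.run, sleep=time.sleep):
    """Take `count` thread dumps, waiting `interval_s` seconds between them.

    Each dump goes to its own file so that consecutive dumps can be compared.
    `run` and `sleep` are parameters only so the function can be tested
    without a real JVM.
    """
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)
    written = []
    for i in range(count):
        result = run(jstack_command(pid), capture_output=True, text=True)
        target = out_path / f"jstack_{pid}_{i:03d}.txt"
        target.write_text(result.stdout)
        written.append(target)
        if i < count - 1:
            sleep(interval_s)
    return written
```

A call such as `sample_jstack(executor_pid, "jstack_dumps")` would be run on the executor while a reflection is stuck, typically as the same OS user that owns the Dremio java process, since jstack generally needs to attach as that user.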