I have Dremio 19.3.0 (community edition) installed on premises using a Yarn deployment and connected to an HDP cluster. For the past few days, we have noticed that, for Dremio, Yarn considers some nodes of the cluster (used as executor nodes) as blacklisted. This seems different from “Blacklisted nodes in Dremio”.
Those nodes are blacklisted only for the Dremio application on the cluster and are listed with port 45454 of the host. As a consequence, all executors currently have only one virtual core instead of the 4 requested.
Can someone here explain what causes this situation and, if possible, how to solve it?
@alvinIce If Yarn has blacklisted the node, do you see that in Ambari? Is there a hardware issue or any other issue on that node? Do other Yarn applications such as Spark, Hive, etc. use that node for query execution?
In fact, multiple nodes are blacklisted, but in Ambari they are all healthy.
And yes, other tools such as Spark and Hive are also used on those nodes.
@alvinIce Do you have many job failures?
Yes, we do have a lot of job failures with “BlockMissingException: Could not obtain block”
@alvinIce That is the reason Yarn is blacklisting, do we know why we are getting the
Ambari shows no missing blocks and no unavailable hosts, yet we get the blacklisted nodes and the job failures in Dremio.
That’s why I don’t understand why Dremio is behaving like that. More precisely, on what basis does Yarn decide, for Dremio only, that some hosts are unhealthy?
Do you think that killing Dremio containers directly from the Yarn web UI could be a root cause? We did that to check whether, in this kind of situation, Yarn would automatically provide other resources to Dremio. And indeed, it did. But we noticed the blacklisted nodes after trying that operation 3 times…
@alvinIce Dremio only requests containers from the RM. If the RM is unable to provide the resources, we need to see why on the RM side. In the Dremio server.log on the coordinator, do you see messages saying that Dremio requested a certain number of vcores but only got one? The resource manager log should tell us why you are getting less than requested.
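As a complement to reading the logs, here is a minimal sketch for checking what the ResourceManager itself reports for each node, using the cluster nodes endpoint of the YARN ResourceManager REST API (`/ws/v1/cluster/nodes`). The function names are just illustrative, and the RM address you pass in is an assumption about your setup (8088 is the usual default web port). Note that this is the cluster-wide view, so it may not show nodes that are blacklisted for a single application only.

```python
import json
from urllib.request import urlopen


def fetch_nodes(rm_url):
    """Fetch node reports from the RM REST API (GET /ws/v1/cluster/nodes)."""
    url = f"{rm_url.rstrip('/')}/ws/v1/cluster/nodes"
    with urlopen(url, timeout=10) as resp:
        return json.load(resp)


def summarize_nodes(payload):
    """Return (node id, state, available vcores, health report) for each node."""
    # The API wraps the list as {"nodes": {"node": [...]}}; "nodes" can be null.
    nodes = (payload.get("nodes") or {}).get("node") or []
    return [
        (
            n.get("id"),                     # e.g. "host:45454"
            n.get("state"),                  # e.g. RUNNING, UNHEALTHY, LOST
            n.get("availableVirtualCores"),
            n.get("healthReport", ""),
        )
        for n in nodes
    ]
```

For example, `summarize_nodes(fetch_nodes("http://<your-rm-host>:8088"))` would list every node with its state and free vcores. If the nodes all show as RUNNING with vcores available there, that would be consistent with what you see in Ambari, and the RM log is then the place to look for the application-level reason.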