Benefits of YARN over standalone cluster deployment

I’m evaluating deployment options for an on-prem solution and would love some feedback on two scenarios. Which would be more beneficial in terms of performance and the cost of governance and maintenance?

All data are on-prem in S3, MySQL, ELK and SQL Server.
I have access to 3-4 machines that could be deployed in the following two manners:

  1. A YARN deployment with one Master node and two to three Worker nodes. Most likely, none of the data queried by Dremio will be stored on HDFS.
  2. A simple Dremio cluster with one Coordinator node and two to three Executor nodes.

My initial question is: do I really need a Hadoop deployment if all data are stored in other repos? I guess HDFS could be used to store reflections, but would that give a significant performance boost over just going with a simple Dremio cluster?

Any thoughts or feedback would be highly appreciated.

The main advantage of YARN is that it simplifies deploying new versions. However, it's usually best to go with the simplest deployment model. By the way, you can use HDFS to store reflections and other distributed data even if you are not deployed on YARN.
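
For reference, here is a rough sketch of what pointing a standalone (non-YARN) cluster at HDFS for reflections might look like in `dremio.conf` on the coordinator. The hostnames, port and paths are placeholders I made up, so check them against your own environment and the docs for your Dremio version:

```
paths: {
  # local scratch/metadata directory on this node (placeholder path)
  local: "/var/lib/dremio"

  # distributed store for reflections, job results, uploads, etc.
  # <NAMENODE_HOST> and /dremio are placeholders, not values from this thread
  dist: "hdfs://<NAMENODE_HOST>:8020/dremio"
}

services: {
  # this node acts as the master coordinator only
  coordinator.enabled: true,
  coordinator.master.enabled: true,
  executor.enabled: false
}
```

As far as I know, the executor nodes would use the same `paths.dist` value with the coordinator flags set to false and `executor.enabled: true`, plus a `zookeeper` entry pointing at the coordinator. If you have no HDFS at all, `dist` can point at other shared storage instead.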
