We have an on-premises Hadoop infrastructure. Our business folks have also started storing data in S3 on AWS.
Can Dremio help us join on-premises HDFS data with S3 data in the Amazon cloud?
Does anyone have any experience with that?
Yes! This is definitely a primary and common use case for Dremio. Here is a tutorial on how to join various data sources (in your case, one source would be on-prem HDFS and the other would be cloud S3): https://www.dremio.com/tutorials/combining-data-from-multiple-datasets/
Thanks a lot. I will go through it.
One thing to keep in mind: Data Reflections can help keep the analytics on one side of the network. So, you could run Dremio in your Hadoop cluster and have the reflections live in HDFS, or you could deploy Dremio on AWS and have the reflections live in S3. If you’re always joining across AWS and on-prem, every query pulls data over the link between the two sites, so you’re going to have non-trivial network latency and your experience might be so-so.
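
To get a rough feel for why the side of the network matters, here is a back-of-the-envelope sketch in Python. The link speed and data sizes are made-up assumptions for illustration, not measurements from any real Dremio deployment, and the estimate ignores latency, protocol overhead, and compression.

```python
def transfer_seconds(bytes_moved: float, bandwidth_bits_per_s: float) -> float:
    """Idealized time to move data over a link (no latency or overhead)."""
    return bytes_moved * 8 / bandwidth_bits_per_s


GB = 10**9

# Assumption: a 1 Gbit/s link between the on-prem cluster and AWS.
link_bits_per_s = 1 * 10**9

# Assumption: a cross-site join has to pull 50 GB of raw data over the link.
raw_scan = transfer_seconds(50 * GB, link_bits_per_s)

# Assumption: with the heavy lifting kept on one side (for example, served
# from a reflection stored there), only 0.5 GB crosses the link.
local_side = transfer_seconds(0.5 * GB, link_bits_per_s)

print(f"cross-site scan:  {raw_scan:.0f} s")
print(f"one-sided result: {local_side:.0f} s")
```

Plug in your own link speed and table sizes; the point is only that the amount of data crossing the link tends to dominate the query time.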