We have successfully connected Dremio on EC2 to EMR, and the Hive tables are visible in the datasets. When we run a query against this dataset, the query runs for about 10 minutes and comes back with an error stating “cannot connect to dremio server”.
Could you please look at and/or share your query profile? See: How To Share A Query Profile
Also, do you have any other sources configured, and can you run queries against them successfully?
If your Hive is on EMR, we may not support it out of the box at the moment. Sorry for the inconvenience.
Try adding the following JARs to the Dremio classpath: emrfs-hadoop-assembly-2.18.0.jar, jets3t-0.9.0.jar
I am not sure that will be enough, but you can look at the query profile/error to see if anything else is missing.
Your current error is:
We are finally able to query a Hive external table on EMR. The protocol used for the table location should be s3a instead of s3. This made the difference. Thanks again for your help.
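For anyone applying the same fix, here is a minimal sketch of the scheme change in Python. The bucket, path, and table names are hypothetical, and it assumes the table location is changed with Hive's `ALTER TABLE ... SET LOCATION` statement; partitions of a partitioned table may carry their own locations and could need the same change.

```python
def to_s3a(location: str) -> str:
    """Rewrite an s3:// location to s3a://; leave any other location unchanged."""
    prefix = "s3://"
    if location.startswith(prefix):
        return "s3a://" + location[len(prefix):]
    return location


def alter_location_sql(table: str, location: str) -> str:
    """Build a Hive DDL statement that points `table` at the s3a:// form of `location`."""
    return f"ALTER TABLE {table} SET LOCATION '{to_s3a(location)}'"


if __name__ == "__main__":
    # Hypothetical table and bucket, for illustration only.
    print(alter_location_sql("events", "s3://example-bucket/warehouse/events"))
```

The generated statement would then be run in Hive itself; the helper only produces the text.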
We are still encountering some issues. For a small dataset that is “7 Objects - 23.1 MB”, we are able to query an external Hive table that is located in S3 in Parquet format. This table is partitioned. For a dataset that is “196 Objects - 27.7 MB”, the query fails. This is the profile.
Here is the cause of your error: Caused By (com.amazonaws.SdkClientException) Unable to execute HTTP request: Timeout waiting for connection from pool
This looks like a timeout on the connection to the S3 bucket. Do you get it consistently or intermittently? Do you see any pattern, for example that it occurs on a large query but not when you query data that resides in a single partition?
We are able to read the data fine with Hive, and also when we use the same external Hive table with Presto. I am guessing Dremio can handle very large data sizes. Our Hive tables are partitioned as well. Does Dremio have any settings like Presto's for Hive on S3, such as hive.s3.connect-timeout and hive.s3.max-retry-time?
I take back what I said about the inability to set those properties. I confused the Hive source with a pure S3 source. You should be able to set whatever config properties you need (including timeouts and so on) while adding or modifying the Hive source.
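As a rough sketch of what such properties might look like, the snippet below lists a few Hadoop S3A connector settings that relate to connection pool size, timeouts, and retries, and prints them as name=value pairs to copy into the Hive source's properties. The property names come from the Hadoop S3A connector. The values are illustrative placeholders rather than recommendations, and whether a given Dremio version passes them through to its S3 client is an assumption to verify.

```python
# Candidate Hadoop S3A settings; values are illustrative placeholders only.
S3A_PROPERTIES = {
    "fs.s3a.connection.maximum": 200,    # size of the HTTP connection pool
    "fs.s3a.connection.timeout": 200000, # socket timeout, in milliseconds
    "fs.s3a.attempts.maximum": 20,       # retry attempts on transient errors
}


def as_property_lines(props: dict) -> list:
    """Format properties as name=value strings, preserving insertion order."""
    return [f"{name}={value}" for name, value in props.items()]


if __name__ == "__main__":
    for line in as_property_lines(S3A_PROPERTIES):
        print(line)
```

Since the reported error is "Timeout waiting for connection from pool", the connection pool size is probably the first setting worth experimenting with.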