In a lot of my workflows, I filter data from a variety of local and remote sources locally in R, and then I'd like to upload the result set to my Dremio instance to perform a join with a table on HDFS.
For instance, I might grab partner information from Salesforce databases and filter it down to ~10K partner IDs that I'd like to look up in the HDFS table.
I would normally hope to accomplish this by pumping the partner IDs into a temp table in Dremio and then having Dremio join against it in a subsequent SQL statement. I know I can write these IDs into a CSV and then upload it manually, but this hampers repeatability, and I'd really like to avoid the manual steps.
The other solution would be (depending on size) to inline the partner IDs into the actual SQL statement being run (WHERE partner_id IN ('x', …)), which would create the longest statement ever.
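To make the inline-IN option concrete, here's a minimal R sketch of how the statement would be assembled. The table path `hdfs_source.partners` is an illustrative assumption, not a real source name:

```r
# Sketch of the inline-IN approach: splice the filtered IDs directly
# into the statement text. `partner_ids` stands in for the real ~10K IDs.
partner_ids <- c("P001", "P002", "P003")

in_clause <- paste0("'", partner_ids, "'", collapse = ", ")
sql <- sprintf(
  "SELECT * FROM hdfs_source.partners WHERE partner_id IN (%s)",  # hypothetical path
  in_clause
)
```

With 10K IDs the resulting statement gets enormous, and drivers and servers tend to impose limits on statement length, which is exactly why a temp table would be preferable.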
It also looks like there is a way to create tables via SQL (https://docs.dremio.com/sql-reference/sql-commands/tables.html), so the concept exists, but RODBC's sqlSave seems to fail with:
Query execution error. Details: PARSE ERROR: Failure parsing the query.
How can I do this?
Can you simply add the partner/dealer IDs file to HDFS, add HDFS as a Dremio source, and perform the JOIN in Dremio? Or am I missing a restriction here?
Unfortunately, my data workflows are really complicated: they pull data from sources that Dremio can't/doesn't connect to, call APIs, and include filtering and data-manipulation steps in the middle that I don't want to somehow re-engineer into Dremio instead of R. It would also encourage adoption of Dremio if it could be implemented into the step of the workflow where it's actually needed.
My example with partner IDs is only a simple illustration of what I'm trying to do. The real process is much more complicated.
Unless I’m missing something?
The user uploads feature is not exposed via SQL commands or the REST API, so to get a dataset into Dremio without visiting the UI, it would have to be part of some source (HDFS, local FS, MySQL, etc.), and then you would add that source to Dremio.
In the case of filesystem sources, you can also write new files to them if you’ve enabled CREATE TABLE AS (i.e. CTAS) exports on that source.
Or, if you don't want to consume your CTAS results with another tool, you can just use virtual datasets (i.e. VDS) to filter your data and then JOIN these with other datasets (physical or virtual).
You should be able to submit CREATE TABLE AS or CREATE VDS statements to Dremio via your ODBC client to achieve this. When you submit the query that gives the syntax error from your R ODBC client, can you find the job in Dremio under the Jobs page?
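Submitting the CTAS over ODBC from R might look like the sketch below. Note this is an assumption-laden illustration: the DSN name and the `hdfs_source.tmp.partner_ids` path are placeholders, and the point is that `sqlSave()` generates its own CREATE TABLE/INSERT syntax (which Dremio's parser rejects), whereas `sqlQuery()` passes your statement through verbatim:

```r
library(RODBC)

# Placeholder DSN -- substitute your own Dremio ODBC data source name.
ch <- odbcConnect("Dremio DSN")

# Send the CTAS statement verbatim with sqlQuery() instead of sqlSave().
# Table path and VALUES-based source query are illustrative assumptions.
sqlQuery(ch, "
  CREATE TABLE hdfs_source.tmp.partner_ids AS
  SELECT * FROM (VALUES ('P001'), ('P002'), ('P003')) AS t(partner_id)
")

# Then join against the new table in a follow-up query:
result <- sqlQuery(ch, "
  SELECT h.*
  FROM hdfs_source.partners h
  JOIN hdfs_source.tmp.partner_ids ids
    ON h.partner_id = ids.partner_id
")
odbcClose(ch)
```

If the CTAS still errors, checking the Jobs page as suggested above should show exactly what SQL text Dremio received.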
Interesting. If I am correctly understanding you, this is what I gather:
For ad hoc data analyses, if you have some data in memory in R or Python and want to join it to some existing data store in HDFS:
You first need to write out the data to a table/file:
- In HDFS (perhaps using SparklyR or some other package),
- Or elsewhere (an Oracle table using ROracle, etc)
Then you need to make Dremio aware of the table/file that you have created (manually through the UI)
Then you must join the data sources either:
- Through a SQL statement issued by the ODBC,
- Or by creating a VDS (manually, or potentially through ODBC?)
And finally gather the data and pull it down into your local R or Python session by running a SELECT statement through the ODBC connection.
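The steps above could be sketched end to end in R roughly like this. Everything here is illustrative: the Spark master, HDFS path, Dremio DSN, and dataset paths are all assumptions for my setup, and step 2 remains manual in the UI:

```r
library(sparklyr)
library(RODBC)

# 1. Write the in-memory data frame out to HDFS from R (via sparklyr).
sc <- spark_connect(master = "yarn")  # assumed cluster setup
ids_tbl <- copy_to(sc, data.frame(partner_id = c("P001", "P002")), "partner_ids")
spark_write_parquet(ids_tbl, "hdfs:///tmp/partner_ids")  # hypothetical path

# 2. Promote /tmp/partner_ids to a dataset in the Dremio UI (manual step).

# 3. Join in Dremio and pull the result down over ODBC.
ch <- odbcConnect("Dremio DSN")  # placeholder DSN
result <- sqlQuery(ch, "
  SELECT h.*
  FROM hdfs_source.partners h
  JOIN hdfs_source.tmp.partner_ids ids
    ON h.partner_id = ids.partner_id
")
odbcClose(ch)
spark_disconnect(sc)
```

The manual promotion in step 2 is the part that breaks full automation, which is the gap described below.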
Does that make sense / Is it accurate?
If correct, this seems like a pretty big pain point for users that may be an opportunity for Dremio to solve from a product standpoint! If not, I’d love to hear about other solutions to my issue!