I was wondering if you guys have plans or thoughts on using Dremio for process steps like “geocoding” or anything that involves (parallel) requests to a remote service for enrichment purposes.
I guess in general the concept of “staging”, e.g. saving an intermediate table that is materialized after processing, seems crucial to the above? (and at the moment this is possible only in a space shared by all)
Any thoughts welcome.
This is certainly possible with our Java UDFs by creating your own function. However, I would be careful about how this is done, as you can easily hammer the external remote API service and possibly get blocked.
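To avoid hammering the remote service, any function that calls it should throttle itself on the client side. Here is a minimal sketch of such a throttle in Python; the `geocode` function and its return shape are placeholders, not a real API:

```python
import time

class RateLimiter:
    """Client-side throttle: allow at most `max_per_sec` calls per second."""
    def __init__(self, max_per_sec):
        self.min_interval = 1.0 / max_per_sec
        self.last_call = 0.0

    def wait(self):
        """Block until enough time has passed since the previous call."""
        elapsed = time.monotonic() - self.last_call
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self.last_call = time.monotonic()

limiter = RateLimiter(max_per_sec=5)

def geocode(address):
    limiter.wait()
    # The actual remote request would go here (e.g. an HTTP GET).
    return {"address": address, "lat": None, "lon": None}  # placeholder
```

The same idea applies whatever language the UDF is written in: the enrichment function pauses between calls so a full-table scan cannot exceed the service's rate limit.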
Antony, I am not sure I understand. I see one can write a custom user-defined function, but wouldn't this API then be called each time that column is accessed? You see what I mean: without a first-class concept of “staging” or a “saved table”, it's hard to see how this would work sensibly on a large dataset in practice.
Yes, the UDF would be called each time. You are correct, this wouldn't be practical on a large dataset. To create a “staging” or “saved table”, I would actually recommend creating and using our Reflections (basically an accelerated, physically optimized, materialized representation of the data). More about that here - https://www.dremio.com/tutorials/getting-started-with-data-reflections/
Also, as I posted in the other thread you replied to, Dremio has the ability to create tables too.
However, creating any intermediate table would still require an initial call to an external API, which may cause issues on large datasets regardless…
I think this is an interesting topic in general… a possible title could be “How to enable service-based augmentation in an ELT paradigm” (would you agree?). I mean, in traditional ETL this is not an issue, since ETL is strongly based on the concept of staging.
I guess there are two ways: either via a stronger $scratch concept (e.g. if you guys went toward a mixed model where materializing a table is easy and secure - at the moment $scratch is shared), or via a concept of a “virtual lookup table”… something that feels like a table but is really a wrapper for a service call (plus caching and throttling), with something like two virtual columns, “input” and “output”.
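The “virtual lookup table” idea could be sketched as a small wrapper that caches input/output pairs and throttles the underlying service. This is a hypothetical illustration, not anything Dremio provides; `fake_geocode` stands in for the real remote call:

```python
import time

class ServiceLookup:
    """Wrap a remote service as a key -> value lookup with caching + throttling."""
    def __init__(self, fetch, max_per_sec=5):
        self._fetch = fetch              # the remote call, e.g. a geocoder
        self._cache = {}                 # "input" -> "output" pairs
        self._min_interval = 1.0 / max_per_sec
        self._last = 0.0

    def __getitem__(self, key):
        if key not in self._cache:
            wait = self._min_interval - (time.monotonic() - self._last)
            if wait > 0:
                time.sleep(wait)         # throttle only on cache misses
            self._cache[key] = self._fetch(key)
            self._last = time.monotonic()
        return self._cache[key]

# Stand-in for the real geocoding service (an assumption for the sketch).
calls = []
def fake_geocode(addr):
    calls.append(addr)
    return (0.0, 0.0)

lookup = ServiceLookup(fake_geocode, max_per_sec=100)
lookup["1 Main St"]
lookup["1 Main St"]   # served from the cache; the service is hit only once
```

From the query engine's point of view, such a wrapper would behave like a two-column table keyed on the input, while repeated accesses to the same key cost nothing remotely.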
At the moment I would lean toward implementing something like this outside Dremio, but I'm interested in whether this is something you guys have talked about (retaining the good parts of ETL).
We think of Dremio as ideal for “last mile” ETL - the lightweight, “just in time” transformations that people need frequently, but without making copies of the data. For heavyweight, long-running jobs, we recommend other tools like Spark or Hive.
Keep in mind that your staging tables could easily be files on the filesystem that Dremio can access. For example, in this tutorial data is geocoded by 1) reading unique addresses from the data, 2) calling the geocoding process, and 3) saving the results as a lookup file that Dremio can then access.
Here we used CSV, but it would be more efficient to use Parquet.
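The three steps above can be sketched in a few lines of Python. The `geocode` function is a placeholder for the real service call, and the file name is illustrative; the output is a lookup file (CSV here, though Parquet would be more efficient) that Dremio could then join back against the source data:

```python
import csv

def geocode(address):
    """Placeholder for the remote geocoding call."""
    return {"lat": 0.0, "lon": 0.0}

records = [
    {"id": 1, "address": "1 Main St"},
    {"id": 2, "address": "2 Oak Ave"},
    {"id": 3, "address": "1 Main St"},  # duplicate address
]

# Step 1: read the unique addresses from the data.
unique_addresses = sorted({r["address"] for r in records})

# Steps 2 and 3: geocode each unique address and save the results
# as a lookup file that the query engine can access.
with open("geo_lookup.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["address", "lat", "lon"])
    for addr in unique_addresses:
        result = geocode(addr)
        writer.writerow([addr, result["lat"], result["lon"]])
```

Because only unique addresses are sent to the service, the number of remote calls is bounded by the cardinality of the address column rather than the row count of the dataset.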