Dremio's goals are mainly to virtualize remote database access and to provide an ETL replacement for data preparation/cleansing.
When we do data science or BI, we often need to build domain-specific datamarts involving processing functions we don't find in Dremio. These datamarts can't be Dremio virtual datasets; they need to be a physical database schema. In that case, we would need to install a data warehouse/datamart alongside Dremio, which is exactly what we wanted to avoid in the end.
We could avoid this in two ways if they were integrated into Dremio:
by being able to create a virtual dataset by applying processing UDFs to an existing virtual dataset to produce a processed virtual dataset, for instance with Spark or pandas
by implementing a datamart in Dremio for storing data; this is somewhat the approach Ignite takes by providing two means of access, i.e. through virtualization or through a physical Ignite database
What is Dremio's policy for this use case?
regards
Today, we already have the ability to extend our functions via Java UDFs. It is on our roadmap to make this process simpler by exposing a public UDF SDK as well.
We also have the ability today to create a physical table based on a VDS that other tools can then leverage: https://docs.dremio.com/sql-reference/sql-commands/create-table-as.html
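As a rough sketch, a CTAS statement like the one described on that page can be issued from any client that can talk to Dremio; here via Python and ODBC, where the DSN, space, and dataset names are placeholder assumptions:

```python
# Minimal sketch: materialize a virtual dataset into a physical table with CTAS.
# Assumes a Dremio ODBC DSN named "Dremio" and a space called "analytics";
# these names are illustrative only.
import pyodbc

conn = pyodbc.connect("DSN=Dremio", autocommit=True)
cur = conn.cursor()

# CREATE TABLE AS, per the docs linked above
cur.execute("""
    CREATE TABLE analytics.customer_segments_snapshot AS
    SELECT * FROM analytics.customer_segments_vds
""")
conn.close()
```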
You can use an external process to read from Dremio, apply the processing, and save the results as Parquet in a file system Dremio can access, such as S3, ADLS, or HDFS.
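A minimal sketch of that external-process approach, again with the DSN, dataset, column, and bucket names as placeholder assumptions:

```python
# Sketch: read a VDS from Dremio, apply processing Dremio doesn't offer
# (here a toy pandas segmentation step), and write the result as Parquet
# to S3 so Dremio can query it as a new physical dataset.
import pandas as pd
import pyodbc

# Assumed Dremio ODBC DSN and VDS name, for illustration only
conn = pyodbc.connect("DSN=Dremio", autocommit=True)
df = pd.read_sql("SELECT * FROM analytics.customer_events_vds", conn)
conn.close()

# Domain-specific processing with no built-in Dremio function,
# e.g. a custom segmentation step implemented in pandas
df["segment"] = pd.cut(
    df["total_spend"],
    bins=[0, 100, 1000, float("inf")],
    labels=["low", "mid", "high"],
)

# Persist to S3 as Parquet (needs pyarrow and s3fs installed);
# Dremio can then be pointed at this path
df.to_parquet(
    "s3://my-bucket/datamarts/customer_segments/part-0.parquet",
    index=False,
)
```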
Does this address your needs?
It might be good to share some examples of the processing you have in mind.
Thanks to both of you for your replies.
It is nice to know that we can use UDFs with Dremio. It would also be nice to have them in the UI next to the standard functions, e.g. a "Create custom function" option.
Yes, of course we can use an external process (but then you have to have, for instance, Hadoop installed) to create Parquet files. But what would be cool is the ability to create physical datasets in the UI alongside the sources and the virtual datasets, these physical datasets being used to insert or update rows. It is especially useful when we build data segmentation algorithms, for instance, where we use loops to insert data on the fly.
regards
+1 to that one: the ability to save intermediate tables from the UI, as we discussed in other threads.
My use case is geocoding and NLP, e.g. having a UDF in the UI to say "geocode this" or "NLP this" and create a new table next to the other sources, one that is not recomputed but instead sits there statically, with the ability to update/delete rows subsequently, etc.
cheers
+1, I think this would really be an awesome feature. In a way, you could then also use Dremio for a more classic ETL process.
Ideally you could choose a target, e.g. a database table, and then use an adapted VDS to insert/update records in the target based on some keys.
Today we are using lots of different tools/programs for that, e.g. MyBatis for classic ETL and the ELK stack for analyzing data, with lots of data enrichment before the data is stored in Elastic.
My dream has always been to have one tool that serves all, and I think Dremio has the potential.