Support for Create table as Select (CTAS)

We frequently do data cleansing and also analytics pipelines by gathering raw tables then doing a sequence of sql statement transformations to produce a final output table (and potentially intermediate tables as the pipeline progresses). The pipeline is still stateless, meaning we can always build the outputs from the given inputs and there are no UPDATE statements. But we would normally use a number of CREATE TABLE xxx_step1 as SELECT ** from yy where / group etc.

Knowing that Dremio does not support INSERT or UPDATE - is it possible to do these kinds of create tables statements? Is there another approach you would recommend for such a pipeline process?

We love Dremio but this one is a challenge when the pipelines are complex. Given the current limitation of the Dremio platform the only obvious solution would be to load Dremio then export intermediate steps then load the results back - which seems very cumbersome.

Thanks

1 Like

Hi Malcolm,

Why couldn’t you create a virtual data set per table. You can derive virtual data sets from other virtual data sets. This way you can create “pipelines” of SQL commands. Dremio also supports nested SELECT statements.

Am I missing something?

Christy

Interesting -
Is there a way to create virtual datasets through the ODBC connector or via some other API? The pipelines we build usually run through a script of some kind.

Your example

should actually be supported today - https://docs.dremio.com/sql-reference/sql-commands/create-table-as.html

And yes VDS can be created via our REST API in our latest release - https://docs.dremio.com/rest-api/catalog/endpoints.html

Can I build acceleration/reflections on top of the $scratch location where a CTAS table is created. This seems like the best possible solution to our use-case. Very exciting.

Sure of course. Per our documentation, the $scratch points to /scratch directory under Dremio's configured distributed cache location. As a next step, you would create a VDS off of that /scratch directory and then a reflection on top of that.

It would be great if you could report back on your final solution so others can learn. Sounds like a great use case for the new REST APIs.