Remove or Archive Old Data

It’s not clear to me how to remove or archive old data in Dremio. For example:

  1. Remove raw data that is N years old or older, but keep the summary data derived from VDSs (or keep all VDSs)
  2. Remove all raw and derived data that is N years old or older
  3. Keep only N years’ worth of data live and archive the rest. Is it possible to restore the archived data?

Thanks

My first thought would be to create separate virtual datasets, one for each year’s data, and then see whether the derived datasets can refresh using a mix of older reflections (on the virtual datasets covering the older data) and newer reflections (on those covering the newer data). If the data will not be deleted from the physical dataset, this could be a good start.
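As a rough sketch of the one-query-per-year idea, the Python snippet below generates a filtered SELECT for each year, which could then be saved as a separate virtual dataset. The table name `source.sales.orders` and the column `order_date` are hypothetical placeholders, and whether reflections on these slices can actually be mixed during refresh is untested here.

```python
from datetime import date


def yearly_slice_queries(table, date_column, first_year, last_year):
    """Return {year: SELECT statement}, one half-open date range per year."""
    queries = {}
    for year in range(first_year, last_year + 1):
        start = date(year, 1, 1).isoformat()
        end = date(year + 1, 1, 1).isoformat()
        queries[year] = (
            f"SELECT * FROM {table} "
            f"WHERE {date_column} >= DATE '{start}' "
            f"AND {date_column} < DATE '{end}'"
        )
    return queries


# Hypothetical table and column names, for illustration only.
queries = yearly_slice_queries("source.sales.orders", "order_date", 2019, 2021)
for year, sql in queries.items():
    print(year, sql)
```

Half-open ranges (`>=` start, `<` next start) keep the slices from overlapping, so a later union of the yearly datasets does not double-count rows on the year boundary.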

Next, if you had to delete older data from the physical data source, in theory you could point to the Dremio-managed Parquet files on S3 or another data store, filtered to a max date aligning with the year you choose. Derived queries could be rewritten slightly to union the Parquet data with data in other formats. From my view, it could be as simple as copying the Parquet files into a new bucket, removing dremio_managed = true, and then adding the old data in Parquet format back as a data source within Dremio.
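To illustrate the "rewrite derived queries to union Parquet with the live source" step, here is a minimal sketch that builds such a query around a cutoff date. All dataset and column names (`archive.orders_parquet`, `source.sales.orders`, `order_date`, and the column list) are assumptions made up for the example.

```python
def archive_union_query(archive_table, live_table, date_column, cutoff, columns):
    """Build a UNION ALL of archived rows (before cutoff) and live rows (from cutoff on)."""
    # An explicit column list keeps both sides of the UNION ALL aligned
    # even if the archived Parquet copy and the live source differ slightly.
    col_list = ", ".join(columns)
    return (
        f"SELECT {col_list} FROM {archive_table} "
        f"WHERE {date_column} < DATE '{cutoff}'\n"
        "UNION ALL\n"
        f"SELECT {col_list} FROM {live_table} "
        f"WHERE {date_column} >= DATE '{cutoff}'"
    )


# Hypothetical names, for illustration only.
sql = archive_union_query(
    "archive.orders_parquet",
    "source.sales.orders",
    "order_date",
    "2020-01-01",
    ["order_id", "order_date", "amount"],
)
print(sql)
```

Filtering both sides on the same cutoff means a row that still exists in both places is only returned once.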

From my reading, changing the physical dataset’s structure in the database can invalidate any derived reflections. The documentation does not address removing stale data.

Incremental refresh seems to solve some issues automatically, but it should be configurable to allow detection of older data that has moved off the physical data store. I’m not sure if Dremio has a plan for that.

The “Virtual Datasets - Chaining Datasets” page suggests that adding a column would require updates to downstream queries.

Many thanks for the ideas. In the absence of proper support from Dremio, I think I will have to try something like this.

@RayC

Apologies for missing this thread.

Dremio does not modify source data. VDSs are just views (like database views) and do not store data either. So if you want to archive or delete data from your source, you need to do it directly on the source.

If you are using Dremio’s CTAS feature to write data back to the lake, then that is definitely in your control, as you can drop the CTAS table. Unfortunately, you cannot delete rows selectively.
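Since a CTAS table can only be dropped as a whole, one possible workaround (an assumption on my part, not something confirmed in this thread) is to write a second CTAS table containing only the rows to keep and then drop the original. The sketch below just generates the two statements; the table names and the `order_date` column are hypothetical, and anything that referenced the old table would need to be repointed.

```python
def rebuild_ctas_statements(old_table, new_table, date_column, cutoff):
    """Return SQL to copy rows on or after cutoff into a new table, then drop the old one."""
    return [
        # Step 1: write only the rows worth keeping into a new CTAS table.
        f"CREATE TABLE {new_table} AS "
        f"SELECT * FROM {old_table} WHERE {date_column} >= DATE '{cutoff}'",
        # Step 2: drop the original table once the copy has been verified.
        f"DROP TABLE {old_table}",
    ]


# Hypothetical table and column names, for illustration only.
statements = rebuild_ctas_statements(
    "lake.curated.orders_all", "lake.curated.orders_recent", "order_date", "2020-01-01"
)
for stmt in statements:
    print(stmt)
```

The statements are returned separately so the drop can be held back until the new table's row count has been checked.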

Now, if you have reflections, you can drive the business logic in the VDS to control the number of months or years you want to reflect. Again, we currently do not have selective deletion in a reflection. You can drop a reflection via the API or UI, but the physical Parquet files of the reflection are only deleted after 4 hours.
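As a minimal sketch of limiting what gets reflected through the VDS definition, the snippet below computes a cutoff N years back and builds the filtered SELECT that the VDS would use. The table and column names are hypothetical, and the cutoff is baked in as a literal, so the VDS would need to be regenerated periodically to keep the window rolling.

```python
from datetime import date


def years_ago(today, years):
    """Return the same calendar day `years` earlier, falling back to Feb 28 for leap days."""
    try:
        return today.replace(year=today.year - years)
    except ValueError:
        # today is Feb 29 and the target year is not a leap year
        return today.replace(year=today.year - years, day=28)


def windowed_view_query(table, date_column, today, years):
    """Build a SELECT that keeps only the last `years` years of rows."""
    cutoff = years_ago(today, years).isoformat()
    return f"SELECT * FROM {table} WHERE {date_column} >= DATE '{cutoff}'"


# Hypothetical table and column names; a fixed date keeps the example reproducible.
sql = windowed_view_query("source.sales.orders", "order_date", date(2022, 6, 15), 2)
print(sql)
```

A reflection defined on a VDS like this would only cover the chosen window, while the older rows stay untouched in the source.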

Kindly let us know if you have any other questions; we are happy to help!

Thanks
Bali