Non-deterministic indexing error during virtual dataset catalog creation
Hi,
I’m encountering a non-deterministic indexing error with Dremio on my platform. My dataset upload workflow uploads a directory of Parquet files to S3 and immediately attempts to index it with Dremio. Indexing involves creating a physical dataset catalog with POST api/v3/catalog/{catalog_id}, followed by a virtual dataset catalog (same API, different payload). This works correctly most of the time, but one upload pipeline fails intermittently. The payloads are:
physical_dataset_payload = {
    "entityType": "dataset",
    "path": physical_dataset_catalog_path,
    "type": "PHYSICAL_DATASET",
    "format": {"type": "Parquet"},
}
virtual_dataset_payload = {
    "entityType": "dataset",
    "path": virtual_dataset_catalog_path,
    "type": "VIRTUAL_DATASET",
    "sql": <query>,
    "sqlContext": physical_dataset_path,
}
This pipeline generally produces indexable datasets. However, in about 10-15% of cases, virtual catalog creation throws the following error:
com.dremio.dac.service.errors.NewDatasetQueryException: Unable to create dataset. Selected table has no columns.
As a result, the dataset is not indexed. The strange part is that the error appears to occur at random, and re-running the same workflow always produces an indexable dataset. If I open the dataset in the Dremio UI, I can always index it manually, and the data is correct.
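Since re-running the workflow succeeds, one possible stopgap is to retry only the virtual dataset step with a backoff instead of re-running the whole upload. The following is a minimal sketch of that idea, not a fix for the underlying cause. `create_with_retry`, `EmptySchemaError`, and `create_fn` are hypothetical names: `create_fn` is assumed to be a callable that sends the virtual dataset request and raises `EmptySchemaError` when the response contains the "Selected table has no columns" message.

```python
import time


class EmptySchemaError(Exception):
    """Hypothetical stand-in for the 'Selected table has no columns' failure."""


def create_with_retry(create_fn, attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call create_fn until it succeeds or the attempts are exhausted.

    create_fn is an assumed callable that issues the virtual dataset request
    and raises EmptySchemaError on the intermittent failure. Any other
    exception propagates immediately without a retry.
    """
    for attempt in range(attempts):
        try:
            return create_fn()
        except EmptySchemaError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the original error
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2 ** attempt))
```

The `sleep` parameter is injectable only so the backoff can be exercised without real waiting; in the pipeline the default `time.sleep` would apply. Whether a delay between the physical and virtual steps actually avoids the error is an assumption I have not verified.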
I’ve not been able to find any documentation/blog post which tells me how I can deal with this issue. Any assistance would be appreciated. Thank you!