Non-deterministic indexing error during virtual dataset catalog creation

Hi,

I’m encountering a non-deterministic indexing error on my platform related to Dremio. I have a dataset upload workflow that uploads a directory of Parquet files to S3 and immediately indexes it with Dremio. Indexing here means creating a physical dataset via POST api/v3/catalog/{catalog_id}, followed by a virtual dataset (same API, different payload). This works correctly most of the time, but one upload pipeline hits an intermittent failure.


physical_dataset_payload = {
    "entityType": "dataset",
    "path": physical_dataset_catalog_path,
    "type": "PHYSICAL_DATASET",
    "format": {"type": "Parquet"},
}

virtual_dataset_payload = {
    "entityType": "dataset",
    "path": virtual_dataset_catalog_path,
    "type": "VIRTUAL_DATASET",
    "sql": <query>,
    "sqlContext": physical_dataset_path,
}

This pipeline generally produces indexable datasets. However, in about 10-15% of cases, virtual catalog creation throws the following error:

com.dremio.dac.service.errors.NewDatasetQueryException: Unable to create dataset. Selected table has no columns.

And the dataset is not indexed. The bizarre thing is that this error occurs seemingly at random, and re-running the same workflow always produces an indexable dataset. If I open the dataset in the Dremio UI, I can always index it manually, and the data is correct and sane.

I haven’t been able to find any documentation or blog post that addresses this issue. Any assistance would be appreciated. Thank you!

@Sarthak-at-Ikigai When you say “indexing”, do you mean promoting to a PDS? Is there a reason you promote every time? Do you remove the format on the Dremio side before each load and then do your POST again? If you are only adding files, you just have to refresh metadata; but if you are deleting the entire folder and recreating it, then try this:

  • Remove the formatting on the existing folder, either through the API or SQL
  • Once ETL is done, the new folder is created and files are added
  • Add the formatting again
  • Query the dataset
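The steps above could be scripted against the v3 catalog API. This is only a sketch of the call sequence, not a tested client: the dataset/folder ids and the exact endpoints should be verified against your Dremio version, and the ETL step happens outside Dremio entirely.

```python
def reload_cycle_plan(dataset_id, folder_id):
    """Ordered REST calls for the suggested cycle: un-promote
    (remove formatting from) the old physical dataset, then, after
    the ETL has recreated the folder, promote it again. Returns
    (method, path) pairs; a real client would execute each call
    with auth headers against the coordinator."""
    return [
        # 1. remove formatting: un-promote the existing PDS
        ("DELETE", f"/api/v3/catalog/{dataset_id}"),
        # 2. (outside Dremio) ETL recreates the folder and uploads files
        # 3. add formatting again: promote the refreshed folder
        ("POST", f"/api/v3/catalog/{folder_id}"),
    ]
```

After step 3 the dataset can be queried as usual; if you are only appending files rather than replacing the folder, a metadata refresh on the existing PDS should suffice instead of this full cycle.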