Creating custom dataset tags

Russ_Wilson · November 25, 2017, 4:02pm

I wanted to see if there is any way to assign custom tags to datasets?

yufeldman · November 25, 2017, 4:19pm

What would be the idea behind assigning custom tags?
You can always create virtual datasets with different names, save them into a space/folder, and use those in your queries.

Russ_Wilson · November 25, 2017, 5:02pm

It would be more geared towards using Dremio as a data catalog: allowing users to add a description and tags to a dataset. Tags would help users search once we have a lot of physical and virtual datasets.

kelly · November 25, 2017, 9:03pm

Hey Russ - completely agree with you, and it’s good to hear how you’re thinking about using tags. We’ll be sure to include this in our internal spec. cc @can who may follow up with more questions.

We’re planning to make lots of improvements in this area so that it’s easy to search and find datasets, and so you can flag datasets as “vetted”, as well as many other things.

Thanks for the feedback!

Russ_Wilson · November 26, 2017, 5:09pm

That sounds good. Something else that holds back data discovery, and would be nice to have, is the ability to easily connect datasets. Allowing users to identify the best attributes (column names) to use for data integration on a given dataset would be a basic feature. You could also tie that in with a controlled vocabulary for column names.

kelly · November 26, 2017, 5:48pm

Hey Russ - have you noticed yet that Dremio will recommend joins based on the behavior of other users? When you’re in a dataset and click join, the first option will be a list of different joins that Dremio knows will work, sorted by popularity.

Is this aligned with what you had in mind, or are you talking about something different?

Could you say a little more about the controlled vocabulary idea you mentioned?

Thanks!

Russ_Wilson · November 26, 2017, 6:42pm

No, I had not used that feature yet. So that is all based on historical joins?

As for controlled vocab, across the enterprise we strive to utilize standard vocabulary as much as possible.
We could define common nomenclatures (ontologies), and then when a dataset is registered, Dremio could let us select from the common nomenclature list to rename the fields. At the same time, a simple tag could let others know that a given field is a good choice to join on.

I know some solutions use machine learning to identify common fields between datasets, but for the most part we have a good understanding of our datasets, and it would be easy enough for us to identify these fields as we register them.
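
To make the controlled-vocabulary idea concrete, here is a minimal Python sketch of renaming fields to standard terms at registration time and tagging the ones that are good join candidates. This is not a Dremio API: `VOCABULARY`, `Column`, `standardize`, and the `join-key` tag are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical controlled vocabulary: standard term -> known aliases.
VOCABULARY = {
    "customer_id": {"cust_id", "custid", "customer_no"},
    "order_date": {"ord_dt", "orderdate"},
}


@dataclass
class Column:
    name: str
    tags: set = field(default_factory=set)


def standardize(raw_names, join_keys=()):
    """Rename columns to vocabulary terms and tag the ones suited to joins."""
    # Invert the vocabulary so each alias points at its standard term.
    alias_to_term = {
        alias: term for term, aliases in VOCABULARY.items() for alias in aliases
    }
    columns = []
    for raw in raw_names:
        key = raw.strip().lower()
        # Names with no known alias are kept as-is (lowercased).
        name = alias_to_term.get(key, key)
        col = Column(name)
        if name in join_keys:
            col.tags.add("join-key")  # hypothetical tag name
        columns.append(col)
    return columns


cols = standardize(["Cust_ID", "ord_dt", "amount"], join_keys={"customer_id"})
```

In this sketch the person registering the dataset supplies `join_keys` by hand, which matches the point above that the team already knows its datasets well enough not to need machine learning for this.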

kelly · November 26, 2017, 7:20pm

Thanks for the suggestions on controlled vocab!

To answer your question, yes, recommended joins are based on historical joins observed from any SQL query as well as data curation activity in the app. Here’s an example of what you might see after clicking on the join button for a dataset, where there are recommended joins to multiple datasets:

You can always create a custom join from here, of course, but these recommendations helpfully allow you to skip a few steps.

So in your case, one way to “register” a join is to simply use it in a query. It will then be recorded automatically and become one of the recommendations.
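
As a rough illustration of ranking joins by popularity, the Python sketch below counts previously observed joins and lists the most frequent ones first. It is only a guess at the general idea, not how Dremio implements it; `recommend_joins` and the `(left, right, key)` history format are assumptions made for this example.

```python
from collections import Counter


def recommend_joins(history, dataset):
    """Return (other_dataset, key) joins involving `dataset`, most frequent first."""
    counts = Counter()
    for left, right, key in history:
        # Treat a join as the same recommendation regardless of direction.
        if dataset == left:
            counts[(right, key)] += 1
        elif dataset == right:
            counts[(left, key)] += 1
    return [join for join, _ in counts.most_common()]


# Hypothetical log of joins observed in past queries.
history = [
    ("orders", "customers", "customer_id"),
    ("customers", "orders", "customer_id"),
    ("orders", "products", "product_id"),
    ("orders", "customers", "customer_id"),
]
recs = recommend_joins(history, "orders")
```

Every query that uses a join adds one more entry to the history, so a join that people actually use rises in the list without anyone registering it explicitly.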

can · November 27, 2017, 6:34pm

Hey @Russ_Wilson, thank you for the feedback! As Kelly mentioned, we’ll be adding more data catalog functionality in the near future. We believe this adds a lot of value for teams using Dremio.

I’m curious to learn more about how you are addressing this problem today. Are you using a Data Catalog tool, an internal wiki, etc.? Out of the things you mentioned (tags, glossary, description, column annotations, etc.), which would you consider to be most useful?

Russ_Wilson · November 27, 2017, 9:59pm

Today we have a few solutions that have been created, but primarily it’s a wiki.

As for priority, here are a few that come to mind.

dataset descriptions
dataset tagging
pre-configure recommended joins between datasets
column annotations (description and tags?)
require a common set of tags for all datasets
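
Several of the items above (descriptions, tags, column annotations, and a required common set of tags) could live in a single catalog record. The Python sketch below shows one possible shape; `CatalogEntry`, `REQUIRED_TAG_PREFIXES`, and `search_by_tag` are hypothetical and do not correspond to any Dremio feature.

```python
from dataclasses import dataclass, field

# Hypothetical policy: every dataset needs one tag from each of these groups.
REQUIRED_TAG_PREFIXES = ("domain:", "owner:")


@dataclass
class CatalogEntry:
    name: str
    description: str = ""
    tags: set = field(default_factory=set)
    column_notes: dict = field(default_factory=dict)  # column name -> description

    def missing_required_tags(self):
        """List the required tag groups this dataset has not been given."""
        return [
            prefix
            for prefix in REQUIRED_TAG_PREFIXES
            if not any(tag.startswith(prefix) for tag in self.tags)
        ]


def search_by_tag(entries, tag):
    """Return the names of datasets carrying the given tag."""
    return [entry.name for entry in entries if tag in entry.tags]


entries = [
    CatalogEntry(
        "sales.orders",
        "One row per order",
        {"domain:sales", "owner:finance", "vetted"},
        {"customer_id": "Standard customer key"},
    ),
    CatalogEntry("tmp.scratch", tags={"domain:sales"}),
]
```

Here `entries[1].missing_required_tags()` reports the `owner:` group as missing, which is the kind of check a "require a common set of tags" rule would run at registration.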

can · November 28, 2017, 9:22pm

Thanks @Russ_Wilson!
