I wanted to see if there is any way to assign custom tags to datasets?
What would be the idea behind assigning custom tags?
You can always create virtual datasets with different names, save them in a space or folder, and use those in your queries.
It would be more geared towards using Dremio as a data catalog: allowing users to add a description and tags to a dataset. Tags would help users search once we have a lot of physical and virtual datasets.
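To make the request concrete, here is a minimal Python sketch of what description and tag metadata with tag-based search could look like. This is not a Dremio API; `DatasetEntry`, `find_by_tags`, and the sample dataset paths are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    """Hypothetical catalog record for a physical or virtual dataset."""
    path: str
    description: str = ""
    tags: set = field(default_factory=set)


def find_by_tags(entries, required_tags):
    """Return the entries that carry every requested tag (case-insensitive)."""
    required = {tag.lower() for tag in required_tags}
    return [
        entry for entry in entries
        if required <= {tag.lower() for tag in entry.tags}
    ]


if __name__ == "__main__":
    catalog = [
        DatasetEntry("sales.orders", "Raw order records", {"Sales", "vetted"}),
        DatasetEntry("hr.staff", "Employee roster", {"hr"}),
    ]
    for entry in find_by_tags(catalog, ["sales"]):
        print(entry.path, "-", entry.description)
```

Matching on all requested tags (rather than any) is just one possible choice; it keeps results narrow as the number of datasets grows.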
Hey Russ - completely agree with you, and it’s good to hear how you’re thinking about using tags. We’ll be sure to include this in our internal spec. cc @can who may follow up with more questions.
We’re planning to make lots of improvements in this area so that it’s easy to search and find datasets, and so you can flag datasets as “vetted”, as well as many other things.
Thanks for the feedback!
That sounds good. Something else that would be nice, and that currently holds back data discovery, is being able to easily connect datasets. A basic feature would be letting users identify the best attributes (column names) to use for data integration on a given dataset. You could also tie that in with a controlled vocabulary for column names.
Hey Russ - have you noticed yet that Dremio will recommend joins based on the behavior of other users? When you’re in a dataset and click join, the first option will be a list of different joins that Dremio knows will work, sorted by popularity.
Is this aligned with what you had in mind, or are you talking about something different?
Could you say a little more about the controlled vocabulary idea you mentioned?
Thanks!
No, I had not used that feature yet. So that is all based on historical joins?
As for controlled vocab, across the enterprise we strive to utilize standard vocabulary as much as possible.
We could define common nomenclatures (ontologies), and then when a dataset is registered, Dremio could let us select from the common nomenclature list to rename the fields. At the same time, we could use a simple tag to let others know that a field is a good choice to join on.
I know some solutions use machine learning to identify common fields between datasets, but for the most part we have a good understanding of our datasets, and it would be easy enough for us to identify these fields as we register them.
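As a rough illustration of the controlled vocabulary idea, here is a minimal Python sketch that renames fields against a shared vocabulary and tags the ones that are good join candidates. Nothing here is a Dremio API; `VOCABULARY`, `JOIN_KEYS`, `standardize_columns`, and the sample column names are all hypothetical.

```python
# Hypothetical sketch of the idea; not a Dremio API.

# Controlled vocabulary: local column name (lowercased) -> standard term.
VOCABULARY = {
    "cust_id": "customer_id",
    "customerid": "customer_id",
    "ord_dt": "order_date",
}

# Standard terms that curators have marked as good fields to join on.
JOIN_KEYS = {"customer_id"}


def standardize_columns(columns):
    """Return one record per column with its standard name and tags."""
    result = []
    for name in columns:
        # Columns without a vocabulary entry keep their original name.
        standard = VOCABULARY.get(name.lower(), name)
        tags = ["join-key"] if standard in JOIN_KEYS else []
        result.append({"original": name, "standard": standard, "tags": tags})
    return result


if __name__ == "__main__":
    for column in standardize_columns(["CUST_ID", "ord_dt", "amount"]):
        print(column)
```

In this sketch the mapping is maintained by hand at registration time, which matches the case where the team already knows its datasets well.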
Thanks for the suggestions on controlled vocab!
To answer your question, yes, recommended joins are based on historical joins observed from any SQL query as well as data curation activity in the app. Here’s an example of what you might see after clicking on the join button for a dataset, where there are recommended joins to multiple datasets:
You can always create a custom join from here, of course, but these recommendations helpfully allow you to skip a few steps.
So in your case, one way to “register” a join is to simply use it in a query; it will then automatically be registered and become one of the recommendations.
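For intuition only, the popularity-based ranking described above can be sketched in a few lines of Python. This is not how Dremio implements it internally; `recommend_joins` and the `(left, right, key)` tuples standing in for observed joins are assumptions made for the example.

```python
from collections import Counter


def recommend_joins(dataset, observed_joins):
    """Rank join partners for `dataset` by how often each join was observed.

    `observed_joins` is an iterable of (left, right, key) tuples, a
    hypothetical stand-in for joins seen in past queries.
    """
    counts = Counter()
    for left, right, key in observed_joins:
        if left == dataset:
            counts[(right, key)] += 1
        elif right == dataset:
            counts[(left, key)] += 1
    # most_common() sorts by count, highest first.
    return counts.most_common()


if __name__ == "__main__":
    history = [
        ("orders", "customers", "customer_id"),
        ("customers", "orders", "customer_id"),
        ("orders", "products", "product_id"),
        ("customers", "regions", "region_id"),
    ]
    for (other, key), count in recommend_joins("orders", history):
        print(f"join with {other} on {key} (seen {count} times)")
```

Under this model, running a join once is enough to make it appear in the list, and repeated use moves it toward the top.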
Hey @Russ_Wilson, thank you for the feedback! As Kelly mentioned, we’ll be adding more data catalog functionality in the near future. We believe this adds a lot of value for teams using Dremio.
I’m curious to learn more about how you are addressing this problem today. Are you using a Data Catalog tool, an internal wiki, etc.? Out of the things you mentioned (tags, glossary, description, column annotations, etc.), which would you consider to be most useful?
Today we have a few solutions that have been created, but primarily it's a wiki.
As for priorities, here are a few that come to mind:
- dataset descriptions
- dataset tagging
- pre-configured recommended joins between datasets
- column annotations (description and tags?)
- require a common set of tags for all datasets
Thanks @Russ_Wilson!