In katiML, tags are name-value pairs that can be attached to datapoints. They help you select and manage your metadata.
Tags are created and updated with a dictionary named tags in the datapoint object. Because dictionary keys are unique, each tag name (tags.name) is unique for a datapoint (datapoints.id):
from dioptra.lake.utils import upload_to_lake

upload_to_lake({
    "id": ...,
    "tags": {
        # Add or update a tag with name "foo"
        "foo": "bar",
        # Set to None (null in JSON) to delete the tag with name "baz"
        "baz": None,
        ...
    },
    "predictions": [...]
})
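The add, update, and delete behavior described above can be modeled with a plain dictionary merge. This is only an illustrative sketch: apply_tag_updates is a hypothetical helper, not part of katiML, and it assumes nothing about how the service stores tags.

```python
def apply_tag_updates(existing, updates):
    """Return a new tag dict after applying updates; a None value deletes the tag."""
    merged = dict(existing)  # copy so the caller's dict is left untouched
    for name, value in updates.items():
        if value is None:
            merged.pop(name, None)  # delete the tag if it exists
        else:
            merged[name] = value  # add a new tag or overwrite the old value
    return merged


# Hypothetical tags already attached to a datapoint.
current = {"foo": "old", "baz": "qux"}
result = apply_tag_updates(current, {"foo": "bar", "baz": None})
print(result)
```

Since names are dictionary keys, sending the same name twice can only ever overwrite, which is why a datapoint never ends up with two tags of the same name.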
Tags Structure
Tags are a child table of datapoints. As such, you can retrieve them together with datapoints by requesting the tags.* fields, and you can use them to filter datapoints.
Tags can be used anywhere you use datapoint filters. For example, the following filters will select all datapoints with tags source: stanford_dogs AND dataset: train.
The dataframe returned by select_datapoints contains the selected datapoints and a column named tags. This column corresponds to the requested child table tags.* and holds the tags attached to each selected datapoint.
To get a flat list of tags, we explode the datapoints dataframe along the tags column so that each row holds a single tag. We then turn each tag dictionary into a row with .apply(pd.Series).
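The explode step can be sketched with pandas alone. The frame below is a small hand-built stand-in for the result of select_datapoints, which is an assumption: the ids are shortened placeholders and each tag carries only three fields, whereas real tags also include id and organization_id.

```python
import pandas as pd

# Hypothetical stand-in for the frame returned by select_datapoints:
# one row per datapoint, with a "tags" column holding a list of tag dicts.
datapoints = pd.DataFrame({
    "id": ["dp-dogs", "dp-cats"],
    "tags": [
        [
            {"name": "source", "value": "stanford_dogs", "datapoint": "dp-dogs"},
            {"name": "dataset", "value": "train", "datapoint": "dp-dogs"},
        ],
        [
            {"name": "source", "value": "stanford_cats", "datapoint": "dp-cats"},
            {"name": "dataset", "value": "test", "datapoint": "dp-cats"},
        ],
    ],
})

# One row per tag: explode the lists, drop datapoints without tags,
# then expand each tag dictionary into columns.
tags = datapoints.explode("tags")["tags"].dropna().apply(pd.Series)
print(tags)
```

The exploded rows keep the index of the datapoint they came from, so the index repeats once per tag.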
The terminal prints a dataframe of tags similar to this:
                                     id     name          value                             datapoint           organization_id
0  1ae13d87-abcf-475f-87f0-78381960499f   source  stanford_dogs  55a9f19b-8723-48d0-a235-cddaf50e4f38  63ee72748d1ad3fb82cec9ab
0  4de84517-f53f-4182-8015-7e84c9ebc350  dataset          train  55a9f19b-8723-48d0-a235-cddaf50e4f38  63ee72748d1ad3fb82cec9ab
1  6fafb890-30aa-41e1-b92c-897caf509e9f  dataset           test  bf80ca4d-a8de-4f2e-8970-3b5beb4958ae  63ee72748d1ad3fb82cec9ab
1  b070dedc-6bf0-4d74-9519-ddc84f84e5b9   source  stanford_cats  bf80ca4d-a8de-4f2e-8970-3b5beb4958ae  63ee72748d1ad3fb82cec9ab
Grouping Tags by Name and Value
Next we'll group the tags by name and value so we can select groups of datapoints.
Here we use the pandas grouping operators to aggregate the datapoint column into a list of datapoint ids for each unique combination of tag name and value.
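One way to write this aggregation is shown below. It is a sketch built on a hand-made tags frame with placeholder datapoint ids, so the variable contents are assumptions. Only the pandas calls (groupby, agg, to_frame) are standard.

```python
import pandas as pd

# Hypothetical flat tags frame, shaped like the output of the explode step.
tags = pd.DataFrame(
    {
        "name": ["source", "dataset", "source", "dataset"],
        "value": ["stanford_dogs", "train", "stanford_cats", "test"],
        "datapoint": ["dp-dogs", "dp-dogs", "dp-cats", "dp-cats"],
    },
    index=[0, 0, 1, 1],
)

# Collect the datapoint ids into one list per (name, value) pair.
# The result is indexed by a (name, value) MultiIndex.
grouped_tags = tags.groupby(["name", "value"])["datapoint"].agg(list).to_frame()
print(grouped_tags)
```

Keeping the result as a frame with a single datapoint column makes it possible to look up a group by its (name, value) pair.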
                                                    datapoint
name    value
dataset test           [bf80ca4d-a8de-4f2e-8970-3b5beb4958ae]
        train          [55a9f19b-8723-48d0-a235-cddaf50e4f38]
source  stanford_cats  [bf80ca4d-a8de-4f2e-8970-3b5beb4958ae]
        stanford_dogs  [55a9f19b-8723-48d0-a235-cddaf50e4f38]
Selecting Datapoints Based on Tag Values
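# Look up the ids of the datapoints tagged source: stanford_cats.
# The tuple selects the (name, value) row and avoids ambiguity with column labels.
datapoint_ids_tagged_cats = grouped_tags.loc[('source', 'stanford_cats'), 'datapoint']
# Get the datapoints that are tagged as cats in the datapoints frame.
print(datapoints[datapoints['id'].isin(datapoint_ids_tagged_cats)])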