Managing Datapoints with Tags
In katiML, tags are name-value pairs that can be attached to datapoints. They help you select and manage your metadata.
Tags are created and updated with a dictionary named tags in the datapoint object. This ensures that each tag name (tags.name) is unique for a given datapoint (datapoints.id):
from dioptra.lake.utils import upload_to_lake

upload_to_lake({
    "id": ...,
    "tags": {
        # Add or update a tag with name "foo"
        "foo": "bar",
        # Set to None to delete the tag with name "baz"
        "baz": None,
        ...
    },
    "predictions": [...]
})
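Because tag names are unique per datapoint, you can also update the tags of an existing datapoint by re-uploading a record with its id. Here is a minimal sketch assuming you already know the datapoint's id and that the lake accepts a partial record containing only the id and tags (the id and tag values below are placeholders):
from dioptra.lake.utils import upload_to_lake

# Placeholder id of a datapoint that already exists in the lake.
existing_datapoint_id = 'YOUR_DATAPOINT_ID'

upload_to_lake(records=[{
    'id': existing_datapoint_id,
    'tags': {
        # Add or overwrite the "dataset" tag.
        'dataset': 'validation',
        # Remove the "source" tag.
        'source': None
    }
}])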
Tags structure
Tags are a child table of datapoints. As such, you can retrieve them with the fields argument and use them to filter datapoints with the filters argument of select_datapoints.
datapoints_dataframe = select_datapoints(filters=[...], fields=[...])
tags.name: The name of the tag. Unique for a given datapoint.
tags.value: The value of the tag.
tags.datapoint: The id of the datapoint this tag is attached to.
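For instance, a minimal sketch of a query that returns each datapoint's id together with its tags, keeping only datapoints that carry a tag named dataset (the field list and filter value are just examples):
from dioptra.lake.utils import select_datapoints

# Fetch ids and all tag columns for datapoints that have a "dataset" tag.
datapoints_dataframe = select_datapoints(
    fields=['id', 'tags.*'],
    filters=[{
        'left': 'tags.name',
        'op': '=',
        'right': 'dataset'
    }])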
Tags Usage
Tags can be used anywhere you use datapoint filters. For example, the following filters will select all datapoints with the tags source: stanford_dogs AND dataset: train.
[{
    "left": "tags.name",
    "op": "=",
    "right": "source"
}, {
    "left": "tags.value",
    "op": "=",
    "right": "stanford_dogs"
}, {
    "left": "tags.name",
    "op": "=",
    "right": "dataset"
}, {
    "left": "tags.value",
    "op": "=",
    "right": "train"
}]
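Passed to select_datapoints, these filters could look like the following sketch (the requested fields are just an example):
from dioptra.lake.utils import select_datapoints

# Datapoints tagged source: stanford_dogs AND dataset: train.
train_dogs = select_datapoints(
    fields=['id', 'tags.*', 'metadata.uri'],
    filters=[{
        'left': 'tags.name',
        'op': '=',
        'right': 'source'
    }, {
        'left': 'tags.value',
        'op': '=',
        'right': 'stanford_dogs'
    }, {
        'left': 'tags.name',
        'op': '=',
        'right': 'dataset'
    }, {
        'left': 'tags.value',
        'op': '=',
        'right': 'train'
    }])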
We'll illustrate the usage of tags with the following code:
import os
import pandas as pd
# os.environ['DIOPTRA_API_KEY'] = 'DIOPTRA_API_KEY'
from dioptra.lake.utils import upload_to_lake, wait_for_upload, delete_datapoints, select_datapoints
# Upload your metadata.
upload_id = upload_to_lake(records=[{
    'metadata': {
        'uri': 'https://dog.jpg'
    },
    'type': 'IMAGE',
    'groundtruth': {
        'task_type': 'CLASSIFICATION',
        'class_name': 'chihuahua'
    },
    'tags': {
        'source': 'stanford_dogs',
        'dataset': 'train'
    }
}, {
    'metadata': {
        'uri': 'https://cat.jpg'
    },
    'type': 'IMAGE',
    'groundtruth': {
        'task_type': 'CLASSIFICATION',
        'class_name': 'bengal'
    },
    'tags': {
        'source': 'stanford_cats',
        'dataset': 'test'
    }
}])
upload = wait_for_upload(upload_id)
# Select datapoints with a filter allowing multiple tag values.
datapoints = select_datapoints(fields=[
    'id', 'tags.*', 'metadata.uri', 'groundtruths.class_name'
], filters=[{
    'left': 'tags.name',
    'op': '=',
    'right': 'source'
}, {
    'left': 'tags.value',
    'op': 'in',
    'right': ['stanford_dogs', 'stanford_cats']
}])
# Retrieve a list of all tags on the selected datapoints.
tags_df = datapoints.explode('tags')['tags'].apply(pd.Series)
print(tags_df)
# Group tags by name and value so we can filter on those.
grouped_tags = tags_df.groupby(['name', 'value'])[['datapoint']].agg(list)
print(grouped_tags)
# Get the datapoints that are tagged as cats in the datapoints frame.
datapoint_ids_tagged_cats = grouped_tags.loc['source', 'stanford_cats']['datapoint']
print(datapoints[datapoints['id'].isin(datapoint_ids_tagged_cats)])
Retrieving the list of tags
Assuming you went through the Quick Start and Ingestion Basics, let's review the following line:
tags_df = datapoints.explode('tags')['tags'].apply(pd.Series)
The dataframe returned by select_datapoints contains one row per datapoint and a column named tags corresponding to the requested child table tags.*, i.e. the tags attached to the datapoints we are selecting. We explode the datapoints dataframe along the tags column to get a flat list of tags, then turn each tag dictionary into a row of columns with .apply(pd.Series).
The terminal prints a dataframe of tags, something like this:
id name value datapoint organization_id
0 1ae13d87-abcf-475f-87f0-78381960499f source stanford_dogs 55a9f19b-8723-48d0-a235-cddaf50e4f38 63ee72748d1ad3fb82cec9ab
0 4de84517-f53f-4182-8015-7e84c9ebc350 dataset train 55a9f19b-8723-48d0-a235-cddaf50e4f38 63ee72748d1ad3fb82cec9ab
1 6fafb890-30aa-41e1-b92c-897caf509e9f dataset test bf80ca4d-a8de-4f2e-8970-3b5beb4958ae 63ee72748d1ad3fb82cec9ab
1 b070dedc-6bf0-4d74-9519-ddc84f84e5b9 source stanford_cats bf80ca4d-a8de-4f2e-8970-3b5beb4958ae 63ee72748d1ad3fb82cec9ab
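If you want to see what explode followed by .apply(pd.Series) does in isolation, here is a small self-contained pandas sketch with made-up ids and tag dictionaries mimicking the shape of the tags column:
import pandas as pd

# Two datapoints, each carrying a list of tag dictionaries.
toy_datapoints = pd.DataFrame({
    'id': ['dp-1', 'dp-2'],
    'tags': [
        [{'name': 'source', 'value': 'stanford_dogs', 'datapoint': 'dp-1'},
         {'name': 'dataset', 'value': 'train', 'datapoint': 'dp-1'}],
        [{'name': 'source', 'value': 'stanford_cats', 'datapoint': 'dp-2'}]
    ]
})

# explode yields one row per tag dictionary; apply(pd.Series) turns each
# dictionary into name/value/datapoint columns.
toy_tags_df = toy_datapoints.explode('tags')['tags'].apply(pd.Series)
print(toy_tags_df)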
Grouping Tags by Name and Value
Next we'll group the tags by name and value so we can select groups of datapoints.
grouped_tags = tags_df.groupby(['name', 'value'])[['datapoint']].agg(list)
Here we use the pandas grouping operators to aggregate the datapoint column into a list of datapoint ids for each unique tag name and value.
datapoint
name value
dataset test [bf80ca4d-a8de-4f2e-8970-3b5beb4958ae]
train [55a9f19b-8723-48d0-a235-cddaf50e4f38]
source stanford_cats [bf80ca4d-a8de-4f2e-8970-3b5beb4958ae]
stanford_dogs [55a9f19b-8723-48d0-a235-cddaf50e4f38]
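The same grouping is handy for quick summaries. For example, a sketch that counts how many distinct datapoints carry each tag:
# Number of distinct datapoints per (tag name, tag value) pair.
tag_counts = tags_df.groupby(['name', 'value'])['datapoint'].nunique()
print(tag_counts)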
Selecting Datapoints Based on Tag Values
Finally, we use the grouped index to look up the datapoint ids carrying the tag source: stanford_cats and select the matching rows of the datapoints dataframe.
# Get the datapoints that are tagged as cats in the datapoints frame.
datapoint_ids_tagged_cats = grouped_tags.loc['source', 'stanford_cats']['datapoint']
print(datapoints[datapoints['id'].isin(datapoint_ids_tagged_cats)])
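The same lookup works for any tag name and value; for example, a sketch that selects the datapoints tagged dataset: train instead:
# Get the datapoints that are tagged as the train split.
datapoint_ids_tagged_train = grouped_tags.loc['dataset', 'train']['datapoint']
print(datapoints[datapoints['id'].isin(datapoint_ids_tagged_train)])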