Managing Datapoints with Tags
In katiML, tags are name-value pairs that can be attached to datapoints. They help you select and manage your metadata.
Tags are created and updated with a dictionary named tags in the datapoint object. This ensures that each tag name (tags.name) is unique for a given datapoint (datapoints.id):
from dioptra.lake.utils import upload_to_lake

upload_to_lake({
    "id": ...,
    "tags": {
        # Add or update a tag with name "foo"
        "foo": "bar",
        # Set to None to delete the tag with name "baz"
        "baz": None,
        ...
    },
    "predictions": [...]
})
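Because tag names are unique per datapoint, you can also update the tags of an existing datapoint by re-uploading a record with its id. Here is a minimal sketch assuming you already know the datapoint's id and that the lake accepts a partial record containing only the id and tags (the id and tag values below are placeholders):
from dioptra.lake.utils import upload_to_lake

# Placeholder id of a datapoint that already exists in the lake.
existing_datapoint_id = 'YOUR_DATAPOINT_ID'

upload_to_lake(records=[{
    'id': existing_datapoint_id,
    'tags': {
        # Add or overwrite the "dataset" tag.
        'dataset': 'validation',
        # Remove the "source" tag.
        'source': None
    }
}])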
Tags structure
Tags are a child table of datapoints. As such, you can retrieve them with the fields argument and use them to filter datapoints with the filters argument of select_datapoints.
datapoints_dataframe = select_datapoints(filters=[...], fields=[...])
tags.name: The name of the tag. Unique for a given datapoint.
tags.value: The value of the tag.
tags.datapoint: The id of the datapoint this tag is attached to.
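For instance, a minimal sketch of a query that returns each datapoint's id together with its tags, keeping only datapoints that carry a tag named dataset (the field list and filter value are just examples):
from dioptra.lake.utils import select_datapoints

# Fetch ids and all tag columns for datapoints that have a "dataset" tag.
datapoints_dataframe = select_datapoints(
    fields=['id', 'tags.*'],
    filters=[{
        'left': 'tags.name',
        'op': '=',
        'right': 'dataset'
    }])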
Tags Usage
Tags can be used anywhere you use datapoint filters. For example, the following filters will select all datapoints with the tags source: stanford_dogs AND dataset: train.
[{
    "left": "tags.name",
    "op": "=",
    "right": "source"
}, {
    "left": "tags.value",
    "op": "=",
    "right": "stanford_dogs"
}, {
    "left": "tags.name",
    "op": "=",
    "right": "dataset"
}, {
    "left": "tags.value",
    "op": "=",
    "right": "train"
}]
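Passed to select_datapoints, these filters could look like the following sketch (the requested fields are just an example):
from dioptra.lake.utils import select_datapoints

# Datapoints tagged source: stanford_dogs AND dataset: train.
train_dogs = select_datapoints(
    fields=['id', 'tags.*', 'metadata.uri'],
    filters=[{
        'left': 'tags.name',
        'op': '=',
        'right': 'source'
    }, {
        'left': 'tags.value',
        'op': '=',
        'right': 'stanford_dogs'
    }, {
        'left': 'tags.name',
        'op': '=',
        'right': 'dataset'
    }, {
        'left': 'tags.value',
        'op': '=',
        'right': 'train'
    }])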
We'll illustrate the usage of tags with the following code:
import os
import pandas as pd
# os.environ['DIOPTRA_API_KEY'] = 'DIOPTRA_API_KEY'
from dioptra.lake.utils import upload_to_lake, wait_for_upload, delete_datapoints, select_datapoints
# Upload your metadata.
upload_id = upload_to_lake(records=[{
    'metadata': {
        'uri': 'https://dog.jpg'
    },
    'type': 'IMAGE',
    'groundtruth': {
        'task_type': 'CLASSIFICATION',
        'class_name': 'chihuahua'
    },
    'tags': {
        'source': 'stanford_dogs',
        'dataset': 'train'
    }
}, {
    'metadata': {
        'uri': 'https://cat.jpg'
    },
    'type': 'IMAGE',
    'groundtruth': {
        'task_type': 'CLASSIFICATION',
        'class_name': 'bengal'
    },
    'tags': {
        'source': 'stanford_cats',
        'dataset': 'test'
    }
}])
upload = wait_for_upload(upload_id)
# Select datapoints with a filter allowing multiple tag values.
datapoints = select_datapoints(fields=[
    'id', 'tags.*', 'metadata.uri', 'groundtruths.class_name'
], filters=[{
    'left': 'tags.name',
    'op': '=',
    'right': 'source'
}, {
    'left': 'tags.value',
    'op': 'in',
    'right': ['stanford_dogs', 'stanford_cats']
}])
# Retrieve a list of all tags on the selected datapoints.
tags_df = datapoints.explode('tags')['tags'].apply(pd.Series)
print(tags_df)
# Group tags by name and value so we can filter on those.
grouped_tags = tags_df.groupby(['name', 'value'])[['datapoint']].agg(list)
print(grouped_tags)
# Get the datapoints that are tagged as cats in the datapoints frame.
datapoint_ids_tagged_cats = grouped_tags.loc['source', 'stanford_cats']['datapoint']
print(datapoints[datapoints['id'].isin(datapoint_ids_tagged_cats)])
Retrieving the list of tags
Assuming you went through the Quick Start and Ingestion Basics, let's review the following line:
tags_df = datapoints.explode('tags')['tags'].apply(pd.Series)
The dataframe returned by select_datapoints contains one row per datapoint and a column named tags corresponding to the requested child table tags.*, i.e. the tags attached to the datapoints we are selecting. We explode the datapoints dataframe along the tags column to get a flat list of tags, then turn each tag dictionary into a row of columns with .apply(pd.Series).
The terminal prints a dataframe of tags, something like this:
id name value datapoint organization_id
0 1ae13d87-abcf-475f-87f0-78381960499f source stanford_dogs 55a9f19b-8723-48d0-a235-cddaf50e4f38 63ee72748d1ad3fb82cec9ab
0 4de84517-f53f-4182-8015-7e84c9ebc350 dataset train 55a9f19b-8723-48d0-a235-cddaf50e4f38 63ee72748d1ad3fb82cec9ab
1 6fafb890-30aa-41e1-b92c-897caf509e9f dataset test bf80ca4d-a8de-4f2e-8970-3b5beb4958ae 63ee72748d1ad3fb82cec9ab
1 b070dedc-6bf0-4d74-9519-ddc84f84e5b9 source stanford_cats bf80ca4d-a8de-4f2e-8970-3b5beb4958ae 63ee72748d1ad3fb82cec9ab
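If you want to see what explode followed by .apply(pd.Series) does in isolation, here is a small self-contained pandas sketch with made-up ids and tag dictionaries mimicking the shape of the tags column:
import pandas as pd

# Two datapoints, each carrying a list of tag dictionaries.
toy_datapoints = pd.DataFrame({
    'id': ['dp-1', 'dp-2'],
    'tags': [
        [{'name': 'source', 'value': 'stanford_dogs', 'datapoint': 'dp-1'},
         {'name': 'dataset', 'value': 'train', 'datapoint': 'dp-1'}],
        [{'name': 'source', 'value': 'stanford_cats', 'datapoint': 'dp-2'}]
    ]
})

# explode yields one row per tag dictionary; apply(pd.Series) turns each
# dictionary into name/value/datapoint columns.
toy_tags_df = toy_datapoints.explode('tags')['tags'].apply(pd.Series)
print(toy_tags_df)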
Grouping Tags by Name and Value
Next we'll group the tags by name and value so we can select groups of datapoints.
grouped_tags = tags_df.groupby(['name', 'value'])[['datapoint']].agg(list)
Here we use the pandas grouping operators to aggregate the datapoint column into a list of datapoint ids for each unique tag name and value.
datapoint
name value
dataset test [bf80ca4d-a8de-4f2e-8970-3b5beb4958ae]
train [55a9f19b-8723-48d0-a235-cddaf50e4f38]
source stanford_cats [bf80ca4d-a8de-4f2e-8970-3b5beb4958ae]
stanford_dogs [55a9f19b-8723-48d0-a235-cddaf50e4f38]
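The same grouping is handy for quick summaries. For example, a sketch that counts how many distinct datapoints carry each tag:
# Number of distinct datapoints per (tag name, tag value) pair.
tag_counts = tags_df.groupby(['name', 'value'])['datapoint'].nunique()
print(tag_counts)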
Selecting Datapoints Based on Tag Values
Finally, we use the grouped index to look up the datapoint ids carrying the tag source: stanford_cats and select the matching rows of the datapoints dataframe.
# Get the datapoints that are tagged as cats in the datapoints frame.
datapoint_ids_tagged_cats = grouped_tags.loc['source', 'stanford_cats']['datapoint']
print(datapoints[datapoints['id'].isin(datapoint_ids_tagged_cats)])
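The same lookup works for any tag name and value; for example, a sketch that selects the datapoints tagged dataset: train instead:
# Get the datapoints that are tagged as the train split.
datapoint_ids_tagged_train = grouped_tags.loc['dataset', 'train']['datapoint']
print(datapoints[datapoints['id'].isin(datapoint_ids_tagged_train)])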