Matching local data with Kati ML IDs

One common usage pattern in KatiML is matching local data to the data stored in the lake.

In KatiML, datapoints are identified by a unique id. This id is returned by query calls.

Matching with metadata URI

One way to match KatiML ids with local ids is via the uri.

Let's take the following example:

from dioptra.lake.utils import select_datapoints

df_1 = select_datapoints(
    filters=[{
        'left': 'tags.value',
        'op': '=',
        'right': 'stanford_dogs'}])
df_1

In this example, the query returned two datapoints. Their uri is stored in the metadata column as a JSON field. So what if you have another dataframe containing a uri and a groundtruth, and need to match it with the id from the lake?

To do this, you'd create a uri column in both dataframes, join on it, and write the output as JSON.

import pandas as pd
import json
from dioptra.lake.utils import select_datapoints

datapoints_df = select_datapoints([])
with open('my_file_with_new_data.json', 'r') as f:
    my_data = json.load(f)

my_data_df = pd.DataFrame(my_data)

# Extract the uri from the metadata JSON field in both dataframes
my_data_df['uri'] = my_data_df['metadata'].apply(lambda x: x['uri'])
datapoints_df['uri'] = datapoints_df['metadata'].apply(lambda x: x['uri'])

# Join on uri (lsuffix avoids a collision on the shared metadata column)
df_new = datapoints_df.set_index('uri').join(
    my_data_df.set_index('uri'), lsuffix='_lake')

# Serialize each matched row (lake id + groundtruths) to JSON
results = []
for _, row in df_new[['id', 'groundtruths']].iterrows():
    results.append(row.to_json())
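To see the join pattern end to end without access to a lake, here is a runnable sketch in which two toy dataframes stand in for the select_datapoints output and your local file. All uris, ids, and groundtruth values below are made up for illustration.

```python
import json
import pandas as pd

# Toy stand-in for the lake query result (normally from select_datapoints)
datapoints_df = pd.DataFrame({
    'id': ['lake-1', 'lake-2'],
    'metadata': [{'uri': 's3://bucket/img_1.png'},
                 {'uri': 's3://bucket/img_2.png'}],
})

# Toy stand-in for your local data with groundtruths
my_data_df = pd.DataFrame({
    'metadata': [{'uri': 's3://bucket/img_1.png'},
                 {'uri': 's3://bucket/img_2.png'}],
    'groundtruths': ['dog', 'cat'],
})

# Same pattern as above: extract the uri, then join on it
datapoints_df['uri'] = datapoints_df['metadata'].apply(lambda x: x['uri'])
my_data_df['uri'] = my_data_df['metadata'].apply(lambda x: x['uri'])
df_new = datapoints_df.set_index('uri')[['id']].join(
    my_data_df.set_index('uri')[['groundtruths']])

# Each entry now pairs a lake id with your local groundtruth
results = [row.to_json() for _, row in df_new[['id', 'groundtruths']].iterrows()]
```

Selecting only the needed columns before the join sidesteps the column-name collision on metadata that would otherwise require a suffix.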

Matching with tags

Another way to match local ids is to use tags.

You can tag a datapoint with a custom tag that represents your local id, for example datapoint_id.

from dioptra.lake.utils import select_datapoints
my_df = select_datapoints([], fields=['id', 'tags.*'])

To connect your tags.datapoint_id with the KatiML id, first explode the tags column, then select the tags whose name is datapoint_id. In the result, the value column holds your local datapoint_id and the datapoint column holds the KatiML datapoint id.

import pandas as pd
exploded_df = my_df.explode('tags')['tags'].apply(pd.Series)
mapping = exploded_df[exploded_df['name'] == 'datapoint_id'][['value', 'datapoint']]
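A runnable sketch of the same explode-and-filter step, using a hypothetical tags payload in place of the select_datapoints result (the tag values and ids are invented), and finishing with a plain dict from local id to lake id:

```python
import pandas as pd

# Toy stand-in for select_datapoints([], fields=['id', 'tags.*']):
# each row carries a list of tag dicts
my_df = pd.DataFrame({
    'id': ['lake-1', 'lake-2'],
    'tags': [
        [{'name': 'datapoint_id', 'value': 'local-a', 'datapoint': 'lake-1'}],
        [{'name': 'datapoint_id', 'value': 'local-b', 'datapoint': 'lake-2'}],
    ],
})

# Explode the list of tags into one row per tag,
# expand each tag dict into columns, keep only datapoint_id tags
exploded_df = my_df.explode('tags')['tags'].apply(pd.Series)
mapping = exploded_df[exploded_df['name'] == 'datapoint_id'][['value', 'datapoint']]

# Plain dict: local datapoint_id -> KatiML datapoint id
local_to_lake = dict(zip(mapping['value'], mapping['datapoint']))
```

From here, local_to_lake lets you look up the lake id for any of your local ids in constant time.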
