Matching local data with Kati ML IDs

A common usage pattern in Kati ML is matching local data to the corresponding data in katiML.

In katiML, each datapoint is identified by a unique id. This id is returned by a query call.

Matching with metadata URI

One way to match Kati ML ids with local ids is through the uri stored in each datapoint's metadata.

Let's take the following example:

from dioptra.lake.utils import select_datapoints

df_1 = select_datapoints(
    filters=[{
        'left': 'tags.value',
        'op': '=',
        'right': 'stanford_dogs'}])
df_1

In this example, the query returned two datapoints. Their uri is stored in the metadata column as a JSON field. So what if you have another dataframe with a uri and groundtruths, and need to match it with the id from the lake?

To do this, add a uri column to both dataframes, join them on that column, and write the output as JSON:

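import json

import pandas as pd
from dioptra.lake.utils import select_datapoints

# Fetch every datapoint from the lake (no filters).
datapoints_df = select_datapoints([])

# Load the local records; each one is expected to carry 'metadata' (with a 'uri') and 'groundtruths'.
with open('my_file_with_new_data.json', 'r') as f:
    my_data = json.load(f)

my_data_df = pd.DataFrame(my_data)

# Pull the uri out of the metadata JSON field into its own column on both sides.
my_data_df['uri'] = my_data_df['metadata'].apply(lambda x: x['uri'])
datapoints_df['uri'] = datapoints_df['metadata'].apply(lambda x: x['uri'])

# Keep only the columns needed so the two 'metadata' columns do not collide in the join.
# This is a left join: lake datapoints with no local match get null groundtruths.
df_new = datapoints_df.set_index('uri')[['id']].join(
    my_data_df.set_index('uri')[['groundtruths']])

# Serialize each matched row (katiML id + groundtruths) as a JSON string.
results = []
for _, row in df_new.iterrows():
    results.append(row.to_json())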
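
To try the uri join without querying the lake, here is a minimal sketch on two hand-built dataframes. Everything in it is an assumption made for illustration: the ids, the uris, the shape of the groundtruths entries, and the helper name match_on_uri are hypothetical and not part of the dioptra API. It uses an inner merge, so only datapoints present on both sides are kept.

```python
import json

import pandas as pd

# Hypothetical stand-in for the dataframe returned by select_datapoints.
datapoints_df = pd.DataFrame({
    'id': ['kati-1', 'kati-2'],
    'metadata': [{'uri': 's3://bucket/dog_1.jpg'}, {'uri': 's3://bucket/dog_2.jpg'}],
})

# Hypothetical local records, keyed by the same uri but in a different order.
my_data_df = pd.DataFrame({
    'metadata': [{'uri': 's3://bucket/dog_2.jpg'}, {'uri': 's3://bucket/dog_1.jpg'}],
    'groundtruths': [[{'class_name': 'beagle'}], [{'class_name': 'poodle'}]],
})


def match_on_uri(lake_df, local_df):
    """Return one dict per lake datapoint that has a local match on uri."""
    # Extract the uri from the metadata field without mutating the inputs.
    lake = lake_df.assign(uri=lake_df['metadata'].apply(lambda m: m['uri']))
    local = local_df.assign(uri=local_df['metadata'].apply(lambda m: m['uri']))
    # Inner merge: rows without a counterpart on the other side are dropped.
    merged = lake[['id', 'uri']].merge(
        local[['uri', 'groundtruths']], on='uri', how='inner')
    return merged[['id', 'groundtruths']].to_dict(orient='records')


matched = match_on_uri(datapoints_df, my_data_df)
print(json.dumps(matched, indent=2))
```

Selecting only id, uri and groundtruths before merging avoids a clash between the two metadata columns. Switch how='inner' to how='left' if unmatched lake datapoints should stay in the result.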

Matching with tags

Another way to match to local ids is to use tags.

You can tag a datapoint with a custom tag that represents your local id, for example datapoint_id.

from dioptra.lake.utils import select_datapoints
my_df = select_datapoints([], fields=['id', 'tags.*'])

To connect your tags.datapoint_id with the katiML id, first explode the tags column, then select the tags whose name is datapoint_id. In the result, the value column holds your local datapoint_id and the datapoint column holds the katiML datapoint id.

import pandas as pd
exploded_df = my_df.explode('tags')['tags'].apply(pd.Series)
mapping = exploded_df[exploded_df['name'] == 'datapoint_id'][['value', 'datapoint']]
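
The mapping can then be used to attach katiML ids to a local table. The sketch below runs the same explode-and-filter steps on hand-built data. It assumes each entry in the tags column is a dict with name, value and datapoint keys, as the snippet above implies. The ids, the second tag, the local_df table and the kati_id column are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for select_datapoints([], fields=['id', 'tags.*']).
my_df = pd.DataFrame({
    'id': ['kati-1', 'kati-2'],
    'tags': [
        [{'name': 'datapoint_id', 'value': 'local-7', 'datapoint': 'kati-1'},
         {'name': 'data_split', 'value': 'train', 'datapoint': 'kati-1'}],
        [{'name': 'datapoint_id', 'value': 'local-9', 'datapoint': 'kati-2'}],
    ],
})

# One row per tag, then one column per tag field.
exploded_df = my_df.explode('tags')['tags'].apply(pd.Series)

# Keep only the tags that carry the local id.
id_tags = exploded_df[exploded_df['name'] == 'datapoint_id']

# Lookup table: local id -> katiML id.
local_to_kati = dict(zip(id_tags['value'], id_tags['datapoint']))

# Hypothetical local table that only knows the local ids.
local_df = pd.DataFrame({
    'datapoint_id': ['local-9', 'local-7'],
    'label': ['beagle', 'poodle'],
})
local_df['kati_id'] = local_df['datapoint_id'].map(local_to_kati)
print(local_df)
```

Local ids with no matching tag end up with a missing kati_id, which makes untagged records easy to spot.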
