📖 Miners basics
How do you improve a model? Try new architectures? Fine-tune the hyperparameters? Recently, the industry has realized that rather than working on the model, focusing on the data is far more effective.
KatiML supports a Data-Centric approach to model improvement. We help ML teams systematically engineer their data to get the best performance out of their models.
Through Active Learning, we select the best data to add to the retraining cycle to maximize the improvement outcome while minimizing labeling and training costs.
Here is the question: given the current state of your model, how do you pick the best data to retrain on?
The answer is that it depends on which model weakness you are trying to fix. Dioptra develops techniques designed to fix specific kinds of weaknesses. They can and should be combined to maximize the breadth of model improvement at each retraining step.
When the model is confused about in-domain data, we can use uncertainty sampling to detect confusing unlabeled data. This data is going to be close to the decision boundary.
Confidence sampling is probably the most straightforward active learning technique. It consists of sampling low-confidence samples. The drawback of confidence sampling is that it only looks at the confidence of the predicted class.
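As a minimal sketch, confidence sampling can look like the following, assuming `probabilities` is a `(num_samples, num_classes)` array of softmax outputs over the unlabeled pool (the function name and `budget` parameter are illustrative, not part of the KatiML API):

```python
import numpy as np

def confidence_sampling(probabilities: np.ndarray, budget: int) -> np.ndarray:
    """Return the indices of the `budget` samples whose predicted class
    has the lowest confidence (max softmax probability)."""
    predicted_class_confidence = probabilities.max(axis=1)
    return np.argsort(predicted_class_confidence)[:budget]
```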
In cases where there are many classes, looking at the confidence of the predicted class is not enough. We need to look at the level of confidence of all classes. To do this, we compute the entropy of the confidence vector and sample for high entropy. As a reminder, entropy = 0 when there is no uncertainty, and (normalized) entropy = 1 when the uncertainty is maximal.
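A sketch of entropy sampling under the same assumptions as above, normalizing by the log of the number of classes so the score lands in [0, 1]:

```python
import numpy as np

def entropy_sampling(probabilities: np.ndarray, budget: int) -> np.ndarray:
    """Return the indices of the `budget` samples with the highest
    normalized entropy over the class confidence vector."""
    eps = 1e-12  # avoid log(0)
    entropy = -np.sum(probabilities * np.log(probabilities + eps), axis=1)
    normalized_entropy = entropy / np.log(probabilities.shape[1])  # 0 = certain, 1 = maximally uncertain
    return np.argsort(normalized_entropy)[-budget:]
```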
Another way to model uncertainty is to leverage Query By Committee. This technique consists of training several models and having them predict on the same data points so that their predictions can be compared. There are several ways to produce such a committee, one of them being to train the same model on separate data folds. This technique has the advantage of being applicable regardless of the model type and can model uncertainty on all model outputs: classes, boxes, etc.
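For classification, one common disagreement measure is vote entropy. Here is a sketch, assuming `committee_predictions` is a `(num_models, num_samples)` array of predicted class ids (box outputs would need a different disagreement measure, such as IoU between predicted boxes):

```python
import numpy as np

def qbc_vote_entropy(committee_predictions: np.ndarray, budget: int) -> np.ndarray:
    """Return the indices of the `budget` samples the committee disagrees on the most."""
    num_models, num_samples = committee_predictions.shape
    disagreement = np.zeros(num_samples)
    for i in range(num_samples):
        _, counts = np.unique(committee_predictions[:, i], return_counts=True)
        vote_fractions = counts / num_models
        disagreement[i] = -np.sum(vote_fractions * np.log(vote_fractions))  # vote entropy
    return np.argsort(disagreement)[-budget:]
```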
Similar to Query By Committee, MC Dropout models uncertainty by comparing several model outputs with each other. But while Query By Committee requires several trainings, MC Dropout generates candidate predictions by activating the dropout layers at inference time. Doing so approximates Bayesian inference, which has proven to effectively model uncertainty in neural networks.
More details here
To set up your model to perform MC Dropout, set your dropout layers in training mode while running inference and call the model several times to generate different predictions.
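A minimal sketch of that setup, assuming a PyTorch classifier (layer types other than `torch.nn.Dropout*` would need their own handling):

```python
import torch

def enable_mc_dropout(model: torch.nn.Module) -> None:
    """Keep dropout layers active while the rest of the model stays in eval mode."""
    model.eval()
    for module in model.modules():
        if isinstance(module, (torch.nn.Dropout, torch.nn.Dropout2d, torch.nn.Dropout3d)):
            module.train()

@torch.no_grad()
def mc_dropout_predictions(model, inputs, num_passes: int = 20) -> torch.Tensor:
    """Run several stochastic forward passes and stack the predictions."""
    enable_mc_dropout(model)
    outputs = [torch.softmax(model(inputs), dim=-1) for _ in range(num_passes)]
    return torch.stack(outputs)  # shape: (num_passes, batch_size, num_classes)
```

The variance across passes, or the entropy of the mean prediction, can then serve as the uncertainty score for sampling.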
To discover Out-Of-Domain (OOD) data, we need to sample data that are far from the training data and that won't be caught by uncertainty sampling.
To discover these datapoints, we leverage techniques based on embeddings and model activations.
This technique measures the distance from the training dataset in the embedding space and returns the data that are the farthest away from it.
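A sketch of this distance-based selection, assuming precomputed embedding arrays for the training set and the unlabeled pool (variable and function names are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def embedding_distance_sampling(train_embeddings, unlabeled_embeddings, budget):
    """Return the indices of the unlabeled points farthest from the training set
    in embedding space (distance to the nearest training embedding)."""
    index = NearestNeighbors(n_neighbors=1).fit(train_embeddings)
    distances, _ = index.kneighbors(unlabeled_embeddings)
    return np.argsort(distances[:, 0])[-budget:]
```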
The drawback of the embedding distance is that it doesn't account for data density: a single training datapoint can make a large region of the embedding space look in-domain.
To compensate for that, Dioptra leverages a novelty detection algorithm trained on the training set and applied to the unlabeled set to detect unlabeled data in areas of low training data density.
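The exact algorithm isn't prescribed here; as one possible sketch, scikit-learn's Local Outlier Factor in novelty mode can be fit on the training embeddings and scored on the unlabeled ones:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def novelty_sampling(train_embeddings, unlabeled_embeddings, budget):
    """Surface the unlabeled points that fall in areas of low training data density."""
    detector = LocalOutlierFactor(n_neighbors=20, novelty=True)
    detector.fit(train_embeddings)
    novelty_scores = detector.score_samples(unlabeled_embeddings)  # lower = more novel
    return np.argsort(novelty_scores)[:budget]
```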
In certain cases, edge cases can be zeroed in on by filtering down to a small subset of data that looks similar and looking for outliers in this space. We implement an outlier detection sampling technique based on Local Outlier Factor.
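A sketch of that outlier pass on a filtered subset, again using scikit-learn's Local Outlier Factor (the subset and its embeddings are assumed to be prepared upstream):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def outlier_sampling(subset_embeddings, budget):
    """Flag the points that differ most from their neighbors within a filtered,
    similar-looking subset of data."""
    detector = LocalOutlierFactor(n_neighbors=20)
    detector.fit(subset_embeddings)
    # negative_outlier_factor_: the lower the value, the stronger the outlier
    return np.argsort(detector.negative_outlier_factor_)[:budget]
```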
These techniques have proven effective at catching far OODs, but they can be biased by the quality and biases of the embedding space. We recommend experimenting with different embedding layers, bearing in mind that the lower layers are going to remain relatively stable across tasks but will be generic, while the upper layers will have greater discriminatory power but can become biased towards the task.
Another way to detect OODs is to look at the activation levels in the model while it makes a prediction. This is indicative of the amount of information the model is using to make a decision. This technique has proven to be the most effective but is less explainable and model-specific.
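As a rough illustration only, one way to inspect activation levels in a PyTorch model is to hook a chosen layer and score its output magnitude; the layer choice and the scoring function are assumptions and will vary per model:

```python
import torch

def activation_score(model, layer, inputs):
    """Capture the activations of a chosen layer during inference and use their
    magnitude as a rough proxy for how much signal the model relies on."""
    captured = {}

    def hook(module, hook_inputs, output):
        captured["activations"] = output.detach()

    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(inputs)
    handle.remove()
    # Lower activation norms can indicate inputs the model has little evidence for.
    return captured["activations"].flatten(start_dim=1).norm(dim=1)
```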
One of the most common patterns in data curation is to find a seed (an edge case, feedback from end users, etc.) and look for similar data. You can do this by using our KNN miners.
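Conceptually, a KNN miner does something like the sketch below over an embedding space (this is an illustration, not the miner API):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_mining(seed_embeddings, unlabeled_embeddings, num_results):
    """Given the embeddings of a few seed examples, return the indices of the
    unlabeled data points that are most similar to them."""
    index = NearestNeighbors(n_neighbors=num_results).fit(unlabeled_embeddings)
    _, neighbor_indices = index.kneighbors(seed_embeddings)
    return np.unique(neighbor_indices)  # de-duplicate across seeds
```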
One of the drawbacks of AL techniques is that they tend to sample similar data points because these share the same properties. To compensate for that, we can sample the output of a miner to find the most diverse data. We use the coreset algorithm for that.
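A standard way to implement coreset selection is greedy k-center over the embedding space; here is a sketch under that assumption:

```python
import numpy as np

def coreset_sampling(embeddings: np.ndarray, budget: int) -> list:
    """Greedy k-center selection: iteratively pick the point farthest from the
    points already selected, yielding a diverse subset."""
    selected = [0]  # start from an arbitrary point
    # distance from every point to its closest selected point so far
    min_distances = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(selected) < budget:
        next_index = int(np.argmax(min_distances))
        selected.append(next_index)
        new_distances = np.linalg.norm(embeddings - embeddings[next_index], axis=1)
        min_distances = np.minimum(min_distances, new_distances)
    return selected
```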
Each of the above-mentioned sampling methods has drawbacks. As such, mixed sampling methods layer several sampling techniques. For example, one can sample a diverse set of points based on gradient embeddings, which represent uncertainty.
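As a simpler illustration of layering (not the gradient-embedding variant mentioned above), one can pre-filter by entropy and then diversify the candidates with the `coreset_sampling` sketch from the previous section:

```python
import numpy as np

def mixed_sampling(probabilities, embeddings, budget, candidate_pool_size=1000):
    """Layer two techniques: pre-select the most uncertain candidates, then pick
    a diverse subset of them."""
    entropy = -np.sum(probabilities * np.log(probabilities + 1e-12), axis=1)
    candidates = np.argsort(entropy)[-candidate_pool_size:]
    # reuses coreset_sampling defined in the previous sketch
    diverse_subset = coreset_sampling(embeddings[candidates], budget)
    return candidates[diverse_subset]
```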