Dataset basics
In katiML, datasets are version-controlled collections of datapoints. This means that, on top of adding datapoints to a dataset or removing them from it, you can commit, check out, and diff datasets, just like in git.
NOTE: UNLIKE git, katiML version control is centralized. There are no local repositories, and all users share the same versions of the datasets. Modifying a dataset results in a shared, uncommitted version of the dataset for everyone. Similarly, checking out a version of a dataset makes that version current for everyone.
Version Control commands
Commit
Creates a new version of a dataset from its uncommitted modifications.
View
View a previous version of the dataset. A committed version is read-only and cannot be modified. To build on a previous commit, check it out, modify the dataset, and then commit a new version.
Checkout
Check out (or roll back to) a given commit and make it current. The dataset can then be modified and the changes committed to create a new version.
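The commit, view, and checkout semantics described above can be sketched with a toy in-memory model. This is an illustration only, not the katiML API: the `SharedDataset` class, its method names, and the file names are all hypothetical. The sketch also assumes that a checkout simply replaces the shared working version; how katiML treats uncommitted changes at checkout is not stated here.

```python
# Toy in-memory model of centralized dataset versioning.
# Hypothetical illustration only: this is NOT the katiML API.
from dataclasses import dataclass, field


@dataclass
class SharedDataset:
    # A single shared state: there are no local copies per user.
    current: set = field(default_factory=set)    # shared working version
    commits: list = field(default_factory=list)  # immutable committed versions

    def add(self, datapoint):
        self.current.add(datapoint)

    def remove(self, datapoint):
        self.current.discard(datapoint)

    def commit(self):
        # Freeze the working version so the commit can never be modified.
        self.commits.append(frozenset(self.current))
        return len(self.commits) - 1  # commit id (index into the history)

    def view(self, commit_id):
        # Read-only look at a past version; the current version is unchanged.
        return self.commits[commit_id]

    def checkout(self, commit_id):
        # Make a past version current (assumption: replaces the working version).
        self.current = set(self.commits[commit_id])


ds = SharedDataset()
ds.add("img_001.jpg")
v0 = ds.commit()
ds.add("img_002.jpg")
v1 = ds.commit()

ds.checkout(v0)        # roll back: the shared current version is now v0
ds.add("img_003.jpg")  # modify on top of v0
v2 = ds.commit()       # new version; v0 and v1 stay untouched
```

Because there is only one `current` set, any change or checkout is immediately what every user of the object sees, which mirrors the centralized behavior described in the note at the top of this page.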
Diff
Diff two versions of a dataset. The diff shows the datapoints that were inserted or deleted, as well as the change in the distribution of the Ground Truths and tags.
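As a rough sketch of what such a diff computes, the example below compares two versions held as plain dictionaries. The data layout (datapoint id mapped to a list of labels), the `diff_versions` function, and the sample values are assumptions made for illustration; they are not katiML's data model or API.

```python
# Minimal sketch of a content-aware dataset diff.
# Hypothetical layout: each version maps a datapoint id to its list of labels.
from collections import Counter


def diff_versions(old, new):
    """Return inserted/deleted datapoint ids and per-label count changes."""
    inserted = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())

    # Count how often each label occurs in each version.
    old_counts = Counter(label for labels in old.values() for label in labels)
    new_counts = Counter(label for labels in new.values() for label in labels)

    # Keep only the labels whose count actually changed.
    label_delta = {
        label: new_counts[label] - old_counts[label]
        for label in sorted(old_counts.keys() | new_counts.keys())
        if new_counts[label] != old_counts[label]
    }
    return {"inserted": inserted, "deleted": deleted, "label_delta": label_delta}


version_a = {"a.jpg": ["cat"], "b.jpg": ["dog"], "c.jpg": ["cat"]}
version_b = {
    "a.jpg": ["cat"],
    "c.jpg": ["cat"],
    "d.jpg": ["dog", "bird"],
    "e.jpg": ["dog"],
}

result = diff_versions(version_a, version_b)
print(result)
```

Here `result["label_delta"]` reports that "dog" and "bird" each gained one occurrence while "cat" is omitted because its count did not change.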
Difference between katiML Version Control and DVC
The main difference between katiML Version Control and DVC is the level at which version control happens.
With DVC, version control is done at the file level. DVC has no understanding of the content of a file. This means that when doing a diff, DVC can tell you which files changed, but not how the dataset changed.
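The contrast can be made concrete with a small sketch. It does not call DVC or katiML; it only compares a file-level view (a content hash, which is the kind of signal a file-based tool works from) with a content-level view of the same change. The JSON annotation format and all names are hypothetical.

```python
# File-level versus content-level view of the same change.
# Hypothetical annotation format; neither DVC nor katiML is invoked here.
import hashlib
import json

old_file = json.dumps(
    [{"id": "a.jpg", "label": "cat"}, {"id": "b.jpg", "label": "dog"}]
).encode()
new_file = json.dumps(
    [{"id": "a.jpg", "label": "cat"}, {"id": "b.jpg", "label": "cat"}]
).encode()

# File-level view: the hashes differ, so we only learn that the file changed.
file_changed = (
    hashlib.sha256(old_file).hexdigest() != hashlib.sha256(new_file).hexdigest()
)

# Content-level view: parse the records and find which datapoints were relabeled.
old_labels = {record["id"]: record["label"] for record in json.loads(old_file)}
new_labels = {record["id"]: record["label"] for record in json.loads(new_file)}
relabeled = {
    datapoint_id: (old_labels[datapoint_id], new_labels[datapoint_id])
    for datapoint_id in old_labels.keys() & new_labels.keys()
    if old_labels[datapoint_id] != new_labels[datapoint_id]
}

print(file_changed)
print(relabeled)
```

The hash comparison answers "did this file change?", while the parsed comparison answers "which datapoint changed, and from what to what?".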