Active Cleaning of Label Noise

Journal contribution
Posted on 2022-05-25, authored by Dmitry B. Goldgof, Kurt Kramer, Lawrence O. Hall, Matthew Shreve, Rajmadhan Ekambaram, Rangachar Kasturi, Sergiy Fefilatyev
Mislabeled examples in the training data can severely degrade the performance of supervised classifiers. In this paper, we present an approach that removes mislabeled examples from a dataset by selecting suspicious examples as targets for inspection. We show that the large-margin and soft-margin principles used in support vector machines (SVMs) tend to capture mislabeled examples as support vectors. Experimental results on two character recognition datasets show that one-class and two-class SVMs capture around 85% and 99% of the label-noise examples, respectively, as support vectors. We further propose a method that iteratively builds two-class SVM classifiers on the non-support-vector examples from the training data, after which an expert manually verifies the support vectors, ordered by their classification scores, to identify any mislabeled examples. Experimental results on four datasets show that this method reduces the number of examples that must be reviewed and that it is largely insensitive to parameter choices. Thus, by (re-)examining the labels of the selected support vectors, most label noise can be removed, which is quite advantageous when rapidly building a labeled dataset.
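The iterative procedure outlined in the abstract can be illustrated with a short script. The following is a minimal sketch assuming scikit-learn's SVC and NumPy arrays; the function name screen_label_noise, the linear kernel, the iteration cap, and the C value are illustrative assumptions, not the authors' implementation.

    # Hedged sketch of the iterative SVM-based label-noise screening
    # described in the abstract; not the authors' code.
    import numpy as np
    from sklearn.svm import SVC

    def screen_label_noise(X, y, max_rounds=5, C=1.0):
        """Repeatedly train a soft-margin SVM, flag its support vectors
        as suspects for manual label review, and retrain on the rest."""
        remaining = np.arange(len(y))  # examples not yet flagged
        review_queue = []
        for _ in range(max_rounds):
            # A two-class SVM needs both classes present to train.
            if len(np.unique(y[remaining])) < 2:
                break
            clf = SVC(kernel="linear", C=C)
            clf.fit(X[remaining], y[remaining])
            # clf.support_ indexes into the current training subset;
            # map those positions back to indices in the full dataset.
            flagged = remaining[clf.support_]
            # Order suspects by |decision value| so the lowest-margin
            # (least confidently classified) examples are reviewed first.
            order = np.argsort(np.abs(clf.decision_function(X[flagged])))
            review_queue.append(flagged[order])
            # The next round trains only on non-support-vector examples.
            remaining = np.setdiff1d(remaining, flagged)
        return review_queue

Each entry of review_queue is one round's batch of suspect example indices; an expert would inspect them in order, correcting or removing confirmed label errors before any final retraining.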

History

Date Modified: 2022-05-25

Notre Dame Institute for Advanced Study
