Building a training dataset for classification under a cost limitation

Yen-Liang Chen (National Central University, Taoyuan, Taiwan)
Li-Chen Cheng (National Taipei University of Technology, Taipei, Taiwan)
Yi-Jun Zhang (National Central University, Taoyuan, Taiwan)

The Electronic Library

ISSN: 0264-0473

Article publication date: 24 February 2021

Issue publication date: 18 May 2021

Abstract

Purpose

A necessary preprocessing step in document classification is to label some documents so that a classifier can be built and then used to classify the remaining documents. Because documents differ in length and complexity, the cost of labeling differs from document to document. The purpose of this paper is to consider how to select a subset of documents for labeling under a limited budget, so that the total labeling cost does not exceed the budget while the resulting classifier achieves the best possible classification results.

Design/methodology/approach

In this paper, a framework is proposed for selecting the instances to be labeled; it integrates two clustering algorithms and two centroid selection methods. From the selected and labeled instances, five different classifiers were constructed, and their good classification accuracy is used to demonstrate the superiority of the selected instances.
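
To make the idea of cluster-based selection under a budget concrete, the following is a minimal sketch only, not the authors' framework. The abstract does not name the clustering algorithms or the centroid selection methods, so everything here is an assumption: plain k-means stands in for the clustering step, and the score `(1 + distance to centroid) * cost` is a hypothetical way to trade off "data representativeness" against "data selection cost". The function names `kmeans` and `select_under_budget` are illustrative.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of equal-length tuples (stand-in clustering)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)

    def nearest(p):
        # Index of the centroid closest to point p.
        return min(range(k), key=lambda c: math.dist(p, centroids[c]))

    for _ in range(iters):
        assign = [nearest(p) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    # Final assignment against the final centroids.
    return centroids, [nearest(p) for p in points]


def select_under_budget(points, costs, budget, k):
    """Pick instances to label, cluster by cluster, without exceeding the budget.

    Hypothetical score: (1 + distance to cluster centroid) * labeling cost,
    so cheap instances near a centroid are preferred. This is an assumption,
    not a rule taken from the paper.
    """
    centroids, assign = kmeans(points, k)
    ranked = {
        c: sorted(
            (i for i, a in enumerate(assign) if a == c),
            key=lambda i: (1.0 + math.dist(points[i], centroids[c])) * costs[i],
        )
        for c in range(k)
    }
    chosen, spent = [], 0.0
    progress = True
    while progress:
        progress = False
        # Round-robin over clusters so every cluster gets a chance to be represented.
        for c in range(k):
            while ranked[c]:
                i = ranked[c].pop(0)
                # An instance that does not fit now can never fit later,
                # because the amount spent only grows, so it is discarded.
                if spent + costs[i] <= budget:
                    chosen.append(i)
                    spent += costs[i]
                    progress = True
                    break
    return chosen, spent
```

Calling `select_under_budget(points, costs, budget, k)` returns the indices of the instances to send for labeling together with the total cost spent. The round-robin over clusters is one simple way to spread the budget across the data; other allocation rules would fit the same outline.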

Findings

Experimental results show that this method can establish a training data set containing the most suitable data while respecting the cost constraints. The selection accounts for both “data representativeness” and “data selection cost,” so that the training data labeled by experts can be used to build a classifier with high accuracy.

Originality/value

No previous research has considered how to establish a training set with a cost limit when each document has a distinct labeling cost. This paper is the first attempt to resolve this issue.

Acknowledgements

The authors would like to express their gratitude to the reviewers for their suggestions, which have led to a substantial improvement in the paper. This study was supported in part by the Ministry of Science and Technology of Taiwan under grant numbers MOST 105-2410-H-031-035-MY3 and MOST 108-2410-H-027-020.

Citation

Chen, Y.-L., Cheng, L.-C. and Zhang, Y.-J. (2021), "Building a training dataset for classification under a cost limitation", The Electronic Library, Vol. 39 No. 1, pp. 77-96. https://doi.org/10.1108/EL-07-2020-0209

Publisher

Emerald Publishing Limited

Copyright © 2021, Emerald Publishing Limited
