Identifying domain relevant user generated content through noise reduction: a test in a Chinese stock discussion forum

Xiangbin Yan (University of Science and Technology, Beijing, China)
Yumei Li (Harbin Institute of Technology, Harbin, China)
Weiguo Fan (Department of Accounting and Information Systems, Virginia Polytechnic Institute and State University, Blacksburg, Virginia, USA)

Information Discovery and Delivery

ISSN: 2398-6247

Article publication date: 20 November 2017

Abstract

Purpose

Obtaining high-quality data by removing noise from user-generated content (UGC) is the first step toward data mining and effective decision-making based on ubiquitous, unstructured social media data. This paper aims to design a framework for removing noisy data from UGC.

Design/methodology/approach

In this paper, the authors consider a classification-based framework to remove noise from unstructured UGC in social media communities. They treat messages that are not relevant to the topic of concern as noise and apply a text classification approach to remove them. They introduce a domain lexicon to help distinguish the topic of concern from noise, and they compare the performance of several classification algorithms combined with different feature selection methods.
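As an illustration of the feature selection step only, the following minimal sketch ranks terms by information gain with respect to a relevant/noise label. It uses the textbook definition of information gain over binary term presence and is not the authors' implementation. The function names, the toy messages, and the label names are all hypothetical, and the messages are assumed to be already segmented into sets of tokens (Chinese text would need a word segmentation step first).

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Reduction in label entropy obtained by knowing whether `term` occurs."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (
        (len(with_term) / n) * entropy(with_term)
        + (len(without_term) / n) * entropy(without_term)
    )
    return entropy(labels) - conditional


def top_terms(docs, labels, k):
    """Return the k terms with the highest information gain (ties broken alphabetically)."""
    vocab = set().union(*docs)
    ranked = sorted(vocab, key=lambda t: (-information_gain(docs, labels, t), t))
    return ranked[:k]


# Hypothetical toy data: each message is a set of tokens.
docs = [
    {"stock", "price", "rise"},
    {"stock", "dividend"},
    {"lunch", "weather"},
    {"weather", "football"},
]
labels = ["relevant", "relevant", "noise", "noise"]

selected = top_terms(docs, labels, 2)
```

In a full pipeline, the selected terms would define the feature vectors passed to a classifier such as a support vector machine; that step is omitted here.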

Findings

Experimental results based on a Chinese stock forum show that 84.9 per cent of all the noise data in the UGC could be removed with little loss of valuable information. The support vector machine classifier combined with information gain feature selection is the best choice for this system. Message length also affects system performance: longer messages are classified more accurately.
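The abstract does not state how the noise removal figure was computed. One plausible reading, sketched here as an assumption, is that the share of noise removed is the recall on the noise class, and the loss of valuable information is the fraction of relevant messages wrongly discarded. The function name, label names, and toy data below are hypothetical and do not reproduce the paper's results.

```python
def noise_removal_stats(true_labels, predicted_labels):
    """Compute the share of noise removed and the share of relevant messages lost.

    A message is "removed" when the classifier predicts "noise" for it.
    """
    pairs = list(zip(true_labels, predicted_labels))
    noise_total = sum(1 for y, _ in pairs if y == "noise")
    relevant_total = len(pairs) - noise_total
    noise_removed = sum(1 for y, p in pairs if y == "noise" and p == "noise")
    relevant_lost = sum(1 for y, p in pairs if y == "relevant" and p == "noise")
    return {
        "noise_removed_rate": noise_removed / noise_total if noise_total else 0.0,
        "relevant_loss_rate": relevant_lost / relevant_total if relevant_total else 0.0,
    }


# Hypothetical toy labels, for illustration only.
true_labels = ["noise", "noise", "noise", "noise", "relevant", "relevant"]
predicted_labels = ["noise", "noise", "noise", "relevant", "relevant", "noise"]

stats = noise_removal_stats(true_labels, predicted_labels)
```

Grouping the messages into length bins and computing the same statistics per bin would be one way to examine the reported effect of message length.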

Originality/value

The proposed method could be used as a preprocessing step in text mining and in new knowledge discovery from big data.

Citation

Yan, X., Li, Y. and Fan, W. (2017), "Identifying domain relevant user generated content through noise reduction: a test in a Chinese stock discussion forum", Information Discovery and Delivery, Vol. 45 No. 4, pp. 181-193. https://doi.org/10.1108/IDD-04-2017-0043

Publisher

Emerald Publishing Limited

Copyright © 2017, Emerald Publishing Limited
