
Multi-granularity hierarchical topic-based segmentation of structured, digital library resources

Zhongyi Wang (School of Information Management, Central China Normal University, Wuhan City, Hu Bei Province, China)
Jin Zhang (School of Information Studies, University of Wisconsin-Milwaukee, Milwaukee, Wisconsin, USA)
Jing Huang (Wuhan Polytechnic, Wuhan City, Hu Bei Province, China)

The Electronic Library

ISSN: 0264-0473

Article publication date: 6 February 2017


Abstract

Purpose

Current segmentation systems focus almost exclusively on linear segmentation and can only divide text into a linear sequence of segments. This suits cohesive text such as news feeds, but not coherent texts such as digital library documents, which have hierarchical structures. To move beyond linear segmentation and achieve hierarchical segmentation of a digital library’s structured resources, this paper proposed a new multi-granularity hierarchical topic-based segmentation system (MHTSS) to decide section breaks.

Design/methodology/approach

MHTSS adopts a top-down segmentation strategy to divide a structured digital library document into a document segmentation tree. Specifically, it works in three stages: document parsing, coarse segmentation based on document access structures and fine-grained segmentation based on lexical cohesion.
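The three stages can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that heading lines marked with `#` stand in for document access structures, and it substitutes a simple cosine similarity between adjacent paragraphs for the paper's lexical cohesion measure. The names `Node`, `parse`, `split_by_cohesion`, `segment` and the `threshold` parameter are hypothetical.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    level: int
    paragraphs: list = field(default_factory=list)
    children: list = field(default_factory=list)


def parse(text):
    """Stages 1-2 (assumed form): build a tree from '#'-style heading lines."""
    root = Node("ROOT", 0)
    stack = [root]
    for block in re.split(r"\n\s*\n", text.strip()):
        block = block.strip()
        m = re.match(r"(#+)\s+(.*)", block)
        if m:
            node = Node(m.group(2), len(m.group(1)))
            # Climb back up until the parent is shallower than the new heading.
            while stack[-1].level >= node.level:
                stack.pop()
            stack[-1].children.append(node)
            stack.append(node)
        elif block:
            stack[-1].paragraphs.append(block)
    return root


def bag(text):
    """Bag-of-words vector over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def split_by_cohesion(paragraphs, threshold=0.1):
    """Stage 3 (stand-in): start a new segment where adjacent paragraphs
    share too little vocabulary."""
    if not paragraphs:
        return []
    segments = [[paragraphs[0]]]
    for prev, cur in zip(paragraphs, paragraphs[1:]):
        if cosine(bag(prev), bag(cur)) < threshold:
            segments.append([cur])
        else:
            segments[-1].append(cur)
    return segments


def segment(node, threshold=0.1):
    """Return the segmentation tree as nested dictionaries."""
    return {
        "title": node.title,
        "segments": split_by_cohesion(node.paragraphs, threshold),
        "children": [segment(c, threshold) for c in node.children],
    }
```

Calling `segment(parse(text))` on a document whose headings and paragraphs are separated by blank lines returns a nested dictionary: each node holds its title, the fine-grained segments found among its own paragraphs and its child sections. The threshold value is arbitrary and would need tuning on real data.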

Findings

This paper analyzed the limitations of document segmentation methods for structured digital library resources. The authors found that document access structures and lexical cohesion techniques complement each other and, when combined, allow for better segmentation of such resources. Based on this finding, the paper proposed MHTSS. To evaluate it, MHTSS was compared with the TT and C99 algorithms on real-world digital library corpora. In this comparison, MHTSS achieved the best overall performance.

Practical implications

With MHTSS, digital library users can obtain relevant information directly as segments instead of receiving the whole document. This will improve retrieval performance and substantially reduce information overload.

Originality/value

This paper proposed MHTSS for structured digital library resources. The system combines document access structures and lexical cohesion techniques to decide section breaks. With it, end-users can access a document section by section through a document structure tree.

Acknowledgements

This study was supported by the National Social Science Foundation of China: “Research on Multi-granularity Integration Knowledge Services of Digital Library Based on Linked Data” (14CTQ003).

Citation

Wang, Z., Zhang, J. and Huang, J. (2017), "Multi-granularity hierarchical topic-based segmentation of structured, digital library resources", The Electronic Library, Vol. 35 No. 1, pp. 99-120. https://doi.org/10.1108/EL-06-2015-0108

Publisher: Emerald Publishing Limited

Copyright © 2017, Emerald Publishing Limited
