Design and implementation of crawling algorithm to collect deep web information for web archiving

Hyo-Jung Oh (Graduate School of Archives and Records Management, Chonbuk National University, Jeonju, The Republic of Korea)
Dong-Hyun Won (Center for Disaster Safety Information, Chonbuk National University, Jeonju, The Republic of Korea)
Chonghyuck Kim (Department of English Language and Literature, Chonbuk National University, Jeonju, The Republic of Korea)
Sung-Hee Park (Physical Medicine and Rehabilitation, Chonbuk National University, Jeonju, The Republic of Korea)
Yong Kim (Department of Library and Information Science, Chonbuk National University, Jeonju, The Republic of Korea)

Data Technologies and Applications

ISSN: 2514-9288

Article publication date: 19 March 2018

Issue publication date: 22 March 2018

Abstract

Purpose

The purpose of this paper is to describe the development of an algorithm for realizing web crawlers that automatically collect dynamically generated webpages from the deep web.

Design/methodology/approach

This study proposes and develops an algorithm that manages script commands as links, so that a web crawler can collect dynamically generated pages in the same way it gathers static webpages. The algorithm is tested experimentally by implementing it in a web crawler and collecting deep webpages.
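The idea of managing script commands as links can be sketched as follows. This is a minimal illustration, not the authors' implementation: the paper launches scripts through the web browser object provided by Microsoft Visual Studio, whereas here `run_script` is a hypothetical callback standing in for that launcher. The names `extract_links` and `crawl`, and the `example.org` pages, are likewise assumptions made for the example.

```python
import re
from collections import deque
from typing import Callable, Dict, List, Optional, Tuple

# A link is (kind, address, command). For ordinary URLs the command is "".
# For script links, address is the page on which the script must be run.
Link = Tuple[str, str, str]

HREF_RE = re.compile(r'href="([^"]+)"')


def extract_links(page_url: str, html: str) -> List[Link]:
    """Return ordinary URLs and script commands found in href attributes."""
    links: List[Link] = []
    for target in HREF_RE.findall(html):
        if target.startswith("javascript:"):
            links.append(("script", page_url, target[len("javascript:"):]))
        else:
            links.append(("url", target, ""))
    return links


def crawl(
    seed: str,
    fetch: Callable[[str], Optional[str]],
    run_script: Callable[[str, str], Optional[str]],
    limit: int = 100,
) -> Dict[str, str]:
    """Breadth-first crawl that queues script commands alongside URLs."""
    start: Link = ("url", seed, "")
    queue = deque([start])
    seen = {start}
    collected: Dict[str, str] = {}
    while queue and len(collected) < limit:
        kind, address, command = queue.popleft()
        if kind == "url":
            html = fetch(address)
            key = address
        else:
            # Hypothetical launcher: runs the command on the page, returns the new HTML.
            html = run_script(address, command)
            key = f"{address}#{command}"
        if html is None:
            continue  # page missing, or the launcher could not run the script
        collected[key] = html
        for link in extract_links(address, html):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return collected


# Simulated site (assumed data): search results are paged through script commands.
PAGES = {
    "http://example.org/search": '<a href="javascript:goPage(2)">2</a> '
                                 '<a href="http://example.org/about">about</a>',
    "http://example.org/about": "<p>About</p>",
}
SCRIPT_RESULTS = {
    ("http://example.org/search", "goPage(2)"):
        '<p>results page 2</p><a href="javascript:goPage(3)">3</a>',
    ("http://example.org/search", "goPage(3)"): "<p>results page 3</p>",
}

result = crawl(
    "http://example.org/search",
    PAGES.get,
    lambda page, command: SCRIPT_RESULTS.get((page, command)),
)
for key in sorted(result):
    print(key)
```

In this sketch a crawler that followed only ordinary URLs would stop at the first results page, while queuing the script commands lets it reach the later pages as well. A script that fails to run simply yields no page, which mirrors the limitation that collection depends on the launcher being able to execute the script.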

Findings

Among the findings of this study is that when a website provides its search results as script-generated pages, the ordinary crawling process collects only the first page. The proposed algorithm, however, can collect the deep webpages in this case.

Research limitations/implications

To use a script as a link, a human must first analyze the web document. This study uses the web browser object provided by Microsoft Visual Studio as a script launcher, so it cannot collect deep webpages if the web browser object cannot launch the script, or if the web document contains script errors.

Practical implications

Research results show that the deep web is estimated to hold 450 to 550 times more information than the surface web, yet its web documents are difficult to collect. The proposed algorithm helps to enable deep web collection by running scripts.

Originality/value

This study presents a new method that uses script links instead of the keywords adopted in previous approaches. With the proposed algorithm, a script link can be used like an ordinary URL. The experiment shows that the scripts on each individual website must be analyzed before they can be employed as links.

Acknowledgements

This work was supported by the Ministry of Education of the Republic of Korea and the National Research Foundation of Korea (NRF-2016S1A5B8913575). The chief of this project was Professor Kim, who has been with us for the past few years but will be remembered in our hearts for the coming countless years.

Citation

Oh, H.-J., Won, D.-H., Kim, C., Park, S.-H. and Kim, Y. (2018), "Design and implementation of crawling algorithm to collect deep web information for web archiving", Data Technologies and Applications, Vol. 52 No. 2, pp. 266-277. https://doi.org/10.1108/DTA-07-2017-0053

Publisher

Emerald Publishing Limited

Copyright © 2018, Emerald Publishing Limited
