Mining the Web: Discovering Knowledge from Hypertext Data

Surithong Srisa‐ard (Mahasarakham University, Thailand)

Online Information Review

ISSN: 1468-4527

Article publication date: 1 August 2003

Citation

Srisa‐ard, S. (2003), "Mining the Web: Discovering Knowledge from Hypertext Data", Online Information Review, Vol. 27 No. 4, pp. 291-291. https://doi.org/10.1108/14684520310489113

Publisher: Emerald Group Publishing Limited

Copyright © 2003, MCB UP Limited


The Web has become the largest storehouse of knowledge and a home for widely distributed, dynamic hypertext information. Web mining is an important technique for uncovering the interesting and valuable information and knowledge buried in billions of Web pages. Mining the Web: Discovering Knowledge from Hypertext Data provides the theoretical and practical concepts needed to build innovative applications for mining the Web. It mixes scientific and statistical programming with system engineering and optimisation. The author's stated goal is to study and develop programs that connect people to the information they seek on the Web. Besides being an assistant professor in computer science and engineering, Dr Chakrabarti is a researcher, a guest editor of the IEEE TKDE special issue on mining and searching the Web, and a developer of Web mining systems. He assumes that the reader is a regular user of search engines, topic directories, and Web content in general. The book is based on his tutorials, lectures, and surveys on the topic over recent years.

The book is suitable for senior undergraduates, graduate students, researchers, and developers in the area of Web mining who have a background in elementary undergraduate statistics, algorithms, and networking. It has three parts and nine chapters. Chapter one, the introduction, outlines the sequence of material in the book: crawling and indexing; topic directories; clustering and classification; hyperlink analysis; resource discovery and vertical portals; and structured vs. unstructured data mining. Part I, Infrastructure, sets the scene. Chapter two gives basic knowledge on crawling the Web and focuses on the organisation of large‐scale crawlers, which must handle millions of servers and billions of pages. The sub‐topics in this chapter include HTML and HTTP basics; crawling basics; engineering large‐scale crawlers; and putting together a crawler. Chapter three discusses how Web search engines work and reviews classical information retrieval. Its sub‐topics include Boolean queries and the inverted index; relevance ranking; and similarity searches. Part II, Learning, includes chapters on similarity and clustering, supervised learning, and semi‐supervised learning. This part focuses on machine learning for hypertext and the art of creating programs that seek statistical relations between attributes extracted from Web documents. These relations can be used to discover topic‐based clusters of Web pages, assign a Web page to a predefined topic, or match a user’s interests to Web sites. Part III, Applications, comprises the final three chapters. Chapter seven deals with a variety of link‐based techniques for analysing social networks that enhance text‐based retrieval and ranking strategies. Chapter eight discusses the paradigms for locating desired resources in distributed hypertext. Chapter nine describes a few techniques for analysing documents at the level of tokens, their proximity, and their relationship with one another.

This book is a good choice for researchers, Web technology developers, graduate students, and other readers interested in technical, state‐of‐the‐art knowledge on data mining and Web technology who have some background in computer science and engineering.
