Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW

Michael Heine (Newcastle upon Tyne, UK)

Journal of Documentation

ISSN: 0022-0418

Article publication date: 1 February 2002

Citation

Heine, M. (2002), "Finding Out About: A Cognitive Perspective on Search Engine Technology and the WWW", Journal of Documentation, Vol. 58 No. 1, pp. 125-128. https://doi.org/10.1108/jd.2002.58.1.125.10

Publisher

Emerald Group Publishing Limited

Copyright © 2002, MCB UP Limited


This is a fascinating work – the author describes it as a “textbook” but it is more than that – which will be of strong interest to those involved in classical or post‐Web IS&R and, in its electronic version, to those interested in the use of IT to support education. Many readers will be in both categories of course, as on the one hand educators put increasing weight on independent learning (and hence on retrieval as a process supportive of that) and on the other hand IS&R specialists become increasingly aware of the restraints on the growth of their subject imposed by its historic hard science approaches and recognise the need to embed it more within the learning of the individual.

The CD‐ROM version offers hypertext links to supplementary notes (what in the olden days we called footnotes and sidebars) and on‐disk source‐references synchronous with the printed text, plus links to off‐disk Web pages synchronous with the Web; software and sample document test collections (e.g. a word‐stemmed and stopword‐stripped excerpt from Encyclopaedia Britannica, and one comprising citations and abstracts from the UMI “Artificial Intelligence Thesis” corpus); and the complete content of another book! The last is van Rijsbergen’s (1979) well‐known monograph, re‐expressed in a sophisticated hypertext format (as on its Web page), with internal term‐term links ranked by association weight, and with source references scannable in a collateral window. Updating support is available (http://www.cse.ucsd.edu/~rik/FOA). Such additional material and functionality make the work usefully supportive of both “studio” style learning and “transmission” style learning, to borrow Andriessen and Sandberg’s (1999) terms.

In terms of knowledge content, and refreshingly, Belew’s book takes a broader view of IS&R than has tended to be the case in most earlier books, as signalled by his phrase “a cognitive perspective” in the subtitle. Concepts are drawn from linguistics, formal logic, linear algebra, bibliometrics, AI and philosophy on an as‐needed basis, giving interdisciplinary vigour and legitimacy to one redefinition of IS&R. The approach is analytical and dispassionate (but enthusiastic) throughout, and the overall structure is particularly well thought out. At the same time one wonders, without detracting from the merit of seeking an interdisciplinary approach per se, whether the perspective offered on IS&R really is that of “cognitive science” rather than that of computer science with a more open door than usual. The overall style is reductionist rather than “top‐down holistic” as in the social sciences, say. (Readers interested in such tensions within a broader context may wish to note a recent, stimulating NYAS volume (Damasio et al., 2001)). One would like to hear the opinion of psychologists. The wider view offered is still very defensible nevertheless (and a purist would perhaps argue that paradigm‐free thought is impossible), but the point one seems to come back to, time and again, is that IS&R empiricism, with its affiliations to laboratory‐based, mathematics‐centred “hard science”, has for too long been unwilling to acknowledge the problems that non‐replicability and uniqueness of human information needs present, with their individual‐, time‐ and context‐specificity. This book may not quite have moved into that territory, but it has branched out. Also, since the book deals with document database searching generally, the restriction “and the WWW” in the title seems unduly modest. 
The theoretical approaches of IS&R are in principle as applicable to the proverbial card file in a shoe box, with pencilled terms on its constituents, as to a set of HTML files spinning on a disk, with embedded Dublin Core meta‐tags.

Readers, and especially tutors who might be considering this as a course textbook, will need to be aware that, quite apart from matters of disciplinary boundary and perspective, the book is not fully self‐contained regarding its use of theoretical tools. This is no reflection on Belew’s style which is generally lucid and user‐friendly. (Colloquialisms are used to good effect, e.g. “IR’s typical bag of words approach which aggressively ignores any ordering effects” (p. 213), as is computer Zen: “sometimes we care about noise words” (p. 47), and smilies are used for irony.) But he takes few prisoners when presenting formal theory. For example, where the discussion is of the dimensional reduction of matrices as an adjunct to the specification of storage or search mechanisms the reader is expected to be familiar with same. Some knowledge of theoretical statistics is also expected (word‐occurrence within a text as a Poisson process, Bayesian ideas applied to the training of AI document classifiers) and also calculus (gradient search procedures in AI systems). The notation of discrete mathematics is used throughout (if sometimes verging on overkill). Some might say that this is what an IS&R text should be, but many students studying conventional information‐management courses will need to take preliminary courses to use the book to best effect. Computer skills are also assumed – the reader is expected to download, install and use Unix software, for example. In truth, the person who will find fewest problems – and most enjoyment – in using the book will be the student nearing completion of a modern computer science course, the person who talks confidently about “neural nets”, “class ancestors” and “balanced trees” over his or her morning coffee. Those deprived of such a background and still hoping to live IS&R life to the full will need a sympathetic tutor. 
However, a contrary and equally legitimate “take” on the issue so raised, and one in defence of such ignorance, is that retrieval problems of equal importance lie not at the level of the algorithm and formal mathematical model, but at the so‐called “softer” levels of system specification, situation, evaluation and choice. One needs excellent algorithm designers, naturally, and here the book functions at its best, but one also needs people who can, interactively, guide the information prospector to and through the most appropriate databases and systems and negotiate system specifications with the algorithm designer on the basis of real‐world experience and missions.

Adaptive retrieval is also given the space and importance it deserves (surely the main area for further research this decade, along with interface and database learning), and maths functions and procedures are very usefully classified and compared. There is an emphasis on AI aspects. Definitions are usually handled with care. In this regard, how reassuring to see, with reference to that spectre of the Brocken “probability of relevance”, a need for definition of the “underlying event space” (p. 168) – reductionist science can still score goals when needed! But not everything is tied down completely. The definition of “specificity” (p. 79) seems confused and might have noted that the literature contains several definitions of this term. What does the proposition (albeit one presented for discussion): “all documents have equal aboutness” (p. 17) actually mean? Also, the phrase “conditionalised on the belief that a keyword is relevant” (p. 82) seems unsatisfactory. A keyword is just a keyword. It may or may not be assigned to, or native to the text of, a relevant document. It may or may not feature within a formal search expression or in the (variable) verbal descriptions that might be chosen to characterise a particular information need. But only documents, not keywords, can be “relevant”.

A final criticism (offered with hesitation in view of the book’s strengths but in the hope that a revision will be offered in due course) is that Belew seems too uncritical of the recall concept – which is curious given his loyalty to adaptive retrieval. Although he writes, in this regard: “Whether any one, ‘omniscient’ individual is capable of providing reliable data about the appropriate set of documents to be retrieved remains a foundational issue within IR” (p. 118), this is an oblique point, since even the use of several judges raises this issue. References to “consensual relevance” still miss the point. There is a difference between: (1) information searching qua exploration, i.e. searching against provisional and adapting criteria, and (2) information searching against pre‐search defined, verbalised and fixed criteria. In the real world, the searcher’s criteria of informativeness in the documents or document‐surrogates that he or she examines: (1) are non‐verbal (hence the persistent sloppiness in terminology as between “queries” and “needs”, dating from Cranfield), and/or (2) may change as exposure to new documents takes place (i.e. where the user is “learning”, they can be output driven). It is surely not sufficient to say “… in the case of the TREC corpus, evaluations [of relevance] are in fact quite stable” (p. 120) when the “rules” of TREC’s system evaluations blithely ignore real‐world contexts of need, i.e. adopt the laboratory paradigm. The title of Belew’s book also seems to beg this question. Document retrieval is certainly, but only partly, about “finding out about”; it is also about mental heurism, and what might be termed “IKIWISI [I’ll know it when I see it] relevance”, to resituate Boehm’s (2000) acronym. 
Exposure to information of unanticipated existence and value (allowing one to talk of “post search features of relevant documents”, rather than “pre‐search knowable criteria”) forms part of our everyday lives – and searchers actively seek such exposure as well as seeking documents that fill anticipated gaps in existing knowledge structures. (Ask any Web surfer or any bookshelf or OPAC browser.) We are life‐forms exercising a dynamic logic, not ROM‐driven dumb‐terminals, and as such we adapt to information, quite apart from whether or not we employ machine‐based adaptation algorithms to help us in so doing. Retrieval is not just a bran‐tub exercise. Ergo, “recall” captures only some of the picture.

In summary, and notwithstanding the above arguables, this is a very welcome addition to the book literature on IS&R, notable for its concise and very up‐to‐date summarising of theoretical knowledge, its stimulating cross‐disciplinarity, and its challenging and creative use of IT to support its readership. Many students will find it inspirational.

References

Andriessen, J. and Sandberg, J. (1999), “Where is education heading and how about AI?”, International Journal of Artificial Intelligence in Education, Vol. 10, pp. 130-50. e‐journal accessible with permission, available at: http://cbl.leeds.ac.uk/ijaied/abstracts/Vol_10/andriessen.html (visited 12 June 2001).

Boehm, B. (2000), “Requirements that handle IKIWISI, COTS, and Rapid Change”, Computer (IEEE Computer Society), Vol. 33 No. 7, July, pp. 99-102.

Damasio, A.R., Harrington, A., Kagan, J., McEwen, B.S., Moss, H. and Shaikh, R. (Eds) (2001), “Unity of knowledge: the convergence of natural and human science”, Annals of the NYAS, Vol. 935, New York Academy of Sciences, New York, NY.

van Rijsbergen, C.J. (1979), Information Retrieval, 2nd ed., Butterworths, London. Available at: www.dcs.gla.ac.uk/~iain/keith/ (visited 12 June 2001).
