Applied Informetrics for Information Retrieval Research

Ian Ruthven (University of Strathclyde, Glasgow, UK)

Journal of Documentation

ISSN: 0022-0418

Article publication date: 1 December 2004

Citation

Ruthven, I. (2004), "Applied Informetrics for Information Retrieval Research", Journal of Documentation, Vol. 60 No. 6, pp. 700-703. https://doi.org/10.1108/00220410410568205

Publisher: Emerald Group Publishing Limited

Copyright © 2004, Emerald Group Publishing Limited


The field of informetrics is concerned with the regularities underlying the use and production of information, in particular studying the quantitative properties of information. Although one might expect a close collaboration between the study of information and the development of tools to access information, there is still relatively little formal connection between the fields of informetrics and information retrieval (IR). Wolfram's aim is to study the relationships between the two fields covering the existing connections and the potential for shared research and development. The book can be roughly split into three sections: a gentle introduction to the two fields, investigations on joint areas of interest and directions for future collaborations.

Wolfram begins by covering the essentials of IR systems development, including the basic retrieval models, file structures and classical evaluation techniques. He points out the major milestones in IR development, from its basis in automated library functions to Web‐based hypertexts. This is essentially a practical introduction to IR rather than a theoretical one, at this stage giving comprehensive coverage of IR research areas rather than in‐depth analyses of any individual area. Although the referencing is a little light in places, the text itself is lucid and sufficiently terse to maximise the areas of discussion.

He then outlines the main foundations of informetrics: the laws or generalisations in information processes and production such as Lotka's law, Bradford's law and Zipf's law. As Wolfram notes, these mathematical formalisations are not limited to informetrics. Bates (2002), for example, has discussed Bradford's law in relation to the growth of towns and communities. The discussion of these regularities is expanded into the robust and popular research topics of citation analyses and co‐citation analyses. Citing, the act of making explicit reference to another's work, has perhaps not always received the attention it deserves from outside the information science community. This may change: as Wolfram notes, in science at least, funding decisions are beginning to be dictated by citation measures. Certainly an anonymous colleague of mine works in a department which considers CiteSeer's (2004) impact ratings in requests for travel funding.
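
For readers unfamiliar with these laws, their common textbook statements can be written compactly. The forms and symbols below are standard formulations rather than Wolfram's own notation, and the exponents shown are typical values, not results reported in the book.

```latex
% Lotka's law: number of authors y(n) who have each produced n papers
y(n) = \frac{C}{n^{a}}, \qquad a \approx 2

% Zipf's law: frequency f(r) of the word with frequency rank r
f(r) = \frac{K}{r^{s}}, \qquad s \approx 1

% Bradford's law: numbers of journals in successive zones,
% each zone yielding roughly the same number of relevant articles
n_1 : n_2 : n_3 \approx 1 : k : k^{2}, \qquad k > 1
```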

The citation analyses described lend themselves as a basis for Web linkage studies. Ingwersen's (1998) work on Web impact factors, as discussed in section 3.3.3, demonstrates the potential practical applications to large bodies of linked information. The citation literature could provide interesting and useful adjuncts to Google's popularity metrics, bringing in issues of quality of Web pages as additional factors in relevance ranking. Finally, he points to the future direction of informetrics, in particular predicting that multivariate models, those that can cope with dependencies between variables, are necessary for dealing with the current large data sets. This is certainly the case in recent work on implicit modelling of interactive search behaviour, which indicates that simple variables are often insufficient on their own to make valid predictions (e.g. Kelly and Belkin, 2001).

In chapter four, Wolfram moves on with an interesting discussion of data sampling methods for informetrics. As in any quantitative discipline, the accuracy and representativeness of the data used in studies are crucial to the robustness of the theories developed, the analyses performed and the conclusions drawn. Web log data are covered in depth, as is appropriate given the many assumptions and decisions that must be made when handling this type of data. This chapter provides a checklist of issues that should be considered when investigating large data sets. This is useful not only for those actively researching log analysis, but also for those using the results of this research. Simple counting analyses may not supply the rigour necessary to understand properly the data presented in log files, which is why the issue of modelling is important.

As Wolfram explains, models are representations or abstractions based on observed data, e.g. the frequency of topics found in a Web log file. Although models are only a representation of the observed data, they are useful because good models allow for some predictive power (we can make predictions about the real world from small representative sets) and for explanatory power (models typically allow for some reasoning about why the data appear in a particular pattern). However, these properties rely on us choosing the right model for a particular data set: one that resembles the observed data. Models, or the choice of models to represent data, are covered at the end of the chapter. The mathematics may be forbidding for less mathematically prepared readers but the coverage is excellent. Finally, a survey of model‐fitting techniques (techniques that map the observed data distribution onto theoretical distributions) is presented. Again, the discussion is dense and surveys the issues involved in using model‐fitting techniques rather than providing a step‐by‐step guide to applying them, but the high‐level discussion is good. The middle section of the chapter covers data storage, which is of less immediate interest than the research issues which bookend this chapter.
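
To give a flavour of what mapping observed data onto a theoretical distribution involves, here is a minimal sketch of my own, not taken from the book. It assumes a Zipf‐like model f(r) = c / r^s and estimates s and c by ordinary least squares on the log‐log rank‐frequency data. The function name `fit_power_law` and the topic counts are hypothetical, and the counts are synthetic rather than drawn from any real log file.

```python
import math


def fit_power_law(frequencies):
    """Fit f(r) = c / r**s by least squares on log-log data.

    Assumes at least two strictly positive frequencies.
    Returns the estimated exponent s and constant c.
    """
    # Rank the observations from most to least frequent (rank 1 = largest).
    ranked = sorted(frequencies, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(freq) for freq in ranked]

    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Ordinary least-squares slope and intercept of log f against log r.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # log f = log c - s * log r, so s = -slope and c = exp(intercept).
    return -slope, math.exp(intercept)


# Synthetic (hypothetical) topic counts that follow f(r) = 1000 / r exactly.
counts = [1000 / r for r in range(1, 51)]
s, c = fit_power_law(counts)
print(round(s, 3), round(c, 1))
```

On this noise‐free input the fit should recover an exponent close to 1 and a constant close to 1000. With real data a log‐log regression of this kind is only a rough first check: it can give biased estimates, and maximum likelihood methods together with goodness‐of‐fit tests are generally preferred.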

The discussion on informetrics and IR is pulled together in chapters five and six, with chapter five covering systems research. This chapter presents the informetrics that underpin IR systems research on issues such as term indexing and term co‐occurrence. The discussion on co‐occurrence highlights, albeit implicitly, that distributions within data do not necessarily reflect useful distributions. However, it is clear that informetrics is not just a tool for analysis; it can also be used for practical applications such as predicting the size of index files for document collections or deciding on the indexing exhaustivity level.

A large section of chapter five is devoted to a presentation of regularities and potential regularities in document attributes, with specific attention to Web documents and indexing. Examples include distributions that measure the number of pages per Web site or the persistence of documents on the Web. As Web structure and behaviour are still relatively poorly understood, there are many open questions about what types of analysis are appropriate and how Web data differ from other, more deeply examined, document types.

Chapter six turns to user studies, with the main discussion based on the recent stream of research on Web log analysis. The sheer size of the data available for investigation makes this area extremely open to quantitative informetric studies, and most studies have examined some aspect of the querying process. Wolfram surveys research into areas such as query length, query term co‐occurrence and querying frequency within search sessions. This chapter is a little disappointing in that, although methodological questions are presented, the results are often simply surveyed rather than the studies being examined in detail to understand the bases of the research and its long‐term reliability. The caveat attached to many transaction log analyses, that we can measure the products of the user's query behaviour (the queries themselves) but not why this behaviour occurred, is well made and necessary. Research into searching and browsing, as opposed to just queries, is covered in less depth but seems better supported by informetric studies. Arguably the area of modelling interaction is of more long‐term value for researchers than simply modelling the queries themselves, even if modelling interaction raises considerable challenges.

In chapter seven, Wolfram turns from the analytical nature of informetrics to practical applications. As noted above, the modelling side of informetrics has the benefit of prediction: based on statistical analyses of data sets we can try to make predictions about the whole data set. This has implications for, among other things, simulating IR performance, estimating space requirements for storage or system usage. Interactive aspects such as query expansion through co‐occurrence and relevance ranking are also covered, but the concentration is mostly on indexing terms rather than higher‐level concepts. The writing is heavy on the mathematical issues but worth a slow read, especially for those interested in working with large data sets. Chapter eight finishes with ideas for future research directions, covering some of the more fruitful areas for collaboration between informetrics and IR.

The book itself is interesting, confident and has extremely good coverage. Naturally, as befits a quantitative discipline, there is less discussion on qualitative issues such as context or semantics. Quantitative analyses, as Wolfram observes, uncover patterns, not why those patterns arose in the first place. However, the call for more rigorous tools for uncovering these patterns, which can then stimulate more conceptual or qualitative work, is well made. If I have one criticism it is that the book often simply surveys approaches and previous research but does not evaluate them. I appreciate the point of the book is to demonstrate the application of informetrics, but it would be useful to hear an informed opinion on the appropriateness or value of previous research. The mathematics is perhaps a bit weighty for a textbook, unless for a very statistically literate class. However, for instructors who want a useful roadmap to the issues involved in informetric research, and its relation to IR, it would be a useful purchase.

References

Bates, M.J. (2002), “Speculations on browsing, directed searching, and linking in relation to the Bradford distribution”, in Bruce, H., Fidel, R., Ingwersen, P. and Vakkari, P. (Eds), Emerging Frameworks and Methods: Proceedings of the 4th International Conference on Conceptions of Library and Information Science (CoLIS 4), Libraries Unlimited, Greenwood Village, CO, pp. 137-50.

CiteSeer (2004), available at: http://citeseer.nj.nec.com/impact.html (accessed 13 August).

Ingwersen, P. (1998), “The calculation of Web impact factors”, Journal of Documentation, Vol. 54 No. 2, pp. 236-43.

Kelly, D. and Belkin, N.J. (2001), “Reading time, scrolling and interaction: exploring implicit sources of user preference for relevance feedback”, Proceedings of the 24th Annual International ACM Conference on Research and Development in Information Retrieval, New Orleans, LA, pp. 408-9.
