What's in a Word‐list? Investigating Word Frequency and Keyword Extraction

Alenka Šauperl (Department of Library and Information Science and Book Studies, University of Ljubljana, Slovenia)

Journal of Documentation

ISSN: 0022-0418

Article publication date: 18 January 2011

Citation

Šauperl, A. (2011), "What's in a Word‐list? Investigating Word Frequency and Keyword Extraction", Journal of Documentation, Vol. 67 No. 1, pp. 202-204. https://doi.org/10.1108/00220411111105533

Publisher

Emerald Group Publishing Limited

Copyright © 2011, Emerald Group Publishing Limited


Dawn Archer edited this book to demonstrate the benefits that can be gained from research into word frequency and keyword analysis using the methods of corpus linguistics, and to demonstrate the usefulness of these methods in other academic and non‐academic fields. She achieves both goals by compiling nine papers written by herself and other researchers in corpus linguistics. They present different aspects of word list and keyword analysis clearly for the newcomer and thoroughly for the expert in the field. The first chapter, written by the editor, briefly presents all the contributions, both to help the reader select particular papers and to introduce the volume as a whole.

In the second chapter John M. Kirk discusses the strengths and weaknesses of studying word frequency. He reviews the concept of the word and the range of word classes that bear on word frequency. He explains the classes of words listed in classical works (e.g. the orthographic, phonological, morphological and lexicographical word) and adds two more: the numerical word and the discourse word. Their frequency and importance in a text can easily be overlooked if word frequency is analyzed without regard to these classes. Studying word frequency within the separate classes can be very informative, and the analysis of lower‐frequency words can be just as revealing.

David L. Hoover, in “Word frequency, statistical stylistics and authorship attribution”, presents innovations in analytic techniques. In the first analysis he demonstrates that the traditional method of authorship attribution can also be used to track stylistic changes across the works of a single author. In the second analysis he demonstrates that raising the number of keywords from 100 to 700 or 800 greatly improves results, while raising the number to a few thousand may be redundant.

In “Word frequency in context” Mark Davies stresses that word frequency analysis alone cannot provide sufficient information and that the frequency of phrases may be even more important. Because phrase frequency cannot be analyzed with the usual linguistic analysis tools, he demonstrates how relational databases that store data on word sequences can be used to study phrases.
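The idea of keeping word-sequence data in a relational database can be illustrated with a minimal sketch. It is not Davies's own design: the table name `tokens`, its columns `pos` and `word`, the helper functions and the sample sentence are all assumptions made for illustration. Each token is stored with its position, and two-word phrases are counted by joining the table to itself on adjacent positions.

```python
import sqlite3


def build_token_table(tokens):
    """Store each token with its position in an in-memory SQLite table.

    The schema (tokens: pos, word) is a hypothetical one chosen for this sketch.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tokens (pos INTEGER PRIMARY KEY, word TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO tokens (pos, word) VALUES (?, ?)", enumerate(tokens)
    )
    return conn


def bigram_counts(conn):
    """Count two-word phrases by joining each token to its right neighbour."""
    query = """
        SELECT a.word, b.word, COUNT(*) AS freq
        FROM tokens AS a
        JOIN tokens AS b ON b.pos = a.pos + 1
        GROUP BY a.word, b.word
        ORDER BY freq DESC, a.word, b.word
    """
    return conn.execute(query).fetchall()


# Invented sample text, split on whitespace for simplicity.
text = "the fox ran and the fox hid and the dog ran"
conn = build_token_table(text.split())
counts = bigram_counts(conn)
for first, second, freq in counts:
    print(first, second, freq)
```

Longer phrases could be counted in the same way by adding further self-joins on `pos + 2`, `pos + 3` and so on, which is the kind of query that ordinary word-list tools do not offer.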

Christian Kay discusses the problems that the historical development of language and regional dialects pose for keyword analysis. Older wordings or spellings, as well as regionally specific words, cannot be retrieved automatically. Specific coding would be required, which can only be supported by structured databases. Most of the texts available now are not structured and therefore offer limited options for research.

Mike Scott asks what makes a reference corpus good or bad. Comparing a test text with a reference text is one of the basic research methods in corpus linguistics, yet standards for a reference text have not been developed. He presents the results of a study in which he intentionally tried to build a “bad” reference corpus. He found that a larger corpus should generally yield better results, but that for a small test text a medium‐sized reference corpus may be sufficient. Even an entirely absurd reference text still produces a set of keywords from the test text from which aboutness can be judged. However, it seems that reference texts of a different genre may yield better results than a bigger and homogeneous reference corpus.
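The comparison of a test text with a reference corpus can be sketched as follows. The review does not say which statistic Scott uses, so the log-likelihood measure below is an assumption, chosen because it is widely used for keyness in corpus linguistics. The function names and the toy token lists are likewise invented for illustration.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness of one word.

    a, b: frequency of the word in the test text and in the reference corpus
    c, d: total number of tokens in the test text and in the reference corpus
    """
    expected_test = c * (a + b) / (c + d)
    expected_ref = d * (a + b) / (c + d)
    total = 0.0
    if a > 0:
        total += a * math.log(a / expected_test)
    if b > 0:
        total += b * math.log(b / expected_ref)
    return 2 * total


def keywords(test_tokens, ref_tokens, top=5):
    """Return the words that are unusually frequent in the test text."""
    test, ref = Counter(test_tokens), Counter(ref_tokens)
    c, d = len(test_tokens), len(ref_tokens)
    scored = []
    for word, a in test.items():
        b = ref[word]  # Counter returns 0 for words absent from the reference
        if a / c > b / d:  # keep only words over-represented in the test text
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda item: (-item[1], item[0]))[:top]


# Invented toy data; real studies use corpora of many thousands of words.
test_tokens = "hunting ban hunting fox the the".split()
ref_tokens = "the the the a of the cat".split()
result = keywords(test_tokens, ref_tokens)
for word, score in result:
    print(word, round(score, 2))
```

Swapping in a different `ref_tokens` list changes which words surface as key, which is the effect Scott investigates when he varies the size and genre of the reference corpus.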

Tony McEnery applies moral panic theory to the texts of Mary Whitehouse and demonstrates that a qualitative approach following a statistical analysis, or followed by one, yields important insight into the text that would otherwise remain hidden. The frequency of individual words may be too low to be detected with quantitative methods, but it becomes informative when the words are manually coded into particular groups of terms emerging from a theoretical framework.

Another empirical study is presented by Paul Baker. Analysing a corpus of debates on the banning of fox hunting in the UK, he discusses the strengths and weaknesses of keyword analysis. He believes that it is important to know word frequency, but that the context of those words is also important. It is equally important to understand the frequency of a particular word in ordinary language when evaluating its frequency in a specific discipline. Not only very frequent words matter; less frequent ones are important as well. On the other hand, many synonyms and other words can substitute for the one under study. These families of words that can be used to mean the same thing can only be tagged manually. Only a larger corpus can allow for such an extensive analysis, and only a larger corpus is likely to indicate the discourse of the text.

In the last chapter presenting empirical research, Dawn Archer, Jonathan Culpeper and Paul Rayson present their analysis of key domains in Shakespeare's comedies and tragedies. They clearly explain their research approach, which is greatly aided by research tools developed for English‐language corpus linguistics. This method revealed differences in the use of keywords in comedies and tragedies. While “love”, for example, is frequent in comedies, it is rare in tragedies. Instead, tragedies use a different set of words to build the atmosphere of the play. The authors again stress that statistical methods are insufficient for such research and need to be complemented by qualitative methods.

Dawn Archer also closes this interesting volume by laying out some questions and proposing that corpus linguistics be extended into other disciplines. By this she means that the written and spoken language of other disciplines (e.g., mathematics, history) should be studied. We may add that corpus linguistics could also be fruitfully used in other disciplines, e.g., information science. The book does have one limitation, however: while many of the methods presented here are applicable to other languages, many of the research tools are only useful for English.
