Guide to the professional literature

Online Information Review

ISSN: 1468-4527

Article publication date: 1 May 2006


Citation

(2006), "Guide to the professional literature", Online Information Review, Vol. 30 No. 3. https://doi.org/10.1108/oir.2006.26430cae.001

Publisher: Emerald Group Publishing Limited

Copyright © 2006, Emerald Group Publishing Limited


Guide to the professional literature

This column is designed to alert readers to pertinent wider journal literature on digital information and research.

Evolution, continuity, and disappearance of documents on a specific topic on the web: a longitudinal study of ‘Informetrics’, Bar-Ilan, J. and Peritz, B.C., in Journal of the American Society for Information Science and Technology, Vol. 55 No. 11, 2006, www.asis.org/Publications/JASIS/vol55n11.html

Bar-Ilan and Peritz searched for web pages on Informetrics in 1998, 1999, 2002, and 2003, in order to study the growth of web literature on this topic, as well as its tendencies toward item modification, disappearance, and resurfacing. The original 1998 search was carried out on the six largest search engines, producing a union list of 886 URLs that was then examined. In 1999 the search found 1,297 URLs, which did not include all of the previous results; however, a direct search of the missing URLs located 219 relevant pages that the 1999 engine search had missed. In the 2002 and 2003 searches a new set of the top five search engines was used, and additional formats beyond html and txt began to appear: 3,746 traditional pages and 329 in other formats in 2002, and 4,389 traditional pages and 766 in other formats in 2003. Over the 5.5-year period the topic grew about sixfold, while the web grew about tenfold. Of the 5,034 pages that satisfied the search through 2002, only 3,144 were still available in 2003. Not only did nearly 40 per cent disappear, but about half of those remaining were modified.

Finding governmental statistical data on the web: a study of categorically organized links for the FedStats topics page, Ceaparu, I. and Shneiderman, B., in Journal of the American Society for Information Science and Technology, Vol. 55 No. 11, 2006, www.asis.org/Publications/JASIS/vol55n11.html

Ceaparu and Shneiderman summarize three studies of alternative organization concepts for the FedStats portal, which was designed to provide a single point of access to the statistical data of multiple federal agencies. Three questions (one broad and loosely defined, one specific, and one requiring a comparison) were run against FedStats’ original alphabetical list of links, against a categorical grouping of these links, and against a categorical listing that links to the providing agency’s site rather than directly to the information. A different group of 15 graduate students searched the three questions in each structure, participating in a think-aloud protocol and completing a post-search questionnaire on satisfaction and ease of use. Correct answers, as judged by the experimenter, stood at 15.6 per cent in the first study, 24.4 per cent in the second, and 42.2 per cent in the third. Judgement of the site as useful increased from 35 to 47 per cent and then to 69 per cent, and perception of ease of use increased from 42 to 56 per cent, and finally to 73 per cent. The design principles of universal usability, easy navigation, common language, the availability of comparative search and an advanced search facility, and granularity of data as to time and geography are seen as important for statistical data.

Recognition for a million books, Choudhury, G.S. et al., in D-Lib Magazine, Vol. 12 No. 3, 2006, http://mirrored.ukoln.ac.uk/lis-journals/dlib/dlib/dlib/march06/choudhury/03choudhury.html

As initiatives such as Google Book Search and the Open Content Alliance work to digitize millions of books, there is potential to make available vast amounts of information. To unlock this knowledge it is necessary to process the resulting digital page images to recognize important content, including both its semantic and structural aspects. Given the diversity of fonts, symbols, tables, languages and a host of other elements, flexible, modular, scalable document recognition systems are needed. Optical character recognition (OCR) software is highly successful at transcribing documents produced with modern printing processes; however, a great deal of the content in libraries, the target content for Google Book Search and the Open Content Alliance, does not fit into this category, and with many collections the limitations of commercial OCR software become apparent. Document recognition includes not only transcription of printed text but also extraction of the content implied by spatial layout: footnotes should not only be transcribed but associated with their markers within the text; tables should be identified and structured into the proper columns and rows; labels, block quotes, running headers, page numbers, and other conventional reading aids should be recognized and tagged. Document recognition continues until all information implicit in the page image itself has been captured. Since standard OCR packages cannot accommodate this diversity of content and the range of document recognition needs, Johns Hopkins University and McGill University, with partners at Tufts University and the University of Edinburgh, and with contributions from an active developer community, are developing Gamera, an open-source programming framework for building systems that extract information from digitized two-dimensional documents. The article focuses specifically on Gamera and its potential use in the massive digital libraries currently being built. A particular focus for Gamera has been the difficult document recognition challenges posed by cultural heritage materials. As a framework for the creation of structured document analysis by domain experts, Gamera combines a programming library with GUI tools for the training and interactive development of recognition systems, and one of its most important features is that it can be trained from the ground up. Gamera has been used to build preliminary systems for many different types of documents, including sheet music, medieval manuscripts, eighteenth-century census data, dissertations in mixed scripts, Navajo-language documents and lute tablature, and several applications demonstrate document recognition tasks beyond transcription.
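To give a flavour of the workflow the article describes, the following is a minimal sketch of a Gamera-style recognition pipeline (load a page image, binarise it, segment it into glyphs, classify the glyphs with a trained kNN classifier). The call names follow Gamera's documented pattern but may differ between versions, and the input and training file names are hypothetical.

```python
# Rough sketch of a Gamera-style recognition pipeline; verify call names
# against the installed Gamera version. File names are placeholders.
from gamera.core import init_gamera, load_image
from gamera import knn

init_gamera()                                  # initialise the framework first

page = load_image("page_0001.tiff")            # hypothetical scanned page
onebit = page.to_onebit()                      # binarise for document analysis

glyphs = onebit.cc_analysis()                  # segment into connected components

# Classify segments with a kNN classifier trained by a domain expert;
# "training_data.xml" is a hypothetical training database.
classifier = knn.kNNInteractive()
classifier.from_xml_filename("training_data.xml")
classifier.classify_list_automatic(glyphs)

# Each glyph now carries a class name plus its position for later structuring
for g in glyphs:
    print(g.get_main_id(), g.ul_x, g.ul_y, g.ncols, g.nrows)
```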

From Babel to knowledge: data mining large digital collections, Cohen, D.J., in D-Lib Magazine, Vol. 12 No. 3, 2006, www.dlib.org/dlib/march06/cohen/03cohen.html

Cohen discusses how to construct a search engine focused on a specific task, such as finding course syllabi, using simple technologies, access to resources such as Google’s application programming interface (API), and intelligent post-processing (a minimal sketch of such post-processing follows the list below). Cohen concludes that:

  • more emphasis needs to be placed on creating APIs for digital collections;

  • resources that are free to use in any way are more valuable than those that are gated or use-restricted; and

  • quantity may make up for a lack of quality.
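As an illustration of the lightweight post-processing Cohen describes, the sketch below filters a set of search-engine results down to likely course syllabi using simple keyword cues. The result records and cue terms are hypothetical stand-ins; any search API (Cohen used Google's, which has since changed) could supply the input.

```python
# Hypothetical post-processing step for a special-purpose "syllabus finder":
# given URLs and snippets returned by any search API, keep results that look
# like course syllabi.
from typing import Dict, List

SYLLABUS_CUES = ("syllabus", "course description", "required readings",
                 "office hours", "grading policy")

def looks_like_syllabus(result: Dict[str, str]) -> bool:
    """Count syllabus-like cues in the title and snippet of one result."""
    text = (result.get("title", "") + " " + result.get("snippet", "")).lower()
    hits = sum(cue in text for cue in SYLLABUS_CUES)
    return hits >= 2          # require two cues to reduce false positives

def filter_syllabi(results: List[Dict[str, str]]) -> List[str]:
    return [r["url"] for r in results if looks_like_syllabus(r)]

if __name__ == "__main__":
    # Stand-in for rows returned by a search API query such as "history syllabus"
    sample = [
        {"url": "http://example.edu/hist101", "title": "HIST 101 Syllabus",
         "snippet": "Required readings and grading policy for the semester."},
        {"url": "http://example.com/blog", "title": "My summer trip",
         "snippet": "Photos from the beach."},
    ]
    print(filter_syllabi(sample))   # -> ['http://example.edu/hist101']
```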

Some features of alt texts associated with images in web pages, Craven, T.C., in Information Research, Vol. 11 No. 2, 2006, http://informationr.net/ir/11-2/paper250.html

This paper extends a series on summaries of web objects, in this case the alt attribute associated with image files. Data were logged from 1,894 pages from Yahoo!’s random page service and 4,703 pages from the Google Directory; an img tag was extracted randomly from each page where present; its alt attribute, if any, was recorded; and the header for the corresponding image file was retrieved where possible. Associations were measured between image type and use of null alt values, image type and image file size, image file size and alt text length, and alt text length and number of images on the page. About 16.6 and 17.3 per cent of pages, respectively, contained no img elements. Of the 1,579 and 3,888 img tags randomly selected from the remainder, 47.7 and 49.4 per cent had alt texts, of which 26.3 and 27.5 per cent were null. Of the 1,316 and 3,384 images for which headers could be retrieved, 71.2 and 74.2 per cent were GIF, 28.1 and 20.5 per cent JPEG, and 0.8 and 0.8 per cent PNG. GIF images were more commonly assigned null alt texts than JPEG images, and GIF files tended to be smaller than JPEG files. Weak positive correlations were observed between image file size and alt text length, except for JPEG files in the Yahoo! set. Alt texts for images from pages containing more images tended to be slightly shorter. Possible explanations for the results include GIF files being more suited to decorative images and the likelihood that many images on image-rich pages are content-poor.
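The data-collection step Craven describes (pulling img tags and their alt attributes out of sampled pages) can be sketched with the Python standard library alone; the HTML string below is a stand-in for a downloaded page.

```python
# Minimal sketch: extract <img> tags and their alt attributes from an HTML page
# using only the standard library, then tally missing versus null alt texts.
from html.parser import HTMLParser

class ImgAltCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.alts = []                      # one entry per img tag; None if no alt

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.alts.append(dict(attrs).get("alt"))

html_page = """
<html><body>
  <img src="logo.gif" alt="">               <!-- null alt text -->
  <img src="photo.jpg" alt="A river delta">  <!-- descriptive alt text -->
  <img src="spacer.gif">                     <!-- no alt attribute at all -->
</body></html>
"""

parser = ImgAltCollector()
parser.feed(html_page)

with_alt = [a for a in parser.alts if a is not None]
null_alt = [a for a in with_alt if a == ""]
print(f"img tags: {len(parser.alts)}, with alt: {len(with_alt)}, null alt: {len(null_alt)}")
```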

Support for XML markup of image-based electronic editions, Dekhtyar, A. et al., in International Journal on Digital Libraries, Vol. 6 No. 1, 2006, pp. 55-69, www.springerlink.com/(spdxko45orduy355mo0fxf45)/app/home/contribution.asp?referrer=parent&backto=issue,6,9;journal,2,21;linkingpublicationresults,1:100475,1

Image-based electronic editions enable researchers to view and study, in an electronic environment, historical manuscript images intricately linked to edition, transcript, glossary and apparatus files. Building image-based electronic editions poses a two-fold challenge: humanities scholars must be able to use image and text to encode the desired features of the manuscripts successfully, while computer scientists must find mechanisms for representing markup in its association with the images, text and other auxiliary files, and for making that representation available for efficient querying. This paper describes the architecture of one such solution, which uses efficient data structures to store image-based encodings in main memory and on disk.
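The core idea, keeping markup queryable by its position on the page image, can be illustrated with a simple in-memory index from image regions to transcript elements. This is only an illustrative sketch, not the data structures the authors describe.

```python
# Illustrative sketch: associate rectangular regions of a manuscript image with
# transcript elements and answer point queries ("what markup is under this pixel?").
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Region:
    x: int                   # bounding box on the page image
    y: int
    w: int
    h: int
    element_id: str          # id of the linked transcript element
    text: str                # transcribed content

class ImageTextIndex:
    def __init__(self):
        self.regions: List[Region] = []

    def add(self, region: Region) -> None:
        self.regions.append(region)

    def at(self, px: int, py: int) -> Optional[Region]:
        """Return the region containing the given pixel, if any."""
        for r in self.regions:
            if r.x <= px < r.x + r.w and r.y <= py < r.y + r.h:
                return r
        return None

index = ImageTextIndex()
index.add(Region(120, 340, 410, 38, "line-17", "hwaet we gardena in geardagum"))
hit = index.at(200, 350)
print(hit.element_id if hit else "no markup here")   # -> line-17
```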

The (digital) library environment: ten years after, Dempsey, L., in Ariadne, Vol. 46, 2006, www.ariadne.ac.uk/issue46/dempsey/

What happened clearly in the mid-1990s was the convergence of the web with more pervasive network connectivity, and this made our sense of the network as a shared space for research and learning, work and play, a more real and apparently achievable goal. What also emerged – at least in the library and research domains – was a sense that it was a propitious time for digital libraries to move from a niche to a central role as part of the information infrastructure of this new shared space. However, the story did not quite develop this way. We have built digital libraries and distributed information systems, but they are not necessarily central. A new information infrastructure has been built, supported by technical development and new business models. The world caught up and moved on. What does this mean for the library and the digital library? We are now again at a turning point. Ten years ago we saw the convergence of the human-readable web with increased connectivity. This time, we are seeing the convergence of communicating applications and more pervasive, broadband connectivity. We are seeing the emergence of several large gravitational hubs of information infrastructure (Google, Amazon, Yahoo, iTunes), the streamlining of workflow and process integration in a web services idiom, and new social and service possibilities in a flatter network world. The world is flatter because computing and communications are more pervasive in our working and learning lives: we create, share and use digital content and services.

A model-driven method for the design and deployment of web-based document management systems, Paganelli, F. and Pettenati, M.C., in Journal of Digital Information, Vol. 6 No. 3, article 360, 2005, http://jodi.tamu.edu/Articles/v06/i03/Paganelli/

Most existing document management systems (DMSs) are designed according to a technology-driven approach rather than to standard methodologies. Related shortcomings include vendor dependence, expensive maintenance and poor interoperability. Information model-driven methodologies could help DMS designers solve these issues: information models provide a technology-independent abstract representation of an information system’s functionality, and, being based on standard formalisms, they help designers describe the managed domain and help developers understand and implement the modelled entities according to a standard methodological approach. However, while information models are commonly used by software designers in the design of information systems such as databases and digital libraries, their use in DMS design is still in its infancy. This paper contributes to this research area by proposing a method for web-based DMS design based on an information model, the document management and sharing information model (DMSM). The authors have also developed a set of tools, the DMSM Framework, that provides designers with DMS design and deployment facilities. Based on this instrumental support, the proposed method facilitates the design and fast prototyping of DMSs, addressing requirements of open standard compliance, cost-effectiveness and uniform access to heterogeneous data sources.

The next web? St Laurent, S., in XML.com, 2006, www.xml.com/pub/a/2006/03/15/next-web-xhtml2-ajax.html

St Laurent reviews popular web technologies that have not yet lived up to their promise. He briefly reviews the XML Web, the Semantic Web, XHTML and Web Services, explaining that each required substantial new infrastructure to implement and for that reason “never quite made it to mainstream web development”. By contrast he points to the success of Ajax where the parts (JavaScript and HTML) have been used successfully for some time. “After waiting for all of those promises of better tools to come” he concludes, “it seems that developers looked at the parts they had available, and chose the ones they could use today. It can be annoyingly hard work, but the results are impressive.”

Sampling the web: the development of a custom search tool for research, Snelson, C., in LIBRES: Library & Information Science Research Electronic Journal, Vol. 16 No. 1, 2006, http://libres.curtin.edu.au/libres16n1/index.htm

Research designed to study the internet is beset with challenges. One of these challenges involves obtaining samples of web pages. Methodologies used in previous studies may be categorized into random, purposeful, and purposeful random types of sampling. This paper contains an outline of these methodologies and information about the development of a custom sampling tool that may be used to obtain purposeful random samples of web page links. The custom search application called Web Sampler works through the Google Web APIs service to collect a random sample of pages from search results returned from the Google Index. Web Sampler is inexpensive to develop and may be easily customized for specialized search needs required by researchers who are investigating web page content.
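The purposeful random sampling Snelson describes amounts to collecting a pool of candidate result URLs for a topical query and then drawing a random subset. The sketch below shows that sampling step with a stand-in candidate pool; the original Web Sampler obtained its pool through the (now retired) Google Web APIs service.

```python
# Sketch of purposeful random sampling of web pages: gather candidate URLs
# returned for a topical query, then draw a reproducible random sample.
import random
from typing import List

def purposeful_random_sample(candidate_urls: List[str], k: int, seed: int = 42) -> List[str]:
    """Draw k URLs uniformly at random from a purposefully chosen candidate pool."""
    rng = random.Random(seed)            # fixed seed so the sample can be reproduced
    k = min(k, len(candidate_urls))
    return rng.sample(candidate_urls, k)

if __name__ == "__main__":
    # Stand-in for URLs returned by a search API for a query such as "informetrics"
    pool = [f"http://example.org/result/{i}" for i in range(200)]
    for url in purposeful_random_sample(pool, k=5):
        print(url)
```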

Three gathering storms that could cause collateral damage for open access, Suber, P., in SPARC Open Access Newsletter, Vol. 95, 2006, www.earlham.edu/~peters/fos/newsletter/03-02-06.htm#collateral

Suber previews three potential changes worth noting:

  • the WIPO treaty on the protection of broadcasting organizations;

  • threats to internet neutrality; and

  • fees for bulk e-mailers to circumvent major e-mail services’ spam filters.

These potential changes foreshadow deeper changes in the fundamental nature of the internet that may have significant long-term implications.

A digital library framework for biodiversity information systems, Torres, R. da S. et al., in International Journal on Digital Libraries, Vol. 6 No. 1, 2006, www.springerlink.com/(spdxko45orduy355mo0fxf45)/app/home/contribution.asp?referrer=parent&backto=issue,2,9;journal,2,21;linkingpublicationresults,1:100475,1

Biodiversity information systems (BISs) involve all kinds of heterogeneous data, including ecological and geographical features. However, available information systems offer very limited support for managing these kinds of data in an integrated fashion. Furthermore, such systems do not fully support the management of image content (e.g. photos of landscapes or living organisms), a requirement of many BIS end-users. To meet their needs, these users (e.g. biologists and environmental experts) often have to alternate between separate biodiversity and image information systems and combine the information extracted from each, which hampers the addition of new data sources as well as cooperation among scientists. The approach proposed in this paper takes advantage of advances in digital library innovations to integrate networked collections of heterogeneous data. It focuses on creating the basis for a next-generation BIS, combining new content-based image retrieval techniques with database query processing mechanisms. The paper shows how this component-based architecture supports the creation of two tailored BISs dealing with fish specimen identification using search techniques. Experimental results suggest that the new approach improves the effectiveness of the fish identification process compared with the traditional key-based method.
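A content-based image retrieval step of the general kind the authors combine with database queries can be sketched as ranking stored specimen images by the distance between simple image features; the grey-level histogram feature and the toy data below are illustrative only, not the components of the authors' architecture.

```python
# Illustrative content-based retrieval: rank stored specimen images by the
# distance between grey-level histograms (a deliberately simple feature).
from typing import Dict, List, Tuple

def histogram(pixels: List[int], bins: int = 8) -> List[float]:
    counts = [0] * bins
    for p in pixels:                         # pixel values assumed in 0..255
        counts[min(p * bins // 256, bins - 1)] += 1
    total = len(pixels) or 1
    return [c / total for c in counts]

def l1_distance(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b))

def rank_by_similarity(query_pixels: List[int],
                       collection: Dict[str, List[int]]) -> List[Tuple[str, float]]:
    q = histogram(query_pixels)
    scored = [(name, l1_distance(q, histogram(px))) for name, px in collection.items()]
    return sorted(scored, key=lambda item: item[1])

# Toy "images" as flat pixel lists; a real BIS would also filter by the
# ecological and geographical attributes stored alongside each specimen record.
collection = {"specimen_a": [10] * 50 + [200] * 50,
              "specimen_b": [120] * 100,
              "specimen_c": [15] * 60 + [190] * 40}
print(rank_by_similarity([12] * 55 + [195] * 45, collection)[0][0])  # closest match
```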

Constructing web subject gateways using Dublin Core, the resource description framework and topic maps, Tramullas, J. and Garrido, P., in Information Research, Vol. 11 No. 2, 2006, http://informationr.net/ir/11-2/paper248.html

Specialised subject gateways have become an essential tool for locating and accessing digital information resources, with the added value of organisation and prior evaluation catering for the needs of the varying communities that use them. Within the framework of a research project on the subject, a software tool has been developed that enables subject gateways to be built and managed. General guidelines for the work set out the main principles for the technical aspects of the application on the one hand, and for the treatment and management of information on the other, all integrated into a prototype model for developing software tools. The needs analysis established the conditions to be fulfilled by the application. A detailed study of the available options for handling metadata showed that the best option was to use Dublin Core, and that the metadata set should in turn be embedded in RDF tags or in tags based on XML. The project has resulted in two versions of an application called Potnia, which fulfil the requirements set out in the main principles and have been tested by users in real application environments. The tagging layout found to work best, and the one used by the authors, integrates the Dublin Core metadata set within the Topic Maps paradigm, formatted in XTM.
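The metadata choice described here, a Dublin Core element set wrapped in RDF/XML, can be sketched with the Python standard library; the resource URL and field values below are placeholders, not records from Potnia.

```python
# Minimal sketch: serialise a Dublin Core description of a gateway resource as
# RDF/XML using only the standard library. Values and the resource URL are placeholders.
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("rdf", RDF)
ET.register_namespace("dc", DC)

rdf = ET.Element(f"{{{RDF}}}RDF")
desc = ET.SubElement(rdf, f"{{{RDF}}}Description",
                     {f"{{{RDF}}}about": "http://example.org/resource/42"})
for element, value in [("title", "Sample gateway resource"),
                       ("creator", "Example Author"),
                       ("subject", "Information science"),
                       ("date", "2006-03-01")]:
    ET.SubElement(desc, f"{{{DC}}}{element}").text = value

print(ET.tostring(rdf, encoding="unicode"))   # rdf:RDF > rdf:Description > dc:* elements
```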

The electronic academic library: undergraduate research behavior in a library without books, Van Scoyoc, A.M. and Cason, C., in Portal: Libraries and the Academy, Vol. 6 No. 1, 2006, http://muse.jhu.edu/journals/portal_libraries_and_theacademy/toc/pla6.1.html

This study examines undergraduate students’ research habits in a strictly electronic library environment at a large public university. Unlike most information commons, the campus’ electronic library is not housed within a traditional library space and provides access to electronic research materials exclusively. This study finds that undergraduate students in this electronic library rely primarily on internet sites and online instruction modules (for example, Blackboard or WebCT) for their research needs rather than university-funded research sources. Additionally, academic class status has no significant impact on whether students use either the library’s OPAC or the university-funded electronic databases for their research needs. The authors discuss possible reasons for these findings, new pedagogical practices as indicated by the results, and define areas for further research.

The myths and realities of SFX in academic libraries, Wakimoto, J.C., Walker, D.S. and Dabbour, K.S., in Journal of Academic Librarianship, Vol. 32 No. 2, 2006, pp. 127-36

This article reports a three-fold study (“end-user survey, librarian focus group interviews, and sample SFX statistics and tests”) designed to answer the following questions about the use and effectiveness of an OpenURL link resolver (SFX from Ex Libris) in an academic setting:

How successful is the system in actually meeting the research needs of librarians and library users? Do undergraduate students, who have increasingly high expectations of online resources, think that SFX lives up to their expectations? Do librarians feel comfortable relying on SFX for accurate and consistent linking? Do the perceptions of librarians and library end-users reflect the reality of SFX usage?

The conclusion is that:

Ultimately, this study showed that end-user expectations were slightly higher than their actual experiences of obtaining full text. The majority of the librarians were positive, however, reporting that SFX worked most of the time. Both groups had complaints about SFX and saw areas for improvement, but they still rely heavily on it for their research.
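Behind the linking being evaluated is an OpenURL request: the source database passes a citation's metadata to the institution's resolver as key/value pairs, and SFX decides where the full text lives. A rough sketch of such a request, using standard OpenURL 1.0 (Z39.88-2004) keys but a hypothetical resolver base URL and placeholder citation values, looks like this:

```python
# Sketch of building an OpenURL (KEV-style) request to a link resolver such as
# SFX. The resolver base URL and the citation data are placeholders.
from urllib.parse import urlencode

RESOLVER_BASE = "https://resolver.example.edu/sfx"   # hypothetical institutional resolver

citation = {
    "ctx_ver": "Z39.88-2004",                        # OpenURL 1.0 context version
    "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",   # journal article metadata format
    "rft.genre": "article",
    "rft.atitle": "The myths and realities of SFX in academic libraries",
    "rft.jtitle": "Journal of Academic Librarianship",
    "rft.volume": "32",
    "rft.issue": "2",
    "rft.spage": "127",
    "rft.date": "2006",
}

openurl = f"{RESOLVER_BASE}?{urlencode(citation)}"
print(openurl)   # the resolver parses these keys and offers full-text targets
```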
