An XML document repository: A new home for University at Buffalo, library systems

Library Hi Tech News

ISSN: 0741-9058

Article publication date: 1 June 2003


Citation

Ludwig, M. (2003), "An XML document repository: A new home for University at Buffalo, library systems", Library Hi Tech News, Vol. 20 No. 6. https://doi.org/10.1108/lhtn.2003.23920faf.003

Publisher: Emerald Group Publishing Limited

Copyright © 2003, MCB UP Limited


An XML document repository: A new home for University at Buffalo, library systems

Mark Ludwig

Introduction

After spending 16 months working on an early conversion and implementation team for a "new" library management system at a good-sized academic library system, I was exhausted and reconsidering the meaning of life, the universe, and everything. The project had started off well enough. We trekked for a year and a half to statewide meetings in the capital, met with genial colleagues from around the state, and selected what we thought was the latest and greatest. The statewide consortium provided a good funding model, a fair-and-square selection process, and a commitment from all concerned to install a common system at regional hubs across a 64-campus state. The vendor promised to build and document the system and "make it work."

When our slot came for conversion we met with the vendor and compiled specifications high and deep. Their analyst took notes and asked interested questions. We planned a nine-month conversion and targeted our next available summer. We held training sessions and the vendor's trainers did a bang-up job of teaching us things they had never seen before.

We bought and installed hardware during snowstorms, and implemented one small college that "just had to be first." We held monthly status meetings and were shuffled from one vendor project manager to another. As the conversion testing for the big university lagged, we began to see slippage, backsliding and a general malaise about our carefully crafted specifications. Their conversion programmer quit, we applied endless patches, our management lost its usual positive, co-operative patina, and so after one final time trial of nine long weeks, we put the whole ugly effort on the shelf. Next year, after the software evolves a few more "releases," we will undoubtedly grab the gorilla's throat and try it again.

So, after all of that and after consultation with other implementation sites, we all agreed that implementing a library management system just should not be this hard. After all, we can subscribe to a cosmos of fantastic full-text database products and display it all proudly on a Web site and make our fledgling mainframe-based Web catalog just one more button among the many.

In fact, since we had been successful in Web-fronting the old mainframe, I had to wonder if there was something more we could consider as we waited for our "new" vendor's LMS to mature.

Library catalog pages as documents

Most major LMS packages currently selling to large academic libraries are based on an RDBMS (relational database management system). They use Oracle or Sybase or Microsoft SQL Server to slice and dice data into fields arranged in rows and columns in a spreadsheet-like "relational model". For most application development of the past 20 years this has been generally accepted as normal and desirable. And for the past 20 years, library system vendors have been trying to pound that square peg into the proverbial round hole.

Bibliographic data is text; full-text databases are text. Why do we keep trying to stuff long, variable-length text into tables? I believe that is the fundamental architectural flaw in today's library management system packages. It is the continuation of the old host-based, terminal-based architecture of the past. Terminals had 24 lines of 80 characters. Stuff it on, we did. Today we have HTML pages and XML pages, so we no longer have to worry about fixed-length rows and columns, thanks to today's Internet browsers. It is time to re-architect our library management systems to be native, Web-oriented applications and to make use of the good graces of the Web-server/Internet-browser environment. The time for screen-scraping, façade-like interfaces is over.

Bibliographic content, full-text content, and Dublin Core-tagged digital objects are all much better handled as XML Web pages. Almost every application in the library world is better served by this model.

University at Buffalo data conversion experiments

I knew there had to be some better way to build a modern catalog. So we began a few data conversion experiments at the University at Buffalo (UB). I had known since the early days of the Internet that it would be quite easy to convert MARC records to HTML, because it was a matter of changing MARC tags to a smaller set of HTML tags. Now that disk space is very inexpensive, we were able to actually convert 2,300,000 MARC records into 2,300,000 HTML Web pages on a UNIX file system. Apache easily served up the pages with blinding speed.
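To make the idea concrete, here is a minimal sketch of such a MARC-to-HTML pass in Python, using the pymarc library. This is an illustration only, not our production converter; the input file name and the tag-to-markup mapping are hypothetical:

```python
# Minimal sketch: one HTML page per MARC record.
# pymarc is assumed for MARC parsing; our production code did not use it.
from pymarc import MARCReader

# Hypothetical mapping from a few MARC tags to HTML elements.
TAG_MAP = {"245": "h1", "100": "h2", "260": "p", "650": "p"}

with open("catalog.mrc", "rb") as marc_file:        # hypothetical input file
    for record in MARCReader(marc_file):
        rec_id = record["001"].value()              # control number names the page
        with open(f"{rec_id}.html", "w", encoding="utf-8") as page:
            page.write("<html><body>\n")
            for field in record.get_fields(*TAG_MAP):
                element = TAG_MAP[field.tag]
                text = " ".join(field.get_subfields("a", "b", "c"))
                page.write(f"<{element}>{text}</{element}>\n")
            page.write("</body></html>\n")
```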

Our next challenge involved integrating holdings information into the 2.3 million Web pages. This seemed like a job for XML. XML allows the integration of standards with local practice. It was very easy to design a catalog page that contained three sections. Our catalog page consisted of MARC bibliographic tags, called the "work" section, followed by a "holding" section for copy and call number tags, and finally a "tail" section for any additional links or content we might want to add to the page. At the time we began the XML experiments, Stanford Medical Library's XMLMARC looked like a good way to tag our bibliographic "works." We followed a "light" version of that standard and are evolving it by making it more granular as development continues. Our own standard for holdings, based on the extraction from Notis, was used for the copy and call number data (Figure 1).

Figure 1 A sample of XMLMARC-lite tagged bibliographic data
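As a minimal sketch of that three-section design, the following Python (standard-library ElementTree) builds one catalog page. Only the work/holding/tail division comes from our design; the element names and sample data inside each section are hypothetical:

```python
# Sketch of a three-section catalog page: work, holding, tail.
import xml.etree.ElementTree as ET

page = ET.Element("catalogPage")

work = ET.SubElement(page, "work")        # XMLMARC-lite bibliographic tags
ET.SubElement(work, "title").text = "The Hitchhiker's Guide to the Galaxy"
ET.SubElement(work, "name").text = "Adams, Douglas"

holding = ET.SubElement(page, "holding")  # local copy and call number data
ET.SubElement(holding, "callNumber").text = "PR6051.D3352 H5 1980"
ET.SubElement(holding, "copy").text = "Lockwood Library, copy 1"

tail = ET.SubElement(page, "tail")        # room for additional links or content
ET.SubElement(tail, "link").text = "http://libnet.buffalo.edu/"

# File named after the permanent Notis record number (hypothetical here).
ET.ElementTree(page).write("b1234567.xml", encoding="utf-8",
                           xml_declaration=True)
```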

Adventures in indexing

Various experiments with Web indexing tools were attempted, and several free products worked quite well. All had their limitations and thresholds, but with HTML as the source, keyword search was fairly simple. WebGlimpse, SWISH and ht://Dig are prominent examples of cost-free engines.

We loaded our 2.3 million records into 2.3 million XML Web pages and began testing the Inktomi Enterprise Search Engine on the Web site. This product has since been sold to Verity, Inc. The product will spider any Web site and index the pages. It is very easy to set up and test against an XML-encoded Web site. The spider builds the indexes quite slowly with large numbers of pages; it took about 15 days to index the file system on a 4-processor SPARC E4500. However, its robust support of XML-tagged elements and its simple setup are outstanding.

Final architecture – "documentbase"

Four programs accomplished the extraction and conversion to XML. On the mainframe, one program extracted US MARC records from Notis, and another extracted Notis holdings records and converted them to XML holding statements. On the Windows side we have two other programs: the first converts the extracted US MARC records into XML files, one for each bib record, named after the permanent Notis record number; the second inserts the holdings records as XML statements into these bib records, matching on Notis record numbers. An additional program is under construction as of this writing. It will insert current circulation status into the documentbase and provide some patron empowerment facilities.
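The matching step in that second Windows program can be sketched as follows. Only the match-on-Notis-record-number logic reflects our design; the directory layout, element names and recordNumber attribute are assumptions for illustration:

```python
# Sketch: insert extracted XML holding statements into the matching bib pages.
import xml.etree.ElementTree as ET
from pathlib import Path

BIB_DIR = Path("bib_pages")                    # hypothetical directory of bib XML files
holdings = ET.parse("holdings.xml").getroot()  # hypothetical extract file

for statement in holdings.findall("holdingStatement"):
    rec_id = statement.get("recordNumber")     # the Notis record number is the key
    bib_path = BIB_DIR / f"{rec_id}.xml"
    if not bib_path.exists():                  # unmatched holding: skip it
        continue
    tree = ET.parse(bib_path)
    tree.getroot().find("holding").append(statement)
    tree.write(bib_path, encoding="utf-8", xml_declaration=True)
```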

The TextML product, from Ixiasoft, was purchased because it provides three important components of infrastructure. First, it acts as a formal "repository" for documents. When you are dealing with millions of documents, it seems a bit scary to leave them out on an ordinary file system and Web server. TextML securely stows your XML content in a documentbase and offers standard ways to insert, delete and update documents. Second, TextML offers a very fast indexing engine with several different types of indexes. This allows search capability that vastly exceeds that of most library system search engines. Last, but not least, a formal API is provided so application programs, in many languages, both Web- and server-based, can manipulate the documentbase, its indexes and its result sets. This gives you the ability to design search applications in any manner deemed appropriate for your users.

Style

XML allows a separation between the physical storage of the document and the display of the document. We discovered the great advantage of keeping all bibliographic data in the document while displaying only the main "citation" information the user usually wants to see. So all bibliographic and holding elements on the pages are indexed and searchable, while the user sees a less-cluttered page crafted by a crafty committee of experienced librarians. They determined what should display, what should not display and how it should be labeled.

An XSL stylesheet was developed and provides a standard method of controlling the catalog page appearance. XSLT is well documented and is evolving into a powerful data transformation language. It offers significant hope for a routine method of customizing applications. The stylesheet is kept on a Web server and the user's browser applies the stylesheet to the catalog page data. This offers the added advantage of keeping the presentation processing where it belongs, on the user's desktop. It is a realization of the processing distribution long promised by client/server technology (Figure 2).

Figure 2 Part of typical catalog page rendered with an XSL stylesheet
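The standard mechanism behind this client-side rendering is an xml-stylesheet processing instruction at the top of each catalog page, which tells the browser where to fetch the stylesheet. A minimal sketch of stamping that instruction onto a page follows; the stylesheet URL and file layout are hypothetical:

```python
# Sketch: point each catalog page at the shared XSL stylesheet so the
# browser, not the server, performs the transformation.
STYLESHEET_PI = ('<?xml-stylesheet type="text/xsl" '
                 'href="http://libnet.buffalo.edu/catalog.xsl"?>\n')

def add_stylesheet(page_path):
    """Insert the processing instruction right after the XML declaration."""
    with open(page_path, encoding="utf-8") as f:
        lines = f.readlines()
    lines.insert(1, STYLESHEET_PI)             # assumes line 0 is <?xml ...?>
    with open(page_path, "w", encoding="utf-8") as f:
        f.writelines(lines)

add_stylesheet("b1234567.xml")
```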

Inexpensive, fast platform

Converting MARC records to XMLMARC with a Visual C program proved amazingly fast on a modern Windows server with dual 2.4 GHz processors. Extraction and conversion of the data took less than 24 hours. Indexing was fast. We generated an overall keyword index of the whole catalog page with only two stop words. We also generated three browse indexes for author, title and subject headings. Finally, we added a date index so it is possible to search ranges of publication years. TextML contains a graphical user interface for index specifications and then stores them in an XML format for future exchange with other systems. All of the indexing of 2.3 million catalog pages, approximately 6 Gbytes of XML data, took 18 hours. Building the simple keyword index took less than three hours (Figure 3).

Figure 3 XML OPAC - a TextML documentbase search application

These are revolutionary speeds considering the much longer conversion and indexing timeframes we observed with expensive library management systems on very expensive UNIX platforms. We are getting about ten times the performance on a server that cost about 10 percent as much as the UNIX server, that is, roughly a hundredfold improvement in price/performance.

Conclusions and future developments

The long-hyped application of XML technology to large library catalogs is finally becoming reality. XML and the model of a documentbase provide a completely new and upscale approach to library systems. It is finally possible to create catalogs that are natively Web-based and not purveyed through the façade of an interface.

This summer, the University at Buffalo Libraries will make the new XML catalog public alongside the traditional Web catalog. Users will have a choice of systems. We will monitor patron reaction and benchmark and compare system utilization. Users will tell us, through their work and feedback, which catalog and features they most appreciate. Continual enhancement of the public catalog will become as efficient as on-going Web development.

Additional plans include experiments for a union catalog. Searches across multiple documentbases are planned. We are looking at this infrastructure as a standard basis for many digitization projects. XML certainly has the breadth to support the fusion of descriptive text with any type of digital object. Thinking in terms of XML documents gets us out of the RDBMS box and into a world structured more like the Web and the kind of scholarly communication that academic libraries exist to facilitate.

The University at Buffalo XML Public Catalog can be accessed with Windows Internet Explorer from: http://libnet.buffalo.edu/

Mark Ludwig(uldmjl@buffalo.edu) is Library Systems Manager, University at Buffalo, State University of New York, Buffalo, New York, USA.
