The ARROW project

A consortial institutional repository solution, combining open source and proprietary software

The Authors

David Groenewegen, Monash University, Clayton, Australia

Andrew Treloar, Monash University, Clayton, Australia

Abstract

Purpose – To provide an overview of the Australian Research Repositories Online to the World (ARROW) project.

Design/methodology/approach – A retrospective analysis of the first three years of the ARROW project.

Findings – Provides information about the decisions made by the ARROW project, and reviews how they turned out.

Originality/value – This paper provides a review of the first three years of the ARROW project (which was the original funding horizon) from the perspective of the project team.

Article Type:

Case study

Keyword(s):

Resource sharing; Research; Product life cycle; Digital libraries.

Journal:

OCLC Systems & Services: International digital library perspectives

Volume:

24

Number:

1

Year:

2008

pp:

30-39

Copyright ©

Emerald Group Publishing Limited

ISSN:

1065-075X

ARROW overview

Why did ARROW want a repository?

The ARROW project was envisaged at a time when institutional repositories and the software required for them were still in their infancy. In 2003 ePrints (www.eprints.org/) was essentially the only player in the field, although DSpace (www.dspace.org/) had also made its first appearance. Nevertheless, the ARROW partners recognised that there were a number of compelling reasons to look at the repository space, and to start working on options beyond the print equivalents that ePrints was concentrating on. These reasons included:

A project proposal (http://arrow.edu.au/docs/files/ARROW%20project.pdf) was written and submitted to the Australian Commonwealth Department of Education, Science and Training (DEST) (www.dest.gov.au/), under the Systemic Infrastructure Initiative (www.dest.gov.au/sectors/research_sector/programmes_funding/general_funding/research_infrastructure/systemic_infrastructure_initiative.htm) Framework for Australian Higher Education funding scheme. This was approved in 2003, with the funding covering three years until December 31, 2006.

The initial goal of the project is best expressed in this quote from the original proposal:

The ARROW project will identify and test software or solutions to support best practice institutional digital repositories comprising e-prints, digital theses and electronic publishing.

The project has met and exceeded this basic goal, producing not only a test version of the software but a working repository solution that is currently in use at a number of Australian universities, with more to come online shortly.

Who is ARROW?

The ARROW project has been managed by a consortium of Australian institutions: Monash University (lead institution), the National Library of Australia, the University of New South Wales and Swinburne University of Technology. The project currently employs the equivalent of three full time staff to manage the project on a day-to-day basis. The project partners also provide staff time and other in kind services to increase the number of staff available to work on development. These latter staff, however, are primarily responsible for the management of the repository installed at their own site.

Since the beginning of 2006, several other Australian universities have also signed up to use the ARROW solution for institutional repositories. These ARROW members are also working on development projects to develop and enhance the software.

What did the ARROW project set out to achieve?

The ARROW project had a number of specific goals that it wished to achieve. The key one was the need for a solution for storing any digital research output, regardless of the format in which it was created. For the sake of simplicity, and as a way to deal with familiar and accessible materials, the initial focus was on digital objects with print equivalents, specifically theses and journal articles. As the solutions for these areas have become clearer the project has been looking at a range of other objects. These include datasets, specifically those produced as a part of research and which might usefully be attached to the published research, as well as learning objects that might need to be organised and made available from a repository.

From the beginning of the project there was a recognition that ARROW needed to be able to deal with more than just open access materials, and that some things stored in repositories need to be restricted for a variety of important reasons, such as copyright, confidentiality or ethical considerations, or because they are work in progress. The project has therefore done considerable work on the access and authentication issues related to research outputs in digital repositories, often in partnership with the MAMS project (www.melcoe.mq.edu.au/projects/MAMS/). This work is ongoing at the time of writing.

The Australian Government has had a system of reporting research for the purpose of tracking the output of universities. At the time the project was conceived, this took the form of reporting eligible research publications. This included the retention of copies at the reporting institution for the purposes of audit. It was envisaged that a repository could be used to help manage this process and to retain the audit copies. Since then there has been a change in direction to a proposed Research Quality Framework (RQF) system (www.dest.gov.au/sectors/research_sector/policies_issues_reviews/key_issues/research_quality_framework/default.htm), which will involve the review of research outputs by experts from outside Australia. DEST have identified that repositories offer the potential for widespread access to these outputs in a less labour intensive fashion, and ARROW has been working with them on how this might be achieved.

A key requirement of the project was to employ open standards to make sure the data stored in the repository would be transferable in the future. In conjunction with this it was determined that the project would develop and deliver open source tools back to the Fedora Community. This was also a requirement of the program under which ARROW was funded.

Of critical importance was the development of an overall solution that could offer on-going technical support and development past the end of the funding period. DEST and the developers of the project were concerned that in many cases projects are not sustainable unless centrally funded, and that this would not be appropriate in this area. The project needed to find a solution that would mean the repository created would have a viable strategy for ongoing sustainability.

The end result of these decisions is a software solution combining open source and proprietary software, made up of open source repository software called Fedora with a proprietary services layer called VITAL, which has been developed by VTLS Inc. This is not a centralised or hosting solution – each ARROW partner or member has their own hardware and software. As each ARROW partner or member is licensing the VITAL software from a commercial provider, they will receive the following benefits after the project ends (assuming they continue to pay the annual license fee):

Building ARROW

ARROW requirements

ARROW wanted:

Fedora

After a careful analysis of the candidates available at the time[1], it was felt that only Fedora provided the right combination of attributes. Fedora™ can best be thought of as services-mediation infrastructure, rather than an off-the-shelf application. It can use web services technology (www.w3.org/2002/ws/) to draw on services provided by other systems as well as expose its own functionality using web services standards. Key to the Fedora™ architecture is its underlying object-based model. Fedora™ stores digital content objects, either as datastreams contained within the repository or as links to external resources. It also stores what Fedora™ calls disseminators. These are stored programs which provide ways to render these digital content objects. As an example, a Fedora™ repository might contain an image disseminator which can take a stored image object and render it on the fly into a thumbnail, a medium-resolution version or a high-resolution version as required. The software maintains bindings between content objects and their disseminators. Each object has a default disseminator (which might just provide the sequence of bits that comprise the object plus a Multi-purpose Internet Mail Extensions (MIME) type (www.ietf.org/rfc/rfc2045.txt), much like a web server). The repository might also provide alternative disseminators which allow the object to be exposed in other ways. An example of this might be a disseminator which exposes the internal structure of an Encoded Archival Description (EAD) (www.loc.gov/ead/) as a navigation mechanism. This architecture, which combines objects and disseminators, is very flexible, and provides significant advantages as a platform on which to build other applications (Lagoze et al., 2005). For more background on the reasons for selecting Fedora for the ARROW project, see Treloar (2005).
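The binding between content objects and disseminators described above can be illustrated with a small sketch. This is a toy model only, not Fedora's actual API: the class names, the `disseminate` method and the `preview_disseminator` function are all hypothetical, and a byte-truncating "preview" stands in for the thumbnail rendering an image disseminator would perform.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Tuple


@dataclass
class Datastream:
    """A stored sequence of bytes plus its MIME type (hypothetical model)."""
    mime_type: str
    content: bytes


# A disseminator is modelled as a function that renders a datastream
# into a (MIME type, bytes) pair.
Disseminator = Callable[[Datastream], Tuple[str, bytes]]


def default_disseminator(ds: Datastream) -> Tuple[str, bytes]:
    """Return the stored bits and MIME type unchanged, much like a web server."""
    return ds.mime_type, ds.content


def preview_disseminator(ds: Datastream) -> Tuple[str, bytes]:
    """Stand-in for on-the-fly rendering: return only the first 16 bytes."""
    return ds.mime_type, ds.content[:16]


@dataclass
class DigitalObject:
    """A content object bound to zero or more named alternative disseminators."""
    pid: str
    datastream: Datastream
    disseminators: Dict[str, Disseminator] = field(default_factory=dict)

    def disseminate(self, name: str = "default") -> Tuple[str, bytes]:
        # Every object has the default disseminator; others are looked up by name.
        if name == "default":
            return default_disseminator(self.datastream)
        return self.disseminators[name](self.datastream)


obj = DigitalObject(
    pid="demo:1",  # made-up identifier
    datastream=Datastream("text/plain", b"A thesis abstract stored as plain text."),
    disseminators={"preview": preview_disseminator},
)
```

Calling `obj.disseminate()` returns the stored content as is, while `obj.disseminate("preview")` returns a rendered variant of the same underlying datastream, which is the separation the paragraph describes.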

Since the beginning of the project ARROW has worked actively and closely with Fedora™ and the Fedora Community. The ARROW Project Technical Architect is a member of the Fedora Advisory Board, which provides long term guidance for the project. This commitment to Fedora is reinforced by VTLS Inc. The VTLS President is a member of the Fedora Advisory Board, and the VITAL Lead Developer is part of the Fedora Development Group.

Open source

As indicated above, the project is creating open source components that interoperate with Fedora as part of its output. Some of these have already appeared:

Others are scheduled to appear in 2006:

Developing with VTLS

ARROW decided that it needed to partner with a developer who could not only produce the software but could also provide ongoing user support and development after December 31, 2006. VTLS were identified by the project team as a suitable partner in the process, and they were interested in working in this area as well. They had already begun work on a repository solution using Fedora, they were familiar with the library sector because of their many years' experience in developing an integrated library management system (VIRTUA) and they were willing to produce a combination of a proprietary solution, Fedora and other open source software.

This decision has resulted in VITAL, which is ARROW specified software created and fully supported by VTLS, and built on top of Fedora. This software (as of the date of writing) includes a number of components:

The interrelationships of these components can be seen in Figure 1.

Implementation decisions

During the start-up phase of the project, it was necessary to make a number of decisions about how to construct the ARROW solution. The requirement for many of these implementation decisions was inherent in the repository solution that was chosen. The F in Fedora stands for Flexible. Fedora provides few constraints, but this requires deliberate decisions.

Atomistic or compound objects

The sort of process ARROW went through in making this decision can be illustrated by the diagram taken from a whiteboard shown in Figure 2.

Fedora objects (broadly speaking) can either be modelled as atomistic, or compound. Atomistic objects consist of an identifier, some metadata and (usually) one datastream. Compound objects consist of an identifier, some metadata, and multiple datastreams of different types. Thus a doctoral thesis as submitted for examination might consist of the bound text of the thesis, and an accompanying video. This could be modelled atomistically as a series of Fedora objects: the abstract as plain text, a PDF of the entire thesis, the XML of the entire thesis, an AVI of the video, and an MOV of the movie. It could also be modelled in a compound way as a single Fedora object which consists of each of the above elements as datastreams within the object.
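The two modelling options can be contrasted in a short sketch using the thesis example above. The `FedoraLikeObject` class, the identifier scheme and the datastream names are hypothetical and chosen only for illustration; they are not taken from Fedora or from the ARROW content models.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FedoraLikeObject:
    """An identifier, some metadata, and named datastreams (name -> MIME type)."""
    identifier: str
    metadata: Dict[str, str]
    datastreams: Dict[str, str] = field(default_factory=dict)


# The parts of the example thesis, with conventional MIME types.
THESIS_PARTS = {
    "abstract": "text/plain",
    "thesis_pdf": "application/pdf",
    "thesis_xml": "text/xml",
    "video_avi": "video/x-msvideo",
    "video_mov": "video/quicktime",
}


def model_atomistic(base_id: str, title: str) -> List[FedoraLikeObject]:
    """One object per part, each carrying exactly one datastream."""
    return [
        FedoraLikeObject(f"{base_id}.{i}", {"title": title, "part": name}, {name: mime})
        for i, (name, mime) in enumerate(THESIS_PARTS.items(), start=1)
    ]


def model_compound(base_id: str, title: str) -> FedoraLikeObject:
    """A single object holding every part as a datastream."""
    return FedoraLikeObject(base_id, {"title": title}, dict(THESIS_PARTS))


atomistic = model_atomistic("thesis:1", "An example thesis")
compound = model_compound("thesis:1", "An example thesis")
```

The atomistic form yields five objects whose relationship to one another must be recorded separately, whereas the compound form keeps the five datastreams together under one identifier and one metadata record.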

ARROW elected to choose compound objects, basing its decision around the majority use-cases:

It is anticipated that newer forms of research will lead to more content models and variations.

Descriptive metadata

Early in the project ARROW spent some months examining the idea that a single descriptive metadata schema for all the objects in the ARROW repositories would be a sensible goal. After looking at the strengths and weaknesses of numerous metadata schemas, and on considering the diversity of object types ARROW repositories could be required to store, it was decided that it was more realistic to accept that the project would need to support multiple descriptive metadata schemas. As a result, ARROW has decided to support the metadata generated by communities of practice to accompany their digital objects. This implies that an ARROW repository will contain a range of different metadata schemas attached to different objects. The VITAL software currently transforms MARCXML and ETD-MS metadata into Dublin Core for OAI-PMH and internal purposes. In the longer term, and to support other schemas, ARROW is investigating the possibility of using OCLC's interoperable metadata core (Godby et al., 2003). It is also possible that ARROW may need to write something itself.
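As an illustration of what transforming MARCXML into Dublin Core involves, the sketch below maps three MARC fields to Dublin Core elements. It is not the VITAL transformation, whose details are not given here; the three-entry `CROSSWALK` table and the sample record are assumptions made for the example, and a production crosswalk would cover many more fields.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"

# Deliberately tiny, illustrative crosswalk: (MARC tag, subfield code) -> DC element.
CROSSWALK = {
    ("100", "a"): "creator",
    ("245", "a"): "title",
    ("520", "a"): "description",
}


def marcxml_to_dc(marcxml: str) -> dict:
    """Map a single MARCXML record to a dict of Dublin Core element -> values."""
    record = ET.fromstring(marcxml)
    dc = {}
    for datafield in record.iter(f"{{{MARC_NS}}}datafield"):
        tag = datafield.get("tag")
        for sub in datafield.findall(f"{{{MARC_NS}}}subfield"):
            element = CROSSWALK.get((tag, sub.get("code")))
            if element and sub.text:
                dc.setdefault(element, []).append(sub.text.strip())
    return dc


# Made-up sample record, for illustration only.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" "><subfield code="a">Example, Author</subfield></datafield>
  <datafield tag="245" ind1="1" ind2="0"><subfield code="a">An example thesis title</subfield></datafield>
  <datafield tag="520" ind1=" " ind2=" "><subfield code="a">A short abstract.</subfield></datafield>
</record>"""

dc_record = marcxml_to_dc(SAMPLE)
```

Fields with no entry in the table are silently dropped, which shows why reducing a rich schema to Dublin Core is lossy and why supporting multiple native schemas was the more realistic choice.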

Persistent identifiers

After careful consideration of all the available alternatives, ARROW decided to use handles for all the partner university sites. The NLA decided to proceed using its existing persistent identifier scheme. The Handle System (www.handle.net/), developed by CNRI (http://cnri.reston.va.us/), is a comprehensive system for assigning, managing, and resolving persistent identifiers, known as handles, for digital objects and other resources on the internet. Handles can be used as Uniform Resource Names (URNs). Part of the work done by VTLS and released as open source has been the addition of handles integration to the Fedora software.

The ARROW repositories were designed from the beginning to be as flexible as possible. To this end, the project decided it would be good practice to be able to persistently cite both objects and components of objects. The ARROW software therefore assigns handles to each entire ARROW object (such as a thesis), and to each component of an ARROW object (such as the metadata, the thesis abstract, the thesis body, and the reference list). This means that repository managers can disaggregate and re-aggregate objects as required in the future without the user being aware of it. It also means that the minimum persistently citeable unit can be made as granular as is required.
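The idea of assigning identifiers to both whole objects and their components, and of relocating content without changing the identifier, can be sketched with an in-memory registry. This does not use the Handle System software or its protocol: `HandleRegistry`, `register_object`, the prefix `0000.0` and the target strings are all hypothetical.

```python
import itertools


class HandleRegistry:
    """Toy registry mapping persistent identifiers to current locations."""

    def __init__(self, prefix: str):
        self.prefix = prefix
        self._counter = itertools.count(1)
        self._targets = {}

    def mint(self, target: str) -> str:
        """Create a new identifier of the form prefix/suffix for a target."""
        handle = f"{self.prefix}/{next(self._counter)}"
        self._targets[handle] = target
        return handle

    def move(self, handle: str, new_target: str) -> None:
        """Point an existing identifier at a new location; the identifier is unchanged."""
        if handle not in self._targets:
            raise KeyError(handle)
        self._targets[handle] = new_target

    def resolve(self, handle: str) -> str:
        return self._targets[handle]


def register_object(registry: HandleRegistry, object_id: str, components: list) -> dict:
    """Mint one identifier for the whole object and one for each component."""
    handles = {"object": registry.mint(object_id)}
    for name in components:
        handles[name] = registry.mint(f"{object_id}/{name}")
    return handles


registry = HandleRegistry("0000.0")  # made-up prefix
handles = register_object(
    registry, "thesis:1", ["metadata", "abstract", "body", "references"]
)

# Disaggregate: the abstract becomes its own object, but its identifier is stable.
registry.move(handles["abstract"], "abstract:77")
```

A citation that uses `handles["abstract"]` keeps resolving after the move, which is what allows repository managers to restructure objects without the user being aware of it.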

External searching and harvesting

One of the project's aims was to develop a discovery service for Australian institutional repositories. This service, which is called the ARROW National Research Discovery Service (http://search.arrow.edu.au/) has been one of the key work areas undertaken by the National Library. It provides a national resource discovery service including:

This service harvests metadata using OAI-PMH from a number of different institutional research repositories at Australian universities. These repositories use a range of software (e-prints.org software, DSpace and Fedora) but all expose their metadata for harvesting. This service is now live and available either through a link from the ARROW website or directly at http://search.arrow.edu.au/. This service allows for searching research outputs across the Australian university sector.
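A minimal sketch of the harvesting step follows. It builds an OAI-PMH `ListRecords` request URL and parses a canned response rather than contacting a live repository; the base URL, record identifier and sample response are invented for the example, and nothing here is taken from the National Library's implementation. Resumption tokens, deleted records and error responses are omitted.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url: str, metadata_prefix: str = "oai_dc") -> str:
    """Build a ListRecords request URL for a repository's OAI-PMH endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_list_records(xml_text: str) -> list:
    """Extract the identifier and Dublin Core titles from each harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", namespaces=NS)
        titles = [t.text for t in rec.iterfind(".//dc:title", NS)]
        records.append({"identifier": identifier, "titles": titles})
    return records


# Canned response standing in for what a repository would return.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.edu:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example thesis title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""

url = list_records_url("https://repository.example.edu/oai")  # hypothetical endpoint
harvested = parse_list_records(SAMPLE_RESPONSE)
```

Because every repository exposes the same verbs and the same `oai_dc` format, the same two functions work regardless of whether the source runs ePrints, DSpace or Fedora, which is what makes a cross-repository discovery service practical.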

What has the project learnt so far?

A number of valuable lessons have been learnt during the course of the project, even beyond solving the many technical challenges. For instance, working with multiple partners has been very beneficial for the sharing of information and experiences, the sharing of development work and the multiple perspectives on issues of note.

The multiple perspectives on issues, however, have also led to scope creep and difficulty in managing expectations across the group. This has put pressure on the project management team, who have acted as intermediaries between the project and the developers. Software development, both commercial and open source, feels slow, but this is more a function of being trapped in the middle of it than any failings by the developers or partners. Development with a commercial partner can be tricky as well, as the priorities and needs of commercial and educational partners can occasionally conflict. The nature of standards in this area remains an ongoing problem, as the standards that are in place leave a fair amount of leeway, which has forced the project to spend large amounts of time discussing and trying to refine them for actual use.

The debate over open versus closed repositories, or information management versus accessibility is an ongoing issue, with much work to be done. The key finding is that there is no single rule that will work for all digital objects, and that flexibility will be needed into the future to make repositories effective in all circumstances.

Repositories are only partly about software – advocacy, policy, institutional engagement and grunt work need equal attention. Even with compulsory deposit policies, there continues to be a great deal of work to be done to fill the repository and encourage academics to submit their work. It is clear that no amount of discussion about repositories will fill them – the relevant data needs to be found.

Copyright continues to be a major area of difficulty. Even beyond the commonly discussed issues such as publisher versus author versions, attempts to enter material in areas such as performing arts will present a number of new challenges. For instance, a video of a dance work may have musical, choreography and personal intellectual property involved, any of which may prevent it being added to a repository.

ARROW Stage 2

Funding for the ARROW project has recently been extended until the end of 2007. The goals of the project over that span will be:

Conclusion

The ARROW project has gone from being a project examining repositories to one which has made a substantial contribution to a software solution that is now available at 14 Australian institutions as well as a growing number of institutions across the world. The repository field is much more mature than it was when the project started three years ago, but there is still much to be done. There are still some constraints arising from the immaturity of practice around repositories, and the consequent absence of standards across the sector. In short, the next three years (however they pan out) seem likely to be at least as exciting as the first three.

Figure 1. VITAL architecture overview

Figure 2. Initial ARROW content modelling

References

Godby, C.J., Smith, D., Childress, E. (2003), “Two paths to interoperable metadata”, paper presented at the 2003 Dublin Core Conference, DC-2003: Supporting Communities of Discourse and Practice-Metadata Research & Applications, Seattle, WA, September 28-October 2.

Lagoze, C., Payette, S., Shin, E., Wilper, C. (2005), “Fedora: an architecture for complex objects and their relationships”, Journal of Digital Libraries, special issue on complex objects, Vol. 6 No. 2, pp. 124-38, available at: www.arxiv.org/abs/cs.DL/0501012

Treloar, A. (2005), “ARROW targets: institutional repositories, open-source, and web services”, Proceedings of AusWeb05, the Eleventh Australian World Wide Web Conference, Southern Cross University Press, Southern Cross University, July, available from http://ausweb.scu.edu.au/aw05/papers/refereed/treloar/

About the authors

David Groenewegen has been ARROW Project Manager since 2006. Previously he spent a number of years in the areas of electronic information provision and information literacy at Monash University, and in information resources at the University of Ballarat.

Andrew Treloar is the Director and Chief Architect of the Australian ResearCH Enabling enviRonment (ARCHER) project. He is also the Project Architect for the Dataset Acquisition, Accessibility and Annotation e-Research Technologies (DART) project. Prior to these projects he was the lead for the development of the Information Management Strategy at Monash University, where he has worked since 1999. Andrew is the corresponding author and can be contacted at: Andrew.Treloar@its.monash.edu.au