The ARROW project
A consortial institutional repository solution, combining open source and proprietary software
The Authors
David Groenewegen, Monash University, Clayton, Australia
Andrew Treloar, Monash University, Clayton, Australia
Abstract
Purpose – To provide an overview of the Australian Research Repositories Online to the World (ARROW) project.
Design/methodology/approach – A retrospective analysis of the first three years of the ARROW project.
Findings – Provides information about the decisions made by the ARROW project, and reviews how they turned out.
Originality/value – This paper provides a review of the first three years of the ARROW project (which was the original funding horizon) from the perspective of the project team.
Article Type:
Case study
Keyword(s):
Resource sharing; Research; Product life cycle; Digital libraries.
Journal:
OCLC Systems & Services: International digital library perspectives
Volume:
24
Number:
1
Year:
2008
pp:
30-39
Copyright ©
Emerald Group Publishing Limited
ISSN:
1065-075X
ARROW overview
Why did ARROW want a repository?
The ARROW project was envisaged in a time when institutional repositories and the software required for them were still in their infancy. In 2003 ePrints (www.eprints.org/) was essentially the only player in the field, although DSpace (www.dspace.org/) had also made its first appearance. Nevertheless, the ARROW partners recognised that there were a number of compelling reasons to look at the repository space, and to start working on options beyond the print equivalences that ePrints were concentrating on. These reasons included:
- the ability to provide a platform for promoting research output in the ARROW context;
- a way of safeguarding digital information;
- a single “place” to gather an institution's research output;
- provision for consistent ways of finding similar objects;
- a method to allow information to be preserved over the long term;
- a method to allow information from many repositories to be gathered and searched in one step;
- enabling resources to be shared, while respecting access constraints; and
- ways of enabling effective communication and collaboration between researchers.
A project proposal (http://arrow.edu.au/docs/files/ARROW%20project.pdf) was written and submitted to the Australian Commonwealth Department of Education, Science and Training (DEST) (www.dest.gov.au/), under the Systemic Infrastructure Initiative (www.dest.gov.au/sectors/research_sector/programmes_funding/general_funding/research_infrastructure/systemic_infrastructure_initiative.htm) Framework for Australian Higher Education funding scheme. This was approved in 2003, with the funding covering three years until December 31, 2006.
The initial goal of the project is best expressed in this quote from the original proposal:
The ARROW project will identify and test software or solutions to support best practice institutional digital repositories comprising e-prints, digital theses and electronic publishing.
The project has met, and exceeded, this basic goal by producing not only a test version of the software, but a working repository solution that is currently in use at a number of Australian universities, with more to come online shortly.
Who is ARROW?
The ARROW project has been managed by a consortium of Australian institutions: Monash University (lead institution), the National Library of Australia, the University of New South Wales and Swinburne University of Technology. The project currently employs the equivalent of three full-time staff to manage the project on a day-to-day basis. The project partners also provide staff time and other in-kind services to increase the number of staff available to work on development. These latter staff, however, are primarily responsible for the management of the repository installed at their own site.
Since the beginning of 2006, several other Australian universities have also signed up to use the ARROW solution for institutional repositories. These ARROW members are also working on development projects to develop and enhance the software.
What did the ARROW project set out to achieve?
The ARROW project had a number of specific goals that it wished to achieve. The key one was the need for a solution for storing any digital research output, regardless of the format in which it was created. For the sake of simplicity, and as a way to deal with familiar and accessible materials, the initial focus was on digital objects with print equivalents, specifically theses and journal articles. As the solutions for these areas have become clearer, the project has been looking at a range of other objects. These include datasets, specifically those produced as a part of research and which might usefully be attached to the published research, as well as learning objects that might need to be organised and made available from a repository.
From the beginning of the project there was a recognition that ARROW needed to be able to deal with more than just open access materials, and that some things stored in repositories need to be restricted for a variety of important reasons, such as copyright, confidentiality or ethical considerations, or because they are work in progress. Therefore the project has done considerable work on the access and authentication issues related to research outputs in digital repositories, often in partnership with the MAMS project (www.melcoe.mq.edu.au/projects/MAMS/). This work is ongoing at time of writing.
The Australian Government has had a system of reporting research for the purpose of tracking the output of universities. At the time the project was conceived, this took the form of reporting eligible research publications. This included the retention of copies at the reporting institution for the purposes of audit. It was envisaged that a repository could be used to help manage this process and to retain the audit copies. Since then there has been a change in direction to a proposed Research Quality Framework (RQF) system (www.dest.gov.au/sectors/research_sector/policies_issues_reviews/key_issues/research_quality_framework/default.htm), which will involve the review of research outputs by experts from outside Australia. DEST have identified that repositories offer the potential for widespread access to these outputs in a less labour intensive fashion, and ARROW has been working with them on how this might be achieved.
A key requirement of the project was to employ open standards to make sure the data stored in the repository would be transferable in the future. In conjunction with this it was determined that the project would develop and deliver open source tools back to the Fedora Community. This was also a requirement of the program under which ARROW was funded.
Of critical importance was the development of an overall solution that could offer on-going technical support and development past the end of the funding period. DEST and the developers of the project were concerned that in many cases projects are not sustainable unless centrally funded, and that this would not be appropriate in this area. The project needed to find a solution that would mean the repository created would have a viable strategy for ongoing sustainability.
The end result of these decisions is a software solution combining open source and proprietary software, made up of open source repository software called Fedora with a proprietary services layer called VITAL, which has been developed by VTLS Inc. This is not a centralised or hosting solution – each ARROW partner or member has their own hardware and software. As each ARROW partner or member is licensing the VITAL software from a commercial provider, they will receive the following benefits after the project ends (assuming they continue to pay the annual license fee):
- installation support;
- helpdesk and customer service;
- new features included in successive versions of the software.
Building ARROW
ARROW requirements
ARROW wanted:
- a robust, well architected underlying platform;
- a flexible object-oriented data model;
- to be able to have persistent identifiers down to the level of individual datastreams, accommodating its compound content model;
- to be able to version both content and disseminators (think of software behaviours for content);
- clean and open exposure of APIs with well-documented SOAP/REST web services.
Fedora
After a careful analysis of the candidates available at the time, ARROW selected Fedora as the underlying repository platform.
Since the beginning of the project ARROW has worked actively and closely with Fedora™ and the Fedora Community. The ARROW Project Technical Architect is a member of the Fedora Advisory Board, which provides long term guidance for the project. This commitment to Fedora is reinforced by VTLS Inc. The VTLS President is a member of the Fedora Advisory Board, and the VITAL Lead Developer is part of the Fedora Development Group.
Open source
As indicated above, the project is creating open source components that interoperate with Fedora as part of its output. Some of these have already appeared:
- SRU/SRW;
- HANDLES;
- JHOVE Metadata extraction;
- exposure to web indexing crawlers;
- VALET for web self-submission.
Others are scheduled to appear in 2006:
- LDAP authentication;
- administrative reporting;
- bulk citation export;
- statistics for public users;
- metadata synchronisation.
Developing with VTLS
ARROW decided that they needed to partner with a developer who could not only produce the software but could also provide ongoing user support and development after December 31, 2006. VTLS were identified by the project team as a suitable partner in the process, and they were interested in working in this area as well. They had already begun work on a repository solution using Fedora, they were familiar with the library sector because of their many years' experience in developing an integrated library management system (VIRTUA), and they were willing to produce a combination of a proprietary solution, Fedora and other open source software.
This decision has resulted in VITAL, which is ARROW specified software created and fully supported by VTLS, and built on top of Fedora. This software (as of the date of writing) includes a number of components:
- VITAL Manager – a Windows-based management tool, that allows for ingest, management, editing and deletion of objects in the repository.
- VITAL Portal – web-based tool for indexing and managing the repository.
- VITAL Access Portal – web-based searching front end for the repository.
- VALET – web-based self-submission tool.
- Batch loader tool – tool for ingesting multiple similar objects into the repository in bulk.
- Handles server – uses the CNRI technology to create persistent identifiers into the repository.
- Google indexing and exposure – to allow indexing of objects in the repository.
- SRU/SRW support – to allow for other searching and harvesting of the repository.
The interrelationships of these components can be seen in Figure 1.
Implementation decisions
During the start-up phase of the project, it was necessary to make a number of decisions about how to construct the ARROW solution. The requirement for many of these implementation decisions was inherent in the repository solution that was chosen. The F in Fedora stands for Flexible. Fedora provides few constraints, but this requires deliberate decisions.
Atomistic or compound objects
The sort of process the ARROW project went through in making this decision can be illustrated by the diagram taken from a whiteboard shown in Figure 2.
Fedora objects (broadly speaking) can either be modelled as atomistic, or compound. Atomistic objects consist of an identifier, some metadata and (usually) one datastream. Compound objects consist of an identifier, some metadata, and multiple datastreams of different types. Thus a doctoral thesis as submitted for examination might consist of the bound text of the thesis, and an accompanying video. This could be modelled atomistically as a series of Fedora objects: the abstract as plain text, a PDF of the entire thesis, the XML of the entire thesis, an AVI of the video, and an MOV of the movie. It could also be modelled in a compound way as a single Fedora object which consists of each of the above elements as datastreams within the object.
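The difference between the two modelling approaches can be sketched in a few lines of Python. This is only an illustration of the idea, not Fedora's or VITAL's actual data model: the class names, the identifier scheme and the MIME types are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Datastream:
    """One piece of content inside a repository object (hypothetical model)."""
    label: str
    mime_type: str


@dataclass
class RepositoryObject:
    """An identifier, some metadata, and one or more datastreams."""
    identifier: str
    metadata: dict
    datastreams: list = field(default_factory=list)


# The thesis components named in the text; MIME types are illustrative.
THESIS_PARTS = [
    Datastream("abstract", "text/plain"),
    Datastream("thesis-pdf", "application/pdf"),
    Datastream("thesis-xml", "text/xml"),
    Datastream("video-avi", "video/x-msvideo"),
    Datastream("video-mov", "video/quicktime"),
]


def model_atomistic(base_id, metadata, parts):
    """Atomistic modelling: one object per datastream."""
    return [
        RepositoryObject(f"{base_id}-{n}", dict(metadata), [part])
        for n, part in enumerate(parts, start=1)
    ]


def model_compound(base_id, metadata, parts):
    """Compound modelling: a single object holding every datastream."""
    return RepositoryObject(base_id, dict(metadata), list(parts))


metadata = {"title": "An example doctoral thesis"}
atomistic = model_atomistic("example:1", metadata, THESIS_PARTS)
compound = model_compound("example:1", metadata, THESIS_PARTS)

print(len(atomistic), "atomistic objects")
print(len(compound.datastreams), "datastreams in one compound object")
```

In the atomistic case the metadata is copied onto every object, whereas the compound object carries it once. That is one practical consequence of the choice described here.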
ARROW elected to use compound objects, basing its decision on the majority use-cases:
- journal articles;
- conference papers;
- working papers;
- books;
- book chapters;
- theses.
It is anticipated that newer forms of research will lead to more content models and variations.
Descriptive metadata
Early in the project ARROW spent some months examining the idea that a single descriptive metadata schema for all the objects in the ARROW repositories would be a sensible goal. After looking at the strengths and weaknesses of numerous metadata schemas, and on considering the diversity of object types ARROW repositories could be required to store, it was decided that it was more realistic to accept that the project would need to support multiple descriptive metadata schemas. As a result, ARROW has decided to support the metadata generated by communities of practice to accompany their digital objects. This implies that an ARROW repository will contain a range of different metadata schemas attached to different objects. The VITAL software currently transforms MARCXML and ETD-MS metadata into Dublin Core for OAI-PMH and internal purposes. In the longer term, and to support other schemas, ARROW is investigating the possibility of using OCLC's interoperable metadata core (Godby et al., 2003). It is also possible that ARROW may need to write something itself.
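To make the idea of a metadata crosswalk concrete, the following is a minimal sketch of turning a MARCXML record into unqualified Dublin Core elements. It is not the transformation that VITAL performs. The three field mappings are a small illustrative subset of the commonly used MARC to Dublin Core correspondences, and the sample record, the function name and the `metadata` wrapper element are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
DC_NS = "http://purl.org/dc/elements/1.1/"

# (MARC tag, subfield code) -> Dublin Core element. Illustrative subset only.
CROSSWALK = {
    ("100", "a"): "creator",
    ("245", "a"): "title",
    ("260", "c"): "date",
}

# A made-up record used purely for demonstration.
SAMPLE_MARCXML = """\
<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Example, Author</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example thesis title</subfield>
  </datafield>
  <datafield tag="260" ind1=" " ind2=" ">
    <subfield code="c">2006</subfield>
  </datafield>
</record>
"""


def marcxml_to_dc(marcxml):
    """Return an element whose children are the mapped Dublin Core fields."""
    record = ET.fromstring(marcxml)
    dc_root = ET.Element("metadata")  # hypothetical wrapper element
    for datafield in record.iter(f"{{{MARC_NS}}}datafield"):
        tag = datafield.get("tag")
        for subfield in datafield.iter(f"{{{MARC_NS}}}subfield"):
            element = CROSSWALK.get((tag, subfield.get("code")))
            if element is not None:
                dc = ET.SubElement(dc_root, f"{{{DC_NS}}}{element}")
                dc.text = (subfield.text or "").strip()
    return dc_root


ET.register_namespace("dc", DC_NS)
dc_record = marcxml_to_dc(SAMPLE_MARCXML)
print(ET.tostring(dc_record, encoding="unicode"))
```

Supporting a further schema in this style means adding another mapping table, which hints at why maintaining many pairwise crosswalks becomes unattractive as the number of schemas grows.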
Persistent identifiers
After careful consideration of all the available alternatives, ARROW decided to use handles for all the partner university sites. The NLA decided to proceed using its existing persistent identifier scheme. The Handle System (www.handle.net/), developed by CNRI (http://cnri.reston.va.us/), is a comprehensive system for assigning, managing, and resolving persistent identifiers, known as handles, for digital objects and other resources on the internet. Handles can be used as Uniform Resource Names (URNs). Part of the work done by VTLS and released as open source has been the addition of handles integration to the Fedora software.
The ARROW repositories were designed from the beginning to be as flexible as possible. To this end, the project decided it would be good practice to be able to persistently cite both objects and components of objects. The ARROW software therefore assigns handles to each entire ARROW object (such as a thesis), and to each component of an ARROW object (such as the metadata, the thesis abstract, the thesis body, and the reference list). This means that repository managers can disaggregate and re-aggregate objects as required in the future without the user being aware of it. It also means that the minimum persistently citeable unit can be made as granular as is required.
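The approach of assigning a handle to the whole object and to each of its components can be sketched as follows. This is a toy in-memory model, not the ARROW or CNRI software. The prefix `99999.1`, the sequential suffixes, the `HandleMinter` class and the internal location strings are all hypothetical. The only thing taken from the Handle System itself is the general `prefix/suffix` shape of a handle.

```python
from itertools import count

HANDLE_PREFIX = "99999.1"  # hypothetical naming authority, not a real prefix


class HandleMinter:
    """Toy handle registry: mints handles and records what they resolve to."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._next_suffix = count(1)
        self.registry = {}  # handle -> current internal location

    def mint(self, location):
        handle = f"{self.prefix}/{next(self._next_suffix)}"
        self.registry[handle] = location
        return handle


def register_object(minter, object_id, component_names):
    """Mint one handle for the whole object and one per component."""
    handles = {"object": minter.mint(object_id)}
    for name in component_names:
        handles[name] = minter.mint(f"{object_id}/{name}")
    return handles


minter = HandleMinter(HANDLE_PREFIX)
thesis_handles = register_object(
    minter, "thesis-1", ["metadata", "abstract", "body", "references"]
)

# Re-aggregation: the abstract moves to a different internal object.
# Only the registry entry changes; the cited handle stays the same.
minter.registry[thesis_handles["abstract"]] = "thesis-1-frontmatter/abstract"

for name, handle in thesis_handles.items():
    print(name, handle, "->", minter.registry[handle])
```

Because citations point at the handle rather than at the internal location, the repository manager is free to reorganise objects behind the scenes.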
External searching and harvesting
One of the project's aims was to develop a discovery service for Australian institutional repositories. This service, which is called the ARROW National Research Discovery Service (http://search.arrow.edu.au/) has been one of the key work areas undertaken by the National Library. It provides a national resource discovery service including:
- provision of an appropriate search interface, including simple search, advanced search, and browse options;
- contributing metadata and gateways to other networks, such as OAIster, Yahoo, Google;
- ensuring appropriate local institutional and national “branding” of the service, which occurs throughout the ADS interface and the exchanged metadata;
- providing appropriate subject-based access, based on the Australian Standard Research Classification list.
This service harvests metadata using OAI-PMH from a number of different institutional research repositories at Australian universities. These repositories use a range of software (e-prints.org software, DSpace and Fedora) but all expose their metadata for harvesting. This service is now live and available either through a link from the ARROW website or directly at http://search.arrow.edu.au/. This service allows for searching research outputs across the Australian university sector.
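As an illustration of what harvesting with OAI-PMH involves, the sketch below builds a `ListRecords` request and parses a response. It is not the National Library's harvester. The base URL, the sample response and the function names are assumptions, and the sample is trimmed (a real response also carries `responseDate` and `request` elements). The verb, the `oai_dc` metadata prefix, the `resumptionToken` argument and the XML namespaces come from the OAI-PMH 2.0 protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, resumption_token=None):
    """Build a ListRecords request; a resumption token replaces other arguments."""
    if resumption_token:
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return f"{base_url}?{urlencode(params)}"


# Trimmed, made-up response used so the example runs without a network.
SAMPLE_RESPONSE = """\
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.edu:1</identifier>
        <datestamp>2006-01-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example thesis title</dc:title>
          <dc:creator>Example, Author</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>token-2</resumptionToken>
  </ListRecords>
</OAI-PMH>
"""


def parse_list_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind("oai:ListRecords/oai:record", NS):
        records.append({
            "identifier": record.findtext(
                "oai:header/oai:identifier", namespaces=NS
            ),
            "titles": [t.text for t in record.iterfind(".//dc:title", NS)],
        })
    token = root.findtext("oai:ListRecords/oai:resumptionToken", namespaces=NS)
    return records, (token or None)


base = "http://repository.example.edu/oai"  # hypothetical endpoint
print(list_records_url(base))
records, token = parse_list_records(SAMPLE_RESPONSE)
print(records)
print(list_records_url(base, token))
```

Because every repository answers the same request in the same format, one harvester can collect metadata from e-prints.org, DSpace and Fedora installations alike. A full harvester would keep requesting pages until no resumption token is returned.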
What has the project learnt so far?
A number of valuable lessons have been learnt during the course of the project, even beyond solving the many technical challenges. For instance, working with multiple partners has been very beneficial for the sharing of information and experiences, the sharing of development work and the multiple perspectives on issues of note.
The multiple perspectives on issues, however, have also led to scope creep and difficulty in managing expectations across the group. This has put pressure on the project management team, who have acted as intermediaries between the project and the developers. Software development, both commercial and open source, feels slow, but this is more a function of being trapped in the middle of it than of any failings by the developers or partners. Development with a commercial partner can be tricky as well, as the priorities and needs of commercial and educational partners can occasionally conflict. The nature of standards in this area remains an ongoing problem, as the standards that are in place leave a fair amount of leeway, which has forced the project to spend large amounts of time discussing and trying to refine them for actual use.
The debate over open versus closed repositories, or information management versus accessibility is an ongoing issue, with much work to be done. The key finding is that there is no single rule that will work for all digital objects, and that flexibility will be needed into the future to make repositories effective in all circumstances.
Repositories are only partly about software – advocacy, policy, institutional engagement and grunt work need equal attention. Even with compulsory deposit policies, there continues to be a great deal of work to be done to fill the repository and encourage academics to submit their work. It is clear that no amount of discussion about repositories will fill them – the relevant data needs to be found.
Copyright continues to be a major area of difficulty. Even beyond the commonly discussed issues such as publisher versus author versions, attempts to enter material in areas such as performing arts will present a number of new challenges. For instance, a video of a dance work may have musical, choreographic and personal intellectual property involved, any of which may prevent it being added to a repository.
ARROW Stage 2
Funding for the ARROW project has recently been extended until the end of 2007. The goals of the project over that span will be:
- Supporting the use of repositories in the Research Quality Framework (RQF) exercise, currently expected to begin in 2008.
- Creative development of institutional repositories, which will entail funding the development of enhancements and innovations in the use of the ARROW solution and its software.
- Supporting Australian engagement with institutional repositories, in order to build on the work done thus far, and to grow the use of repositories in the higher education sector. A key area of this work is the formation of an ARROW Community, a structure designed to share experiences and knowledge between the various ARROW members.
- Building partnerships with relevant projects in Australia and around the world to further enhance repositories.
- The Persistent Identifiers and Linking INfrastructure (PILIN) sub-project, being undertaken by ARROW in partnership with the University of Southern Queensland, which is studying the feasibility of a sustainable identifier infrastructure for the whole of Australia.
Conclusion
The ARROW project has gone from being a project examining repositories to one which has made a substantial contribution to a software solution that is now available at 14 Australian institutions as well as a growing number of institutions across the world. The repository field is much more mature than it was when the project started three years ago, but there is still much to be done. There are still some constraints arising from the immaturity of practice around repositories, and the consequent absence of standards across the sector. In short, the next three years (however they pan out) seem likely to be at least as exciting as the first three.
Figure 1 VITAL architecture overview
Figure 2 Initial ARROW content modelling
References
Godby, C.J., Smith, D., Childress, E. (2003), “Two paths to interoperable metadata”, paper presented at the 2003 Dublin Core Conference, DC-2003: Supporting Communities of Discourse and Practice-Metadata Research & Applications, Seattle, WA, September 28-October 2.
Lagoze, C., Payette, S., Shin, E., Wilper, C. (2005), “Fedora: an architecture for complex objects and their relationships”, Journal of Digital Libraries, special issue on complex objects, Vol. 6 No. 2, pp. 124-38, available at: www.arxiv.org/abs/cs.DL/0501012
Treloar, A. (2005), “ARROW targets: institutional repositories, open-source, and web services”, Proceedings of AusWeb05, the Eleventh Australian World Wide Web Conference, Southern Cross University Press, Southern Cross University, July, available at: http://ausweb.scu.edu.au/aw05/papers/refereed/treloar/
About the authors
David Groenewegen has been ARROW Project Manager since 2006. Previously he spent a number of years in the areas of electronic information provision and information literacy at Monash University, and in information resources at the University of Ballarat.
Andrew Treloar is the Director and Chief Architect of the Australian ResearCH Enabling enviRonment (ARCHER) project. He is also the Project Architect for the Dataset Acquisition, Accessibility and Annotation e-Research Technologies (DART) project. Prior to these projects he was the lead for the development of the Information Management Strategy at Monash University, where he has worked since 1999. Andrew is the corresponding author and can be contacted at: Andrew.Treloar@its.monash.edu.au