Selective archiving of web resources: a study of processing costs

The Authors

Mirna Willer, University of Zadar, Zadar, Croatia

Tanja Buzina, National and University Library, Zagreb, Croatia

Karolina Holub, National and University Library, Zagreb, Croatia

Jasenka Zajec, National and University Library, Zagreb, Croatia

Miroslav Milinović, University of Zagreb Computing Centre, Zagreb, Croatia

Nebojša Topolščak, University of Zagreb Computing Centre, Zagreb, Croatia

Acknowledgements

The authors wish to thank the colleagues who took part in this assessment exercise: Hrvoje Brozović (IT Department), Danijela Getliher and Renata Petrušić (ISSN), Sofija Klarin (Croatian Institute for Librarianship, Project member), Tatjana Mihalić (Music collection), Robert Ravnić (Authority Control Department), Ingeborg Rudomino (UPWR), Tomica Vrbanc (CIP Unit) and Mirjana Vujić (Subject and Classification Department) from the National and University Library, and Mijo Åerek from the University of Zagreb Computing Centre (SRCE).

Abstract

Purpose – The purpose of this paper is to assess the costs incurred by the National and University Library of Croatia in processing Croatian web resources and in maintaining and developing the service, to analyse the present organisation and workflow of their processing, and to propose improvements.

Design/methodology/approach – The assessment period was two months, during which the members of staff involved minutely monitored their tasks. The results were compared to the same exercise reported by the National Library of Australia and to the processing costs of cataloguing Croatian print publications.

Findings – The bottom-up analysis of processing web resources shows that a balanced description of tasks and their distribution over staff members was established, and that the present workflow meets the requirements of efficient processing of web resources. As a general finding, approximately the same time was spent on archiving new items, as on the control and maintenance of the already archived ones due to the change of web resource properties, URL instability and the changes of technology. The comparative analysis showed: less time is spent on identification and selection and publishers' contacts on the part of the Croatian National Library compared to the Australian one; almost twice as much time was spent on gathering, quality assurance, and archiving instances in the Australian case than in the Croatian one; practically the same time was spent on cataloguing in both cases; and compared to cataloguing of print publications, significantly less time was spent on the print ones.

Originality/value – The paper is one of the two published articles on the in-depth analysis of the workflow and processing costs of managing and selectively archiving legal deposit copies of web resources in a national library. Its potential value lies in drawing the attention of library managers in institutions that deal with selective web archiving to the need to assess costs and services in view of the legal obligations of libraries to preserve national cultural web heritage and to meet present and future users' needs.

Article Type:

Research paper

Keyword(s):

Worldwide web; Archiving; Economics; National libraries; University libraries; Croatia.

Journal:

Program: electronic library and information systems

Volume:

42

Number:

4

Year:

2008

pp:

341-364

Copyright ©

Emerald Group Publishing Limited

ISSN:

0033-0337

Background to the digital archive of Croatian web resources

The National and University Library (NUL) of Croatia and the University of Zagreb Computing Centre (SRCE) developed the digital archive of web resources (DAMP) as a result of the co-operative project: Design of the System for Harvesting and Archiving Legal Deposit of Croatian Web Publications. The first phase of the project ran between 2003 and 2005, during which period DAMP was developed and fully integrated with the library information system, the first test version starting in early 2004 (Milinović and Topolščak, 2005; Willer and Milinović, 2005). Further development of DAMP was undertaken during two further one-year projects, the second of which finished in October 2007.

The NUL decided to archive web resources selectively. The decision for such an approach was made under the following circumstances:

  1. In 1997, Croatia passed a law by which online publications are included within legal deposit.
  2. As the majority of web resources that the NUL selects are publicly available on the internet, it can archive web resources without asking for permission from authors/publishers, and only give them notice about archiving. A minor number of publications are password protected.
  3. The NUL started experimenting with cataloguing web resources in 1998 (Klarin and Pigac, 2001) although without archiving them. In the following years the new International Standard for Bibliographic Description (ISBD) standards for electronic and continuing resources, and extensions to the UNIMARC format were implemented. With the change of the NUL's information system in 2006, the UNIMARC format was replaced by MARC 21.
  4. Each web resource to be archived is assessed according to pre-established selection criteria.
  5. Each web resource to be archived is assessed for its appearance (“look and feel”) and functionality to the fullest extent permitted by current technical capabilities.
  6. Each web resource to be archived is fully catalogued and thus retrievable in the integrated catalogue and available for inclusion in the national bibliography.
  7. The types and properties of individual items within DAMP become known to collection managers, thus enhancing their ability to:
    • develop understanding of this type of material;
    • develop/restructure new/existing methods, tools and workflow for their processing;
    • re-assess methods and tools for providing access to them; and
    • develop long-term preservation strategies.

After the end of the first phase of the project, during which the members of the project team from different organisational units took part in the development and setting up of the service, the NUL selected three members of the staff to work full time in a newly created unit for processing web resources (UPWR). The UPWR's tasks were broadly defined as identification, selection, cataloguing and archiving of web resources. The head of the unit was elected from the project team members. The UPWR was supported by the project manager from the NUL's Croatian Institute for Librarianship and staff members from different NUL departments working part-time on web resources, as well as by the development and technical team and the project manager from SRCE. The UPWR's everyday tasks ran in parallel to the third phase of the project.

By mid-May 2007 DAMP contained 1,640 items (titles) with 15,390 harvesting instances (i.e. single gatherings of a title which include the first and all subsequent gatherings of serial or integrating items). A total of 147 web resources had disappeared from the live web since 2004, while ten were password protected, i.e. access was restricted to use within the NUL. The total size of DAMP was 1 TB. A survey performed one year earlier, in May 2006, showed that DAMP contained 1,364 items with 6,041 instances, and a total size of 277 GB. This shows a growth of approximately 300 items per year, with a rise of 9,349 instances from May 2006 to May 2007. The same survey analysed the types of resources and types of formats in DAMP for the first two-year period (2004-2005). The result showed the following distribution of types of resources:

The types of format used by publishers or authors were (in order of preference):

  1. text/html;
  2. image/jpeg;
  3. image/gif;
  4. application/pdf; and
  5. application/x-javascript (Pigac and Buzina, 2006).

Most of the items in DAMP are freely available to anyone, anywhere in the world via the catalogue (http://katalog.nsk.hr/) and DAMP's interface (http://damp.nsk.hr). The technical infrastructure for the authorisation/authentication service is in place to enable access to password protected items in the NUL's reading rooms. In order to allow web search engines to harvest DAMP the OAI-PMH standard was in the implementation phase during the period of this study.

Figure 1 shows a bibliographic record from the NUL's WebPAC with two fields 856 URL, one pointing to the live resource on the internet, and the other to the web archive. Figure 2 shows the English-language version of the latest archived instance of the item – culturenet.
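To make the two-link arrangement concrete, the sketch below models a record with two 856 fields and separates the live address from the archived one. In MARC 21, subfield $u of field 856 carries the URI. Everything else here is an illustrative assumption rather than the NUL's actual record structure: the dictionary layout, the placeholder URLs, and the rule that an address on the damp.nsk.hr host is the archived copy.

```python
from urllib.parse import urlparse

# Hypothetical record: field tags map to lists of subfield dictionaries.
# The URLs are placeholders, not taken from the NUL catalogue.
record = {
    "245": [{"a": "Example web resource"}],
    "856": [
        {"u": "http://www.example.hr/"},
        {"u": "http://damp.nsk.hr/arhiva/example/"},
    ],
}

ARCHIVE_HOST = "damp.nsk.hr"  # assumed host of the web archive interface


def split_links(rec):
    """Return (live_urls, archive_urls) taken from the 856 $u subfields."""
    live, archived = [], []
    for field in rec.get("856", []):
        url = field.get("u")
        if url is None:
            continue
        host = urlparse(url).hostname
        (archived if host == ARCHIVE_HOST else live).append(url)
    return live, archived


live, archived = split_links(record)
print(live, archived)
```

Keeping both addresses in the same record means the catalogue entry stays useful even after the live resource disappears from the web.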

The funding for the development of DAMP, its integration into the catalogue, the purchase of the appropriate hardware and training of the staff involved were all taken from the running library budget.

Study of processing costs

Purpose of the study

The purpose of the study described here was two-fold. The first aim was to assess the costs of processing web resources, taking into account the time and type of task per item archived and the average time per task per type of resource processed, as well as to assess other costs related to the maintenance and development of the service. The period of the assessment was two months, during which the staff involved minutely monitored their tasks. Based on the research results, the second aim was to analyse and assess the current organisation and workflow of processing web resources and to propose improvements where necessary.

Methodology and staff involvement

The research took into account three types of staff involvement. Staff members of the NUL were from:

The tasks of these members of staff were broadly:

In addition, one member of staff from each of the Serials Cataloguing Department (SCD) and the Information Technology Department (ITD) co-ordinated ISSN assignment and serials cataloguing, and in-house software development, respectively.

Three staff members from SRCE were involved in:

Only the latter task was reported for this study.

Project team members comprised the staff from UPWR, CIP, SCD, ITD and the Croatian Institute for Librarianship (one staff member) and project managers from both institutions (two staff members), who were in charge of managing the project according to the development plan.

The results of assessing costs of processing web resources are compared first to the costs of processing web resources broken down to the same identified tasks as reported by the National Library of Australia (Phillips, 2005) and then to the rough estimate of processing costs for printed monograph and serial publications in the NUL during the same period.

Tasks for web archiving

The current workflow from identification to archiving of web resources is shown in schematic form in Figure 3.

The total number of members of staff working full/part-time on web resources per organisational unit per type of task is shown in Table I. The full details of the tasks are given in the following sections 1-12.

1. Identification

Sources for identification of a possible candidate for the archive, appropriate tasks and staff members involved were outlined.

1.1 Task: search for new web resource

Staff: UPWR, ISSN and Music collection.

Note: this task per item was measured but not taken into account in the final assessment because it was very difficult to estimate. See below 4 Assessment Analysis, and 4.3 Assessment of Costs of Other Activities.

1.2 Task: communication with publishers related to ISSN assignment

Staff: ISSN.

Note: although tasks related to providing CIP, ISBN and International Standard Music Number (ISMN) and related information were expected on the basis of the foreseen workflows, it turned out that during the research period there were no requests by publishers for ISBN and ISMN, while the CIP service retained its decision not to provide CIP records for web resources.

1.3 Task: assessing the online registration of the web resource on the library website filled in by authors/publishers

Staff: UPWR.

Note: it was expected that the Music collection would be involved in this task, but during the research period music publishers did not fill in the form but contacted the staff directly and/or vice versa.

2. Selection

2.0 Task: selection according to the criteria for ISSN assignment, and evaluation of the original resource on the web

Staff: ISSN.

2.1 Task: selection according to the Selection Criteria for Cataloguing and Archiving Web Publications (www.nsk.hr/DigitalLib.aspx?id=83)

Staff: UPWR and Music collection.

2.2 Task: checking records in the catalogue and archive

Staff: UPWR, ISSN, Music collection.

3. Formal and subject cataloguing

Cataloguing is performed by the staff in the ISSN, Music collection, UPWR, ACD and SCCD.

3.0 Task: assigning ISSN and creating new bibliographic records for serials and integrating continuing resources

Staff: ISSN.

3.1 Task: creating new bibliographic records for monographs, serials and integrating finite resources

Staff: UPWR and Music collection.

Note: UPWR catalogues serials that are outside the ISSN assignment scope.

3.2 Task: amending bibliographic description from minimal to full level record

Staff: UPWR.

3.3 Task: creating new authority records for monographs, serials and integrating finite resources

Staff: UPWR.

3.4 Task: updating existing authority records

Staff: UPWR.

3.5 Task: subject cataloguing and classification

Staff: SCCD and Music collection.

3.6 Task: updating bibliographic records for serials and integrating resources (continuing and finite) on the basis of changes in the resource

Staff: UPWR, ISSN and Music collection.

3.7 Task: checking bibliographic records

Staff: UPWR, ISSN and Music collection.

3.8 Task: checking authority records

Staff: ACD.

Note: tasks involved include checking the accuracy of the choice and form of uniform heading and references, and their content designation in MARC21 Authorities format, researching in order to verify the accuracy/necessary correction of the record, and correcting a record.

4. Archiving

Metadata for the bibliographic identification of an item (i.e. title proper, ISSN, ISBN, ISMN, LIS record identification number), its URL on the web, and its frequency of issuance and regularity are automatically transferred from the catalogue to DAMP. Only UPWR staff are involved in archiving. Archiving comprises two groups of tasks, one related to new items and the other to already archived ones.

4.1 Task: new items – archiving process

4.1.1 checking the item at its live address on the web.

4.1.2 defining the harvesting parameters and registering the item to harvesting queue.

4.1.3 checking the quality of the first harvesting.

4.1.3.1 repeating, when necessary, the harvesting with changed parameters.

4.1.3.2 deleting unsuccessful or poor instances of harvesting.

4.1.3.3 checking the archived item for display in the catalogue and DAMP's web interface.

4.1.4 defining the frequency of harvesting.

4.2 Task: archived items – quality control of archived instances

4.2.1 checking the availability of the item at its live web address according to the automatic monthly report.

4.2.2 changing harvesting parameters if a change in properties/structure has taken place.

4.2.2.1 de-activating harvesting parameters if the web resource has disappeared from the live web;

4.2.2.2 control of the multiple harvesting instances; and

4.2.2.3 deleting unsuccessful harvesting.

4.2.3 checking automatic daily reports for possible duplicates, and deleting them;

4.2.4 changing frequency of harvesting parameters; and

4.2.5 reporting on harvesting problems. Task passed on to SRCE (ref. to 12.3).
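The quality-control steps in 4.2 can be pictured as a triage of the automatic availability report. The sketch below is only an illustration under stated assumptions: the study does not describe the report's format, so the per-item HTTP status codes, the item identifiers and the mapping from status to follow-up action are all hypothetical.

```python
from http import HTTPStatus


def follow_up(status_code):
    """Map an HTTP status code to an assumed follow-up action (sketch only)."""
    if 200 <= status_code < 300:
        return "no action"                  # resource still available
    if status_code in (HTTPStatus.MOVED_PERMANENTLY, HTTPStatus.FOUND):
        return "update URL and parameters"  # resource moved (cf. 4.2.2)
    if status_code in (HTTPStatus.NOT_FOUND, HTTPStatus.GONE):
        return "de-activate harvesting"     # resource disappeared (cf. 4.2.2.1)
    if 500 <= status_code < 600:
        return "re-check later"             # server error, possibly temporary
    return "report to SRCE"                 # anything else (cf. 4.2.5)


# Hypothetical monthly report: item identifier -> last observed status code.
report = {"item-001": 200, "item-002": 301, "item-003": 404, "item-004": 503}
actions = {item: follow_up(code) for item, code in report.items()}
print(actions)
```

In practice a single failed request would probably not justify de-activating an item; the mapping is kept deliberately simple to show the shape of the decision.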

5. Updating the catalogue

After an archived item is checked for display in the catalogue and DAMP's web interface, a licensed UPWR member of staff updates the catalogue, usually once a day. This task was planned to be performed during the assessment period. However, at the very beginning of that period the procedure was automated and therefore not recorded here in a sequence of cataloguer's tasks, but as a procedure done in 11.2 Developing New Functionalities of the Digital Archive.

6. Communication with publishers

Communication with publishers involves staff from UPWR, CIP, ISSN and the Music collection.

6.1 Task: recommendations and help provided to publishers relating to different presentation elements of a web resource

Staff: ISSN.

Aspects considered here include: stability of title, existence of basic identification elements, i.e. imprint or its equivalent with the aim to obtain the standard of presentation required for identification and cataloguing.

6.2 Task: assignment of ISSN and notification sent to the publishers

Staff: ISSN.

6.3 Task: reporting to publishers about web resources harvested and archived upon first instance of harvesting

Staff: UPWR, Music collection.

Additionally, the harvesting robot identifies itself upon each instance of harvesting. Publishers are directed to relevant information on the DAMP website: the legal deposit document Guidelines for Publishers/Authors (www.nsk.hr/UserFiles/File/dokumenti/Smjernice%20za%20nakladnike_Verzija%201%200.pdf), and the selection criteria and design advice in Recommendations for the Creation of Web Publications (www.nsk.hr/DigitalLib.aspx?id=356). Staff communicate with publishers of serials and report back to the Library/UPWR about postings of a new issue/iteration.

6.4 Task: consulting the publisher on restrictions of access to the web resource

Staff: UPWR and Music collection.

6.5 Task: other types of communication with publishers, e.g. directing a publisher for relevant information to ISSN, ISBN, ISMN units

Staff: CIP unit, UPWR and Music collection.

Note: for giving advice on the presentation and layout of monographic resources; staff involved: UPWR; sending queries to publishers about updating/lack of posting of new content/issues according to publication pattern, sending information about poor harvesting results and referring to the Guidelines and Recommendations; staff involved: UPWR, SRCE.

7. Updating publishers register

7.1 Task: updating publishers register of resources for which no identification number is assigned

Staff: UPWR.

7.2 Task: updating publishers register of resources with an ISSN assigned

Staff: ISSN.

8. Training the library staff and publishers

8.1 Task: training the library staff and the publishers

Staff: UPWR, SRCE, Project manager (SRCE).

9. Promoting DAMP

9.1 Task: Promoting and publishing on the web, in magazines, at conferences, in journals

Staff: UPWR, SRCE, project members, project managers.

Note: during the research period specific topics were recognised as relevant issues for publishing by other staff members of the assessment exercise.

10. Communication within the Library and with other institutions

10.1 Task: exchange of information on publishers and web resources among NUL departments

Staff: all.

This involves meetings and consultations within the departments, and with other institutions and projects related to archiving web resources.

11. Design of the system for harvesting and archiving legal deposit of Croatian web publications: the third phase of the project

The aims of this third phase of the project, which ran in parallel to the research study, covered two main groups of tasks: improvement and development of DAMP. The first one included, among others:

The second group of tasks included the development of the application for indexing the content of DAMP according to the OAI-PMH standard (the testing was finished in the previous project period), and a module for monitoring its use.

The groups of tasks were the following:

12. Tasks performed by the SRCE

Tasks comprise analysing websites from the aspect of web technology used and determining the reasons for harvesting failure. In some cases it was possible to find the adequate harvesting parameters, while in other cases the web resource remained “unharvested” until the system was upgraded to accommodate the particular technological challenge.

Sample of web resources and assessment analysis

The analysis of assessing costs for processing web resources is divided into two sets of tasks:

  1. an assessment of costs per item processed; and
  2. an assessment of costs related to other activities, and maintenance and development of the service.

The cost is expressed in units of time required for a particular task. The assessment period was from 15 March to 15 May 2007, which was 42 working days or 315 h (7.5 h/day or 450 min/day). During that period 385 items were processed and recorded in the first set of tasks, i.e. these items passed through one or more tasks, with the final goal of archiving (Figure 4).
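The working-time figures above follow from simple arithmetic, reproduced here as a check. The last value, items processed per working day across all staff, is derived for orientation only and is not quoted in the study.

```python
WORKING_DAYS = 42        # 15 March to 15 May 2007, as reported
HOURS_PER_DAY = 7.5      # length of a working day, as reported
ITEMS_PROCESSED = 385    # items recorded in the first set of tasks

minutes_per_day = HOURS_PER_DAY * 60             # 450.0
total_hours = WORKING_DAYS * HOURS_PER_DAY       # 315.0
# Derived here for orientation only; this figure is not quoted in the study.
items_per_working_day = ITEMS_PROCESSED / WORKING_DAYS

print(minutes_per_day, total_hours, round(items_per_working_day, 1))
```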

Additionally, about 100 items were identified on the web and evaluated for inclusion in DAMP, but did not fulfil the selection criteria, while 14 items were password protected and the publishers/authors did not give permission for full-text access during the assessment period (see above tasks 1.1, 2.0 and 2.1). As these two groups of web resources had not gone through the full processing cycle which would result in archiving, they were not taken into account for the assessment of cost per item processed. The time spent on these tasks was categorised under other activities in the second set of tasks (see 4.3).

Before reporting on the assessment results, it is necessary to present data analysis of the sample web resources.

Data analysis of the sample web resources

The data analysis of the sample web resources takes into account the distribution of the type of resource and type of format, as well as the analysis of frequency of harvesting and number of harvesting parameters. The analysis should be put in relation to the complexity of a web resource (Christensen-Dalsgaard, 2004) and the time spent on its archiving. Figure 5 shows the distribution of the type of resource in the sample, with 54.81 percent being integrating resources, 28.57 percent serials and 16.62 percent monographs. The comparable figures for the earlier study (2004-2005) were: 40.33 percent, 30.99 percent and 28.66 percent.

Figure 6 shows the distribution of the type of format, with an extremely high percentage of web pages (text, image, sound, video) at 78.70 percent, followed by a much smaller share of doc/pdf (text) at 12.99 percent, and other types of formats at 8.31 percent. This result should be considered together with the analysis of the frequency of harvesting.

The analysis of the frequency of harvesting is indicated in Figure 7 and shows the highest percentage of the so-called manual harvesting at 59.74 percent. Manual harvesting means that the resource is being harvested following a specific request to the harvesting robot by the cataloguer (“harvest only now”), as opposed to automated harvesting which is performed by the harvesting robot according to a pre-defined frequency parameter. As to the automated frequency of harvesting, the sample shows the following distribution: 19.74 percent annually, 10.65 percent monthly, 5.45 percent other, with the smallest percentage of resources being harvested daily at 2.60 percent and weekly at 1.82 percent.

The reason for such a high percentage of manually harvested resources in the sample was two-fold. First, there was a particularly high number of resources, such as websites of institutions, which were harvested and archived only once (see high percentage of websites in Figure 6). Second, there was a number of serials issued irregularly, and the cataloguer was awaiting information from their publisher about the new release of the issue in order to archive them manually.
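The difference between manual and automated harvesting can be sketched as a scheduling rule. This is a minimal illustration, not DAMP's implementation: the function name, the frequency labels as strings and the approximate day counts for monthly and annual harvesting are assumptions.

```python
from datetime import date, timedelta

# Assumed intervals for the automated frequencies named in the text.
# "manual" has no interval: the cataloguer triggers each harvest on request.
INTERVALS = {
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),    # approximation
    "annually": timedelta(days=365),  # approximation
}


def is_due(frequency, last_harvest, today):
    """Return True if an automated harvest should run today."""
    if frequency == "manual":
        return False  # never scheduled; runs only on explicit request
    return today - last_harvest >= INTERVALS[frequency]


print(is_due("weekly", date(2007, 5, 1), date(2007, 5, 8)))
print(is_due("manual", date(2007, 5, 1), date(2007, 5, 8)))
```

Under this rule a manually harvested item never enters the robot's queue on its own, which matches the case of irregular serials where the cataloguer waits for word from the publisher.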

Figure 8 shows the number of harvesting parameters used. One parameter was used most often (68.31 percent), followed by three parameters (17.14 percent), two parameters (5.19 percent), and four parameters (3.12 percent). The main reason for this is that one parameter is mandatory; a further reason is the technically standard and/or stable design of the web pages in the sample.

To give a better idea of the tasks involved in web resource capturing related to the frequency of harvesting and number of harvesting parameters, the first eight items in the sample list for harvesting are shown in Table II, while the distribution of content type for the first five items on the same list is shown in Table III.

Expressed as item counts, with 385 items in the sample in total, the analysis shows the following. There were 211 integrating resources, followed by 110 serials and 64 monographic resources. The harvesting frequency shows that 230 items were manually harvested and archived, followed by ten harvested daily, seven weekly, etc. The distribution of the number of parameters shows that one parameter was defined for 263 items, two parameters for 20 items, three parameters for 66 items, etc. As to the format type, 303 websites were harvested and archived, and only 50 items in doc/pdf format, as can be seen in Table IV.
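The item counts and the percentages quoted for Figures 5 to 8 can be cross-checked against each other. The short sketch below recomputes the shares from the counts given in the text; only categories whose counts are stated are included, and the dictionary labels are ours.

```python
SAMPLE_SIZE = 385

counts = {
    "resource type": {"integrating": 211, "serial": 110, "monograph": 64},
    "harvesting frequency": {"manual": 230, "daily": 10, "weekly": 7},
    "harvesting parameters": {"one": 263, "two": 20, "three": 66},
    "format": {"web page": 303, "doc/pdf": 50},
}


def percentage(count, total=SAMPLE_SIZE):
    """Share of the sample, in percent, rounded to two decimals."""
    return round(100 * count / total, 2)


for group, values in counts.items():
    for label, count in values.items():
        print(f"{group:22} {label:12} {count:4d} {percentage(count):6.2f}%")
```

The recomputed shares agree with the reported ones, for example 54.81 percent for integrating resources and 59.74 percent for manual harvesting.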

Assessment of costs per item processed

The cost assessment exercise had the aim of obtaining the cost per item processed as to the type of task for the sample of 385 items. It was found that the time taken for each task was:

As the archiving tasks showed the highest percentage of staff involvement, a deeper analysis was done and it was found that:

The analysis shows that approximately the same time was spent on archiving new items as on the control and maintenance of the already archived ones. The reasons for this result lie in changes of resource properties, URL instability on the web, and changes of technology and website design. An example of the latter is Business.hr, which became un-harvestable after its design was changed to use a new technology, so that further technical analysis by SRCE was required to resolve the problem and to upgrade the system to accommodate the technological change.

The average time per task per type of resource processed, as can be seen in Figure 9, showed that original cataloguing of pdf files took more time than that of HTML/web files, whereas selection took slightly more time for HTML/web resources than for pdf files. Archiving (t.4), which took the most time, was significantly more demanding for HTML/web files than for pdf files (about 21-14 percent). Identification (t.1) required the same amount of time for the two types of resources. The average sum of all four tasks (t.1-t.4) shows that HTML/web files require more time for processing than pdf ones.

Assessment of costs of other activities

The second set of tasks that was assessed during the research exercise, and that could not be linked to the processing of a particular item, comprised:

The assessment showed that the time spent broke down in the following way (shown also graphically in Figure 10):

Table V shows the distribution of tasks per library unit in hours.

Findings of the research

The results obtained from assessing the costs of processing web resources are compared first to the costs of processing web resources as reported by the National Library of Australia (NLA) (Phillips, 2005), and then to a rough estimate of processing costs for printed monograph and serial publications in the NUL during the same period (March-May 2007).

The NLA is, to our knowledge, the only national library and institution that has reported on the topic and, therefore, the only one with which our results can be compared. While in our study we made two types of task assessment, one linked to the item and the other to the activities related to the archiving tasks and to setting up the service, the NLA presented only the first type of task assessment. In the following comparison of the costs of processing web resources in the two libraries (shown also in Table VI), the results of the NUL (in italics) are therefore presented as sums (time/staff member/item + time/staff member) in appropriate groups of tasks:

Another aim of the study was to compare these costs with those of cataloguing print publications in the NUL. As far as we were aware, there were no comparable analyses of processing costs for the different tasks pertaining to print publications, so only the cost of cataloguing could be compared. During the two months of the research, March-May 2007, an average of ten items (monographs or serials) were catalogued daily. Comparing this with the results of the assessment of web resources gives the following:

  1. Print publications:
    • Cataloguing: 45 min/staff member/item =ten items/day/staff member (monographs or serial).
  2. Web resources:
    • Identification, selection, cataloguing and archiving: 33.92 min/staff member/item =±14 items/day.
    • Cataloguing: 11.26 min/staff member/item =±28 items/day.

The drastic difference in time spent on a particular task in both comparisons, except for other activities (project development), is due, in principle, to the fact that in our case not every item passed through all the tasks that were monitored, i.e. the results show the average time spent on a particular task in the life cycle of processing web resources during the assessment period.

In order to get a more realistic assessment of time spent on cataloguing and archiving per item, we decided to perform a further analysis of the sample. In this analysis, the reported time spent on cataloguing and archiving (tasks 3 and 4) was averaged only over the number of items that passed through all stages of both processes.

  1. Cataloguing (formal and subject):
    • New bibliographic records: 31.3 min/staff member/item =±14.3 items/staff member/day.
    • Authority: 13.3 min/staff member/item =±33.8 items/staff member/day.
    • Subject: 10.8 min/staff member/item =±41.7 items/staff member/day.
    • Updating bibliographic records (three staff): 21 min/staff member/item =±24.1 items/staff member/day.

    Total sum: 76.4 min/staff member/item =±5.9 items/staff member/day.

  2. Archiving:
    • New items: 57.9 min/staff member/item =±7.8 items/staff member/day.
    • Archived items: 51 min/staff member/item =±8.8 items/staff member/day.

    Total sum: 108 min/staff member/item =±4.1 items/staff member/day.

Cataloguing and archiving: 184.4 min/staff member/item =±2.4 items/staff member/day (Table VII).
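The throughput figures in these lists appear to be obtained by dividing the 450-minute working day by the average time per item. The sketch below applies that conversion to a selection of the reported values; the function name is ours, and the published figures may differ in the last digit depending on how they were rounded.

```python
MINUTES_PER_DAY = 450  # 7.5 h working day, as stated for the assessment period


def items_per_day(minutes_per_item, minutes_per_day=MINUTES_PER_DAY):
    """Convert average minutes per item into items per staff member per day."""
    return minutes_per_day / minutes_per_item


# Minutes per staff member per item, copied from the lists above.
reported = {
    "authority records": 13.3,
    "subject cataloguing": 10.8,
    "cataloguing, total": 76.4,
    "archiving, new items": 57.9,
    "archiving, archived items": 51.0,
    "cataloguing and archiving": 184.4,
}

for task, minutes in reported.items():
    print(f"{task}: {items_per_day(minutes):.1f} items/staff member/day")
```

Read this way, an item that passes through full cataloguing and archiving occupies well over a third of one staff member's working day.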

The results are significantly different if compared to the above comparative analyses, i.e. Australian/Croatian results for web resources and Croatian results for print publications, respectively, as shown in Table VIII.

The comparison of given results shows the following:

Conclusions

As a general result of assessment analysis of processing costs, per task, per type of resource, we have come to the following conclusions:

  1. The bottom-up analysis of processing web resources showed that a balanced description of tasks and their distribution over staff members was established, and that the present workflow meets the requirements of efficient processing of web resources.
  2. The average time per task per type of resource processed shows that only cataloguing pdf files takes more time than HTML/web files, while the average sum of all four tasks (identification, selection, cataloguing and archiving) shows that HTML/web files require more time for processing than pdf ones.
  3. Cataloguing: the assessment results show:
    • relatively small number of entries (resources) compared to print publications;
    • almost the same time taken for original cataloguing and updating bibliographic records due to the changes of resource characteristics: specific to web resources vs print publications;
    • the average for original cataloguing and updating archived records shows that about one third more time is used for web resources than for the print ones.
  4. Archiving: the assessment results show:
    • high percentage of time used for archiving (server errors, requested resource resides under different URL, resource becomes permanently unavailable, or new design of website requires new harvesting parameters);
    • approximately the same time was spent on archiving new items, as on the control and maintenance of the already archived ones;
    • frequency of harvesting is in inverse proportion to the quality assurance of archived instances (i.e. the higher the frequency of harvesting, the less opportunity there is to control the quality of archived instances);
    • further training of the cataloguing staff is needed; and
    • staff with technical skills (knowledge of web technology and techniques) need to be employed in the UPWR.
  5. Development: the high percentage of time that the UPWR, and to a lesser degree ISSN, dedicated to development (research, services and tools, such as guidelines for cataloguing) shows that UPWR staff members have taken over these tasks from the project members as part of their everyday activities. The time used for development was 36.5 h in the UPWR (1.5 staff members) and 10.5 h in the ISSN Centre for Croatia and Serials Cataloguing Department (two staff members), against 22.8 h + 2.5 h for the co-ordinator and the Project member in the Croatian Institute for Librarianship (two staff members).

Selective archiving, apart from offering the advantage of full control over the management of web resources, has some disadvantages: archiving procedures are labour intensive, and the relatively small number of processed web resources leaves other parts of the national web domain unpreserved. To overcome these limitations and to meet the legal deposit principle of comprehensive coverage of national web collections, it is necessary to start bulk harvesting of the entire national domain.

Figure 1. Display of bibliographic record in MARC format from NUL's WebPAC

Figure 2. Home page of the English-language version of culturenet

Figure 3. From identification to archiving of Croatian web resources: a schematic of the workflow

Figure 4. Distribution of tasks from 1 (Identification) to 4 (Archiving) per item processed in minutes

Figure 5. Distribution of the type of resources in the sample

Figure 6. Distribution of the type of format

Figure 7. Frequency of harvesting per item

Figure 8. Number of harvesting parameters used per item

Figure 9. Average time per task per type of resource processed

Figure 10. Percentage of time for types of tasks performed by a library unit

Table I. Staff involved in the identified tasks

Table II. A sample of first eight items on the list to be harvested showing the frequency of harvesting (the frequency and its coded value), and the number and name of harvesting parameters used per item

Table III. Distribution of content type

Table IV. Distribution of types of resources, harvesting frequency, number of parameters and format type in numbers of items processed

Table V. Distribution of tasks per library unit in hours

Table VI. Comparison of the costs of processing web resources

Table VII. Cataloguing and archiving per person in minutes

Table VIII. Comparison of the costs of processing web resources that passed through all stages of cataloguing and archiving

References

Christensen-Dalsgaard, B. (2004), "Web archive activities in Denmark", RLG DigiNews, Vol. 8 No. 3, available at: http://digitalarchive.oclc.org/da/ViewObjectMain.jsp?fileid=0000070519:000006288124&reqid=633

Klarin, S., Pigac, S. (2001), "Hrvatske daljinski dostupne elektroničke serijske publikacije [The Croatian remotely accessible electronic serial publications]", Vjesnik bibliotekara Hrvatske, Vol. 43 No. 4, pp. 156-67.

Milinović, M., Topolščak, N. (2005), "The architecture of DAMP: a system for harvesting and archiving web publications", What in Scotland and What Is Needed – WIDWISAWN, Vol. 3 No. 3, available at: http://widwisawn.cdlr.strath.ac.uk/Issues/Vol3/issue3_3_1.html

Phillips, M.E. (2005), "Selective archiving of web resources: a study of acquisition costs at the National Library of Australia", RLG DigiNews, Vol. 9 No. 13, available at: www.rlg.org/en/page.php?Page_ID=20666#article0

Pigac, S., Buzina, T. (2006), "Selektivno arhiviranje hrvatskog weba: rezultati i otvorena pitanja (Selective archiving of Croatian web: results and open questions)", in Willer, M., Zenić, I. (Eds), 9. seminar Arhivi, knjižnice, muzeji: mogućnosti suradnje u okruženju globalne informacijske infrastrukture: zbornik radova, Hrvatsko knjižničarsko društvo, Zagreb, pp. 28-39.

Willer, M., Milinović, M. (2005), "Prema trećoj generaciji knjižnično-informacijskih sustava: hibridna knjižnica za hibridne usluge (Towards the third generation of library information systems: hybrid library for hybrid services)", in Katić, T. (Ed.), 8. seminar Arhivi, knjižnice, muzeji: mogućnosti suradnje u okruženju globalne informacijske infrastrukture: zbornik radova, Hrvatsko knjižničarsko društvo, Zagreb, pp. 36-67.

Further Reading

Nacionalna i sveučilišna knjižnica (Zagreb) (n.d.), "Kriteriji odabira obveznog primjerka mrežne građe za obradu i arhiviranje (Selection criteria for the choice of the legal deposit copy of web publications)", available at: www.nsk.hr/DigitalLib.aspx?id=83

Corresponding author

Mirna Willer can be contacted at: mwiller@unizd.hr