Lost in migration: document quality for batch conversion to PDF/A

Roland Erwin Suri (ETH Library, ETH Zurich, Switzerland)
Mohamed El-Saad (ETH Library, ETH Zurich, Switzerland)

Library Hi Tech

ISSN: 0737-8831

Article publication date: 6 June 2018

Issue publication date: 21 June 2021

Abstract

Purpose

Changes in file format specifications challenge the long-term preservation of digital documents. Digital archives therefore often focus on specific file formats that are well suited for long-term preservation, such as the PDF/A format. Since only a few customers submit PDF/A files, digital archives may consider converting submitted files to the PDF/A format. The paper aims to discuss these issues.

Design/methodology/approach

The authors evaluated three software tools for batch conversion of common file formats to PDF/A-1b: LuraTech PDF Compressor, Adobe Acrobat XI Pro and 3-Heights™ Document Converter by PDF Tools. The test set consisted of 80 files, with 10 files each of the eight file types JPEG, MS PowerPoint, PDF, PNG, MS Word, MS Excel, MSG and “web page.”

Findings

Batch processing was sometimes interrupted by stops that required manual intervention. Depending on the software tool, three to four of these stops occurred during batch processing of the 80 test files. Furthermore, the conversion tools sometimes failed to produce output files even for supported file formats: between three (Adobe Pro) and seven (LuraTech and 3-Heights™) PDF/A-1b files were not produced. Since Adobe Pro does not convert e-mails, a total of 213 PDF/A-1b files were produced. The faithfulness of each conversion was investigated by comparing the visual appearance of the input document with that of the produced PDF/A-1b document on a computer screen. Meticulous visual inspection revealed that the conversion to PDF/A-1b impaired the information content in 24 of the 213 converted files (11 percent). These reproducibility errors included loss of links, loss of other document content (unreadable characters, missing text, missing document parts), updated fields (reflecting the time and folder of conversion), vector graphics issues and spelling errors.
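The totals reported above can be reproduced with a short calculation. The following Python sketch is only an illustration, not part of the authors' method. It assumes that the e-mails Adobe Pro cannot convert are the 10 MSG files, and that LuraTech and 3-Heights each failed on seven files; the abstract gives only the range of three to seven, so that split is an assumption which happens to match the stated total.

```python
# Figures from the abstract; the per-tool failure split is an assumption.
FILES_PER_TYPE = 10
FILE_TYPES = 8
TOOLS = 3

# Every tool is run on the full test set of 80 files.
attempted = FILES_PER_TYPE * FILE_TYPES * TOOLS

# Assumption: the e-mails Adobe Pro does not convert are the 10 MSG files.
unsupported = FILES_PER_TYPE

# Assumption: seven failures each for LuraTech and 3-Heights, three for Adobe Pro.
failed = {
    "Adobe Acrobat XI Pro": 3,
    "LuraTech PDF Compressor": 7,
    "3-Heights Document Converter": 7,
}

produced = attempted - unsupported - sum(failed.values())

# 24 converted files showed impaired information content.
impaired = 24
impaired_share = impaired / produced

print(f"produced: {produced}")
print(f"impaired share: {impaired_share:.1%}")
```

Under these assumptions the calculation yields 213 produced files, of which roughly 11 percent were impaired, in agreement with the figures in the abstract.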

Originality/value

These results indicate that large-scale batch conversions of heterogeneous files to PDF/A-1b cause complex issues that need to be addressed for each individual file. Even with considerable effort, some information loss seems unavoidable if large numbers of files from heterogeneous sources are migrated to the PDF/A-1b format.

Citation

Suri, R.E. and El-Saad, M. (2021), "Lost in migration: document quality for batch conversion to PDF/A", Library Hi Tech, Vol. 39 No. 2, pp. 337-351. https://doi.org/10.1108/LHT-10-2017-0220

Publisher: Emerald Publishing Limited

Copyright © 2018, Emerald Publishing Limited
