Gail Thornburg, OCLC, Dublin, Ohio, USA
W. Michael Oskins, OCLC, Dublin, Ohio, USA
The authors would like to thank Rich Greene, Janifer Gatenby and Jay Weitz for their patient help in clarifying music cataloging points, and for reading this paper.
Purpose – Describing musical pieces, whether sound recordings, scores, librettos, or videos, has always involved cataloger interpretation and judgment. There is considerable variation in records created for exactly the same item, and there is never “proof” that two records which seem to describe the same item actually do. This paper aims to address this issue.
Design/methodology/approach – This paper describes some of the challenges encountered in developing software for matching music records, and some approaches to making the software reliable.
Findings – The paper finds that matching can be used successfully to create GLIMIR clusters in the WorldCat database. Work is needed in several areas to complete the implementation, but intermediate results are promising.
Originality/value – This implementation will allow end-user applications to collocate resources, improving discovery and delivery in a complex bibliographic universe.
Clustering; Matching; Music; Cataloguing; Computer software.
OCLC Systems & Services: International digital library perspectives
Emerald Group Publishing Limited
The process and characteristics of matching and deduping records have been surveyed previously in, for example, O'Neill et al. (1993), and Naumann and Herschel (2010). An interesting discussion of matching techniques for large music databases is found in Uitdenbogerd and Zobel (1999), who explore similarity as applied to locating particular melodies in collections of MIDI files.
This paper focuses on challenges in music, in relation to the matching of bibliographic records and how this can vary depending on the purpose of the matching. Moreover, each record is not music, but describes music in a member library or institution's collection.
What do we mean by music?
The OCLC Extended WorldCat database (XWC) contains records for bibliographic items, authority control records, articles loaded from other sources, and other information. This discussion concerns only the category of bibliographic records, that is, descriptions of items that exist outside XWC. Within this category are records for many forms of item: books, videos, artwork, computer software, realia, and others.
Records for “music” include sound recordings of music in specific formats such as tape, disc, phonograph record, and electronic file, and even wire and wax cylinder recordings. They include scores for musical compositions, which may be published music or manuscript. They include librettos for opera compositions. They include videos of performances.
Why is this difficult?
Differing rules of record creation are one source of variance. That is, two records for the identical piece of music may read differently. Cataloguers creating the bibliographic records for musical items in the United States often use the rules described by AACR2, that is, the Anglo-American Cataloguing Rules, Second Edition (ALA, 2002). This glosses over a wealth of interpretations and local practices. Other countries may use different cataloging codes, such as RAK (Regeln für die alphabetische Katalogisierung) in Germany (Deutsche Nationalbibliothek, 2006). These introduce differences, but the codes generally are based on the International Standard Bibliographic Description (ISBD), so they have a general similarity. Another point here is that the rules and policies that interpret the cataloging rules have changed over time, so even perfectly good records created for the same item using AACR2 may be difficult to match.
The records created may also have been created in different formats, such as MARC (Library of Congress, 2010), or UNIMARC, a variant of MARC often used in Europe, or something else, such as national implementations of MARC. While the aim of MARC format is to define where specific pieces of record description, such as title or publisher, belong in a record, library and individual practices may introduce exceptions to what is considered a properly constructed record. Record descriptions in multiple languages are a challenge, particularly those containing abbreviations, which are frequently observed in statement of responsibility and imprint or collation fields.
Moreover, many of the MARC record fields used in a bibliographic record are free text, and while instructions exist as to descriptions in these fields, much is left to the interpretation and judgment of the individual librarian. Any software trying to equate two free text notes fields in two records faces a challenge in determining what is significant.
The problem description of this paper will outline the goals of varying types of matching implemented thus far, and then describe specific needs and challenges for matching music records.
Types and goals of matching in XWC
There are several distinct flavors of matching software in place. That is, there is one matching architecture and set of software described here, but each type is profiled to behave according to different rules. This paper will describe only types which pertain to music.
- Batch loading of records: Match a record from an institution to some “appropriate” record in WorldCat so that holdings may be set for a library. If a match cannot be found a new record may be added. Often millions of records are processed per day. The records may be for any format including music, and described in any language. This processing compares a candidate match to the incoming record point by point, and stops considering a record upon any point of mismatch resulting in “sudden death” (Thornburg, 2005) for efficiency in processing.
- Matching to merge duplicates, also known as Duplicate Detection and Resolution (XWCDDR): Find, compare records, and merge the small percentage which seem safe to call “equal.” This is more conservative than regular matching, as splitting apart records is a significant effort if the merge proved to be wrong. Naumann and Herschel (2010) define the goal of duplicate detection as “partitioning into sets of candidates where each set represents a different real-world object and all candidates within such a set are different representations of the same real-world object.” Of course the XWCDDR matching process has no direct knowledge of the real world objects, only second hand descriptions, so this process is heuristic, not definitive.
- Discovery: Find an item, using a query converted to a “record” to match with records for the item that the querying institution holds. This is closer to a reference use of matching, one query/record at a time, in contrast with batch loading of records. The queries tend to be converted to a record which is rather skeletal. And the querying software prefers to get the “best” match rather than an exhaustive list.
- Clustering: The goal is to match for creation of GLIMIR ids (Global Library Manifestation Identifier). This is for collocation, that is, to cluster together different descriptions of the same resource and to cluster together different resources having essentially the same content. These clusterings are composed of: records which might seem the same to an end user but are not quite similar enough to allow machine deduplicating; different forms of the same content; or records for the same item, but cataloged in different languages. Grouping of these similar items can unify displays for users. This is a radically different use of matching. GLIMIR will produce two new identifiers, indexes in WorldCat, that will also allow libraries to achieve these useful groupings in their own collections (Gatenby, 2010).
GLIMIR is an interesting use of matching software in that it takes a “big picture” view of record comparison. GLIMIR matching tests have identified enhancements that can also be applied to matching generally.
The immediate goals of GLIMIR are:
- to cluster together different descriptions of the same resource and to get a clearer picture of the number of actual manifestations in WorldCat so as to allow the selection of the most appropriate description; and
- to cluster together different resources with the same content to improve discovery and delivery for end users.
The ultimate goal of GLIMIR is to link resources in different sites with a single identifier to cluster hits and thereby maximize the rank of library resources in the web sphere (Greene, 2010). GLIMIR is related conceptually to the FRBR model. FRBR, Functional Requirements for Bibliographic Records, is an entity-relationship model which tries to represent the bibliographic universe (see Zhang and Salaba, 2009). One of the goals of FRBR is to improve the grouping together of similar items. In the same way, GLIMIR tries to group similar items at the sub-work level. In this way groups of very similar items can be displayed in a clearer way to users, and navigation enhanced.
The complexities of the music bibliographic universe, and its affinities with FRBR concepts, are reviewed by Vellucci (2007). The author discusses the problems of aggregate works and the whole/part relationship, as well as the complex nature of representing musical expressions of a given work.
Actions matching needs to work well
- Retrieve.
- Distinguish.
- Compare.
- Ignore.
- Make allowances.
- Cluster.
Retrieve: A system cannot evaluate potential match records if it cannot find them in the vastness of the XWC database. Many records can be retrieved by unique key, but often the incoming data has no unique numbers.
The query builder software, written for matching, dynamically creates the most extensive Boolean combination of free text search terms it can. These are mined from the incoming record. On the fly it creates a list of several queries, and keeps trying combinations until it succeeds. That is, it applies successive, stateless searches until matching considers it is “done”.
The challenge is dynamic optimization: If the query created is too explicit, legitimate matches may not be retrieved. An example would be failure to retrieve using a query with a title clause, which differs from a desired match due to minor but unpredictable wording variants. If the query is too general, the number of records retrieved is too large to process. In classical music, generic titles – “Sonata in C” – are more the rule than the exception. Without enough other information in the record, such as composer and publisher, date, thematic index, and so on, the title retrieval would lead to huge query results.
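The trade-off between overly explicit and overly general queries can be illustrated with a minimal sketch. It is not the query builder used in WorldCat: the field names, the order in which clauses are dropped, the `max_hits` ceiling, and the `search` callable are all assumptions made for illustration. The sketch tries the most specific query first and relaxes it step by step, stopping as soon as a workable result set is found or the result set becomes too large to process.

```python
from typing import Callable, Dict, List

Query = Dict[str, str]

# Assumed order in which clauses are relaxed; the title is kept to the end.
DROP_ORDER = ["publisher", "date", "composer"]


def build_queries(record: Query) -> List[Query]:
    """Return candidate queries ordered from most to least specific."""
    terms = {field: value for field, value in record.items() if value}
    queries = [dict(terms)]
    for field in DROP_ORDER:
        if field in terms and len(terms) > 1:
            terms.pop(field)
            queries.append(dict(terms))
    return queries


def retrieve(record: Query,
             search: Callable[[Query], List[str]],
             max_hits: int = 200) -> List[str]:
    """Apply successive, stateless searches until one gives a usable result."""
    for query in build_queries(record):
        hits = search(query)
        if len(hits) > max_hits:
            # Relaxing the query further could only enlarge the result set.
            break
        if hits:
            return hits
    return []
```

With a generic title such as “Sonata in C” and no other usable terms, the single title-only query in this sketch would exceed the ceiling and nothing would be returned, which is the situation the text describes.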
Distinguish: the medium matters. For music, matching needs to discriminate among:
- Same musical recording in different forms (DVD vs. Blu-ray, phonograph vs. tape cassette, and so on).
- Same music with different performers/cast lists. Examples are shown in the Inexact Match Decisions section. Distinguishing performances is distinguishing expressions of the same work.
- Different publishers of the same music. See section on Herbert L. Clarke.
- Different performances by the same performing group on different dates. That is, a bibliographic record created for a performance of the same music by the same group in the same place with the same publisher, may differ only in the date or time of performance. The two performances require two records and should not be merged.
- Different forms of scores for differing instruments.
- Different scores for differing voices. High voice, medium, low, need to be distinguished by the software.
Compare, in an organized way: The comparison point framework in matching includes obvious points such as title, publisher, date, place of publication, type of material. For most types of matching, all comparison points need to match either literally or substantively (in the judgment of the rule-based software). Literal matches are not so common.
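A point-by-point comparison with “sudden death” might be sketched as follows. This is an illustrative sketch only, not the production framework: the field names, the ordering of the points, and the case-insensitive comparison are assumptions. The optional `overrides` argument is likewise hypothetical and reflects the context checking discussed under strategies later in this paper, where a weak place of publication mismatch may be forgiven when the publisher is present and matches.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, str]
Check = Callable[[Record, Record], bool]


def same_field(field: str) -> Check:
    """Build a check that ignores case and surrounding spaces in one field."""
    def compare(a: Record, b: Record) -> bool:
        return a.get(field, "").strip().lower() == b.get(field, "").strip().lower()
    return compare


# Hypothetical profile: an ordered list of comparison points.
POINTS: List[Tuple[str, Check]] = [
    ("material_type", same_field("material_type")),
    ("title", same_field("title")),
    ("publisher", same_field("publisher")),
    ("place", same_field("place")),
    ("date", same_field("date")),
]


def publisher_confirms(a: Record, b: Record) -> bool:
    """Context check: the publisher is present in both records and matches."""
    return bool(a.get("publisher")) and same_field("publisher")(a, b)


# Assumed override table: which mismatches another point may forgive.
OVERRIDES: Dict[str, Check] = {"place": publisher_confirms}


def first_mismatch(incoming: Record, candidate: Record,
                   overrides: Optional[Dict[str, Check]] = None) -> Optional[str]:
    """Return the first failing point, or None when every point matches."""
    overrides = overrides or {}
    for name, compare in POINTS:
        if compare(incoming, candidate):
            continue
        forgive = overrides.get(name)
        if forgive is not None and forgive(incoming, candidate):
            continue
        return name  # sudden death: later points are never evaluated
    return None
```

Returning the name of the failing point, rather than a bare yes or no, makes it possible to report cases in which a single point such as date was the only mismatch.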
There are also specific comparison points applicable only to records for types of music. These include part, designation, publisher number, and instrument.
- Part, for example, allows the software to distinguish score vs score and parts.
- Designation discriminates among types of scores, e.g. condensed vs. close vs. vocal scores, which might otherwise match.
- Publisher number compares the numbers assigned by music publishers. Match and mismatch are tricky to apply.
- Instrument comparison distinguishes different families of instruments, and clusters related types. A keyboard cluster would contain instruments such as piano, celeste, organ, clavichord, and harpsichord.
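The instrument comparison can be pictured as a lookup against tables of instrument families. In the minimal sketch below, only the keyboard cluster is taken from the text; the brass entry, the function names, and the decision to treat two instruments of the same family as compatible are assumptions made for illustration.

```python
from typing import Dict, Optional, Set

INSTRUMENT_FAMILIES: Dict[str, Set[str]] = {
    # Keyboard cluster as listed in the text.
    "keyboard": {"piano", "celeste", "organ", "clavichord", "harpsichord"},
    # Assumed entry, included only so the example has a second family.
    "brass": {"cornet", "trumpet", "trombone"},
}


def family_of(instrument: str) -> Optional[str]:
    """Return the family name for an instrument, or None if it is unknown."""
    name = instrument.strip().lower()
    for family, members in INSTRUMENT_FAMILIES.items():
        if name in members:
            return family
    return None


def instruments_compatible(a: str, b: str) -> bool:
    """True when both instruments fall in the same known family."""
    family_a, family_b = family_of(a), family_of(b)
    if family_a is None or family_b is None:
        # Unknown instruments fall back to a literal comparison.
        return a.strip().lower() == b.strip().lower()
    return family_a == family_b
```

Under these assumptions a score for piano and a score for harpsichord would be treated as related, while piano and cornet would not.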
Ignore: Non substantive or trivial differences due to variations in free text description, or cataloger interpretation. And what is trivial? This is challenging. Matching employs all the lookup tables and equivalence lists possible to ensure that terms like “vol.”, “volume”, “bd.” and “band”, or “GPO” vs “Government Printing Office”, are known to represent equivalents. That's the easier part. Tougher cases are illustrated in the next section.
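The easier part, the equivalence lists, might look something like the sketch below. The table contents come from the examples in the text, but the choice of canonical forms, the function names, and the simple whitespace tokenization are assumptions; a production table would also need to consider language and field context, since a term such as “band” is not always a volume designation.

```python
from typing import Dict

# Single-token equivalents, mapped to an arbitrarily chosen canonical form.
TERM_EQUIVALENTS: Dict[str, str] = {
    "vol.": "volume",
    "vol": "volume",
    "bd.": "volume",
    "bd": "volume",
    "band": "volume",
}

# Multi-word phrases collapsed to a short canonical form.
PHRASE_EQUIVALENTS: Dict[str, str] = {
    "government printing office": "gpo",
}


def normalize(text: str) -> str:
    """Lowercase, collapse whitespace, and apply the equivalence tables."""
    lowered = " ".join(text.lower().split())
    for phrase, canonical in PHRASE_EQUIVALENTS.items():
        lowered = lowered.replace(phrase, canonical)
    tokens = [TERM_EQUIVALENTS.get(token, token) for token in lowered.split()]
    return " ".join(tokens)


def equivalent(a: str, b: str) -> bool:
    """True when two strings agree after normalization."""
    return normalize(a) == normalize(b)
```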
Make allowances: A system of tolerance is needed to handle data errors. If all records had to be exactly right, little would be added to the database. Data errors may include categories such as the following. These are only a few of the potential cases.
- Miscodings of MARC fixed fields which can seriously mislead software, for example as to the real format or language of the item being described.
- Edition statements which are not really meaningful. One example is the use of printings information as though it is an edition, a common practice of publishers in some countries.
- Publisher number errors. While publisher numbers for scores and sound recordings can be significant for matching, practices for input of the numbers have varied over the years. There is no real standard for the numbers, and the same number can be entered in different ways. In cases such as rereleases of recordings, the numbers of the previously released version may be found in the record for the newer release, which has the potential to confuse matching.
- Information hidden in the “wrong” MARC variable fields can blur important distinctions. If edition information is hidden in a free form, free text Notes field, it can be difficult for the matching software to find.
- Misinformation and bias are inherent risks for matching in a very large heterogeneous database. The problem is discussed in (Thornburg and Oskins, 2007).
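To illustrate the tolerance needed for publisher numbers, the following sketch reduces each number to a comparison key before testing for overlap. The normalization rule (uppercase, letters and digits only) and the function names are assumptions; the sketch does not address the rerelease problem noted above, in which a number from an earlier release appears in the record for a later one.

```python
import re
from typing import Iterable, Set


def publisher_number_key(raw: str) -> str:
    """Uppercase the number and keep only ASCII letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", raw.upper())


def publisher_number_keys(numbers: Iterable[str]) -> Set[str]:
    """Collect the non-empty keys for all numbers found in one record."""
    return {key for key in map(publisher_number_key, numbers) if key}


def publisher_numbers_agree(a: Iterable[str], b: Iterable[str]) -> bool:
    """True when the two records share at least one normalized number."""
    return bool(publisher_number_keys(a) & publisher_number_keys(b))
```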
Cluster: Matching generally wants to find all the “good” matches and return those records to the calling program. When matching is used to cluster similar records (as in GLIMIR), this goal is expanded. Good matches are more flexibly defined. Matching wants to bring together reasonably similar records into a grouping for display by an end user application. The rules of discrimination among match candidates are applied quite differently. Reprints or reproductions of an original would be brought together, for instance. Examples and distinctions specific to music records are noted next.
Strategies and challenges for matching music
With such varying goals, how can matching possibly work? The suite of software created for all the matching architectures, including GLIMIR, employs a few general strategies:
- Employ metatype to constrain the match universe. Metatype is a “world view” guiding an instantiation of a matching architecture. That is, matching knows throughout a match session that its type of matching is Deduping, Discovery/Reference, Cluster, or something else.
- Modify the query builder to customize for different forms of matching. This is labor intensive and risky. Query building can take into account known variants of the query terms it mines, such as the variants regularly encountered in transliterated Chinese syllables. Language information is not always definitive, as a high number of records are coded “und”, that is, undetermined or unknown. In this case the matching software can try to infer the language to generate more flexible queries. In fact this work is in progress.
- Soften the “sudden death” comparison point checking: with context-checking, matching can override some points of mismatch. Comparison points are modular and intended to make independent judgments. However, context sensitive information sharing can resolve a match decision when comparison points are allowed to inform each other. For instance a weak place of publication match could be acceptable if the publisher is present in both records and is found to match.
- Allow for varying practices in recording titles. For German records, it has proven essential to matching to exercise some understanding of common abbreviations of very long words in titles and other areas. The team has attempted to identify common patterns and the software has had improved success in matching these records. For music, however, small differences in titles have to be distinguished. The n-gram software used in matching tries to guess how “similar” two title strings are overall, based on the number of triplets in common (see Elmagarmid et al. (2007) for a survey of similarity measures in duplicate detection). This sort of technique allows for transposed letters and multiple typographical errors and can be helpful in matching. Yet this technique must be thwarted before it blindly matches “Symphony No. 8 in C” to “Symphony No. 9 in C”.
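The tension between tolerant title comparison and the need to separate numbered works can be shown in a short sketch. It is not the n-gram software used in matching: the Dice coefficient over character triplets, the 0.75 threshold, and the rule that the digit sequences in the two titles must agree are all assumptions chosen for illustration.

```python
import re
from typing import List, Set


def trigrams(text: str) -> Set[str]:
    """Return the set of character triplets in a normalized, padded title."""
    cleaned = " ".join(re.sub(r"[^a-z0-9 ]", " ", text.lower()).split())
    if not cleaned:
        return set()
    padded = f"  {cleaned} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Dice coefficient over the two triplet sets, from 0.0 to 1.0."""
    set_a, set_b = trigrams(a), trigrams(b)
    if not set_a or not set_b:
        return 0.0
    return 2 * len(set_a & set_b) / (len(set_a) + len(set_b))


def numbers_in(text: str) -> List[str]:
    """Return the digit sequences in a title, in order of appearance."""
    return re.findall(r"\d+", text)


def titles_match(a: str, b: str, threshold: float = 0.75) -> bool:
    """Fuzzy title match that refuses to equate differently numbered works."""
    if numbers_in(a) != numbers_in(b):
        return False
    return trigram_similarity(a, b) >= threshold
```

In this sketch “Symphony No. 8 in C” and “Symphony No. 9 in C” share 16 of 19 triplets each, a similarity of about 0.84, so the similarity score alone would accept them; the number check is what keeps them apart. Numbers written as words or Roman numerals would need separate handling.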
The case of Herbert L. Clarke
Herbert L. Clarke was a cornet player, composer, and bandmaster, and is considered to be one of the greatest cornet soloists ever. Two actual publishers' versions of one of his works are described in Table I. Even having the items in hand, as the authors did, distinguishing what really matters in the date and publisher information is complex. The contents of the published items appeared to be identical. Imagine the challenge to GLIMIR matching to sort this out.
The authors searched WorldCat for records that seemed to represent these two items. Two records, OCLC numbers 45258089 and 13068832, were used as seed records to feed the GLIMIR clustering process to see what else might be brought together.
The GLIMIR results showed that work remains in the clustering of music records. For the first record, the software did cluster three matches, but what is interesting is that the not-quite-matched records included four cases in which date was the only mismatch point. This sort of case may merit investigation or even tuning of the date rules for clustering in music. In addition, the weighting of publisher comparison in the matching results makes this case an interesting contrast of regular matching and GLIMIR matching.
Inexact match cases and the matching software decisions
The cast list is a recent addition to matching. The test set for Bernstein's Mass was a quick education in the necessity for cast list comparison, and in its intricacies. A few examples are shown in Table II. Bear in mind, the practices of recording cast list information vary widely. Illustrations are from the test results encountered with records from the XWC database.
The principles are simple, and the risk is acknowledged. The software requires that at least one performer name in common be found. The hard work is in the correct mining of numerous fields for the free-text versions of something the software can identify as a name, and as a performer.
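The “at least one performer in common” rule can be sketched as a set intersection over normalized names. The sketch assumes the hard part, mining names from the record, has already produced free-text cast statements with names separated by semicolons; that input format, the handling of role hints in parentheses, and the function names are assumptions rather than a description of the actual software.

```python
import re
from typing import Iterable, Set


def name_key(name: str) -> str:
    """Reduce inverted and direct-order forms of a name to a single key."""
    without_roles = re.sub(r"\(.*?\)", " ", name)  # drop hints such as (conductor)
    words = re.findall(r"[^\W\d_]+", without_roles.lower())
    return " ".join(sorted(words))


def performer_keys(cast_statements: Iterable[str]) -> Set[str]:
    """Collect name keys from semicolon-separated cast statements."""
    keys = set()
    for statement in cast_statements:
        for part in statement.split(";"):
            key = name_key(part)
            if key:
                keys.add(key)
    return keys


def cast_lists_agree(a: Iterable[str], b: Iterable[str]) -> bool:
    """True when the two records share at least one performer name."""
    return bool(performer_keys(a) & performer_keys(b))
```

Sorting the words of each name lets “Surname, Forename” and “Forename Surname” compare as equal, at the acknowledged risk of equating different people whose names share the same words.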
Expressions of a work
There are multiple dimensions to the design of good clusters in music. Consider Handel's Messiah. One arrangement/score was created by Handel, another created by Mozart in German.
When differing groups perform Messiah, does the algorithm cluster first by arrangement and then by cast? For some users of the cluster, does the language signify more than the arrangement?
Another challenge of a set like Handel's Messiah is the size of the FRBR work set. The software searches on the FRBR work id, to bring in more candidates for GLIMIR matching.
However, for a set like the Messiah, searching on the work id “25204” retrieves a huge number of records. These include items that should definitely be in different GLIMIR clusters, because GLIMIR is more specific than the level of a FRBR work. The challenge to the matching software would be to find a way to build queries which are narrower than FRBR work set, but use the work id as a term in the query built.
Puccini – Il Trittico (three one-act operas)
In this case even the liberal GLIMIR matching needs to make distinctions more specific than differing cast lists. Recording date checks use simpler software logic than cast list comparisons, so they can be more effective than cast list checking. In the following matching software comparisons a cast list match is contradicted by a recording date mismatch. This raises questions as to the desirability for GLIMIR of grouping records that match in all respects except recording date. The examples shown in Table III, from Il Trittico and one other work, illustrate matching decisions.
Conclusions, and future work
As WorldCat grows in international coverage, the matching team is working on enhancements to language inference and improved query formulation. At the same time, the team is exploring use of authority control information in new ways. In some respects this could enhance GLIMIR clustering. The Messiah examples illustrate the need for further enhancements in query building.
Improvements to the known families of instruments are desirable, as are experiments with music media such as MIDI files. Cast list enhancements to recognize key elements such as director or conductor are under consideration.
Design explorations for enhancements to improve the GLIMIR clustering of cases like Clarke are possible. There is also the matter of dealing with aggregate works in music: how does one cluster a record for a sound recording containing more than one musical work?
Consideration of how work in GLIMIR can inform the FRBR implementation is needed. GLIMIR clusters which match across work sets will result in merging of FRBR work IDs in WorldCat, which will enhance the FRBRization of WorldCat. FRBR work ids are also used to inform matching, so as to make indirect use of the authority work used in generation of the FRBR ID. (That is, matching might retrieve candidates via work id which could not be retrieved by author or title of a given record.) Whether GLIMIR clustering results should also cause the splitting of a work set has not yet been determined.
Matching of bibliographic records in non-Latin scripts is an area with lots of work yet to be done. What it would take to match records for music with key information expressed in Hebrew or Arabic or Cyrillic is interesting to contemplate.
It should be evident from the discussion above that the rules and techniques for matching are never really complete. A new national library joins the consortium, a change in cataloging standards is announced, a new system decides to use the matching architecture. Matching has to respond, to grow and change, and learn.
ALA (2002), Anglo-American Cataloguing Rules, 2nd rev. ed., American Library Association, Chicago, IL.
Deutsche Nationalbibliothek (2006), “Regeln für die alphabetische Katalogisierung”, available at: www.d-nb.de/standardisierung/pdf/rak_4_erg.pdf (accessed January 22, 2011).
Elmagarmid, A.K., Ipeirotis, P.G., Verykios, V.S. (2007), “Duplicate record detection: a survey”, IEEE Transactions on Knowledge and Data Engineering, Vol. 19 No. 1, pp. 1-16.
Gatenby, J. (2010), “GLIMIR: its potential impact”, unpublished PowerPoint slide presentation, February.
Greene, R. (2010), “Cataloging alchemy: making your data work harder”, video presented at the ALA Annual Meeting, June 28, available at: http://vidego.multicastmedia.com/player.php?p=ntst323q (accessed January 22, 2011).
Library of Congress (2010), “MARC 21 concise format for bibliographic data (October 2010)”, available at: www.loc.gov/marc/bibliographic (accessed September 28, 2010).
Naumann, F., Herschel, M. (2010), An Introduction to Duplicate Detection, Morgan & Claypool, San Rafael, CA.
O'Neill, E.T., Rogers, S.A., Oskins, W.O. (1993), “Characteristics of duplicate records in OCLC's online union catalog”, Library Resources and Technical Services, Vol. 37 No. 1, pp. 59-71.
Thornburg, G. (2005), “Matching: discrimination, misinformation, and sudden death”, paper presented at Informing Science Conference (INSITE), Flagstaff, AZ, June.
Thornburg, G., Oskins, W.O. (2007), “Misinformation and bias in metadata processing: matching in large databases”, Information Technology and Libraries, Vol. 26 No. 2, pp. 15-22.
Uitdenbogerd, A., Zobel, J. (1999), “Melodic matching techniques for large music databases”, MULTIMEDIA '99: Proceedings of the Seventh ACM International Conference on Multimedia (Part 1), ACM, New York, NY, pp. 57-66.
Vellucci, S.L. (2007), in Taylor, A.G. (Ed.), Understanding FRBR: What It Is and How It Will Affect Our Retrieval Tools, Libraries Unlimited, Westport, CT, pp. 131-51.
Zhang, Y., Salaba, A. (2009), Implementing FRBR in Libraries: Key Issues and Future Directions, Neal-Schuman Publishers, New York, NY.
About the authors
Gail Thornburg has taught at the University of Maryland and elsewhere, and is now a Senior-level Developer and Researcher at OCLC. Gail Thornburg is the corresponding author and can be contacted at: email@example.com
W. Michael Oskins has worked as a Developer and Researcher at OCLC for over 20 years.