Addressing the low indexing ratios of IRs in Google Scholar
Kenning Arlitsch, J. Willard Marriott Library, University of Utah, Salt Lake City, Utah, USA
Patrick S. O'Brien, J. Willard Marriott Library, University of Utah, Salt Lake City, Utah, USA
The authors would like to thank Dr Awesome for her expertise, edits, and unflagging support.
Purpose – Google Scholar has difficulty indexing the contents of institutional repositories, and the authors hypothesize the reason is that most repositories use Dublin Core, which cannot express bibliographic citation information adequately for academic papers. Google Scholar makes specific recommendations for repositories, including the use of publishing industry metadata schemas over Dublin Core. This paper aims to test a theory that transforming metadata schemas in institutional repositories will lead to increased indexing by Google Scholar.
Design/methodology/approach – The authors conducted two surveys of institutional and disciplinary repositories across the USA, using different methodologies. They also conducted three pilot projects that transformed the metadata of a subset of papers from USpace, the University of Utah's institutional repository, and examined the results of Google Scholar's explicit harvests.
Findings – Repositories that use GS-recommended metadata schemas and express them in HTML meta tags experienced significantly higher indexing ratios. The ease with which search engine crawlers can navigate a repository also seems to affect indexing ratio. The second and third metadata transformation pilot projects at Utah were successful, ultimately achieving an indexing ratio of greater than 90 percent.
Research limitations/implications – The second survey is limited to 40 titles from each of seven repositories, for a total of 280 titles. A larger survey that covers more repositories may be useful.
Practical implications – Institutional repositories are achieving significant mass, and the rate of author citations from those repositories may affect university rankings. Lack of visibility in Google Scholar, however, will limit the ability of IRs to play a more significant role in those citation rates.
Social implications – Transforming metadata can be a difficult and tedious process. The Institute of Museum and Library Services has recently awarded a National Leadership Grant to the University of Utah to continue SEO research with its partner, OCLC Inc., and to develop a toolkit that will include automated transformation mechanisms.
Originality/value – Little or no research has been published about improving the indexing ratio of institutional repositories in Google Scholar. The authors believe that they are the first to address the possibility of transforming IR metadata to improve indexing ratios in Google Scholar.
Search engines; Digital libraries; Google Scholar; Institutional repositories; Search engine optimization; Metadata.
Library Hi Tech
Emerald Group Publishing Limited
Search engine optimization (SEO) research conducted at the University of Utah has revealed that many institutional repositories (IRs) have a low indexing ratio in Google Scholar (GS). IRs were developed to manage and ensure long-term access to academic publications, and GS was created as a search engine for those publications, whether they reside in IRs, at publisher repositories, or other research-oriented sites. This paper addresses the reasons for the low indexing ratio of many IRs, which the authors believe stem mainly from the metadata requirements of GS, which stand in contrast to the practices of many IRs. The authors conducted two surveys of IRs across the country and implemented three pilot projects designed to increase the indexing ratio of the IR at Utah. Transforming the metadata schema in those pilot projects led to a significant improvement in GS indexing ratio of the sample set. Additional reasons for the low indexing ratio of IRs can be tied to the ease with which GS's crawlers can navigate a given repository.
While much has been written about search engine optimization for general websites, very little has been published about SEO specifically for digital repositories and even less for institutional or disciplinary repositories. The subject of this paper developed from more general digital repository SEO research that the authors have conducted at the University of Utah's J. Willard Marriott Library for the past 18 months; continuation of that research has recently been funded by a National Leadership Grant from the Institute of Museum and Library Services (IMLS). OCLC is a formal partner on this grant. The authors have dramatically improved the indexing ratio of Utah's digital library (including IR) in Google's main index, and some of that work will be discussed as background.
Digital repositories relevant to this article are defined as databases that store digitized or born-digital objects, making them freely accessible to the public. These repositories typically run on web server technologies, use descriptive metadata, and because their missions are generally directed at open access they all benefit from successful harvesting and indexing by internet search engines. IRs are that subset of digital repositories that capture and manage the intellectual output of academic institutions or disciplines.
The USpace IR at the University of Utah currently experiences an indexing ratio of less than 0.1 percent in GS, even though the indexing ratio for the same repository improved from approximately 18 to 98 percent in Google's main index after numerous SEO problems were addressed. The authors hypothesized that the average indexing ratio in GS of IRs across the USA is low, and that altering repository metadata to follow one of the publishing industry schemas recommended by Google Scholar would lead to a substantial improvement in the indexing ratio of USpace content.
The authors conducted two surveys to identify the indexing ratio of other IRs, and the methodologies used for both are explained more fully in the survey methodologies section. In brief, the authors selected repositories from the OpenDOAR directory (University of Nottingham, 2011), gathered total item numbers from each repository, and then systematically sampled GS for evidence that those items had been included in its index. Two surveys (survey 1 and survey 2) were conducted, each using a substantially different searching methodology.
To test improvements to the USpace IR the authors gathered indexing ratio statistics and created a feedback loop to measure results of changes they made to repository items, demonstrated in three pilot studies. They submitted sitemaps containing URLs for all items in USpace to GS, and then used Google Webmaster Tools (Google, 2011) to observe the activity of crawlers and the resulting indexing ratios. The authors adjusted the metadata for a subset of repository articles, and following a re-harvest they gathered additional statistics (pilot 1). After this approach failed, the authors conducted discussions with OCLC and GS to confirm a new approach, which led to the development of metadata templates for various academic paper types whose effectiveness was tested during explicit harvests by GS (pilots 2 and 3).
Google and Google Scholar
Google and Google Scholar are separate indexes, and GS has a different focus from its much larger parent. Dr Anurag Acharya, GS's founding engineer, has stated that the goal is to offer the “most comprehensive list of research papers available on the Web,” and that GS limits its results to “peer reviewed papers, theses, books, abstracts, and technical reports” (Assisi, 2005). More recently, GS has added patents and legal cases to the items it indexes.
GS has its own crawlers (also known as spiders or robots) that visit repositories and publisher sites, among others, to harvest content appropriate for its index. A peculiarity of GS's presentation of academic papers is that it generally provides a link directly to the PDF document. This is expedient for users as it gets them directly to the content, but it also strips any context that may have been provided by the repository's HTML display. In other words, metadata, institutional logos, and other information normally displayed to users are lost unless they are inserted into the PDF itself. The practice can also affect the reporting of visitation statistics through website analytics software that utilizes page tagging. For instance, Google Analytics requires a tracking code inserted in the HTML of each page of a given website to gather statistics. Each time that page is displayed in a web browser it is counted as a visit by Google Analytics, but separating the PDF file from the HTML display means that the visit will not be counted because the required tracking code is not executed when the PDF is called directly. This problem can be overcome by having the webserver execute a PHP script containing the tracking code before serving the requested PDF, but it is unlikely that many repository managers are doing this, or are even aware that their visitation and download statistics may be underreported as a result of GS's item display practice.
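The server-side workaround described above can be illustrated with a minimal sketch. The paper mentions a PHP script; the Python version below is only an analogous illustration, and the names make_pdf_app and record_hit are hypothetical, not part of any repository platform or analytics product. The sketch assumes the repository can route PDF requests through a small WSGI application that records the request before returning the file.

```python
from typing import Callable, Iterable


def make_pdf_app(pdf_bytes: bytes, record_hit: Callable[[str], None]):
    """Return a WSGI app that records the request path, then serves the PDF.

    record_hit is a hypothetical callback (an assumption of this sketch);
    in practice it might write to a log or call an analytics service.
    """
    def app(environ, start_response) -> Iterable[bytes]:
        # No page-tagging script runs when a PDF is requested directly,
        # so the download is recorded on the server before the file is sent.
        record_hit(environ.get("PATH_INFO", ""))
        start_response("200 OK", [
            ("Content-Type", "application/pdf"),
            ("Content-Length", str(len(pdf_bytes))),
        ])
        return [pdf_bytes]
    return app
```

The point of the design is simply that the counting step happens on the server, so it does not depend on a browser executing tracking code embedded in an HTML page.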
Internet search engines dominate general information-seeking behavior of users, and Google is by far the most popular search engine, consistently grabbing 65 percent share of the “explicit core search market” (Comscore, 2011). Bing powers Microsoft and Yahoo! search sites, capturing another 30 percent of market share. The dominance of search engines is also apparent in the academic sector. A 2005 survey by OCLC demonstrated that 89 percent of college students began their research with internet search engines, and that only 2 percent began at library websites (DeRosa and OCLC, 2005). A repeat of that survey five years later demonstrated that the situation for libraries had only worsened, as 0 percent of respondents reported visiting library websites at the outset of their research (DeRosa et al., 2010). That same report saw a slight drop in traditional search engine use, but also noted for the first time the use of social media search engines for initial research. Another 2005 survey in the UK found that “students prefer to locate information or resources via a search engine above all options, and Google is the search engine of choice” (Griffiths and Brophy, 2005). The information-seeking behaviors of young academic researchers in Sweden displayed an “almost complete dominance of Google as a starting point for searching scientific information” (Haglund and Olsson, 2008).
Faculty search behavior is similar. A study of active faculty researchers at four major universities reported that “researchers find Google and Google Scholar to be amazingly effective” for their information retrieval needs and accept the results as “good enough in many cases” (Kroll and Forsman, 2010). Rieger reports a high degree of use and satisfaction with internet search engines. She notes that “both faculty and students prefer search engines over other resources to support their academic work” and that “there is a broader awareness of specialized Google tools such as Google Scholar and Google Book among faculty members and graduate students” (Rieger, 2009). In a comparison of GS to Web of Science, Mikki states that “the amount of qualified scholarly content has increased considerably in Google Scholar since it was launched in 2004,” and that it has developed into a serious research and citation study tool that should be included in information literacy programs (Mikki, 2009).
A review of the literature pertaining to SEO in libraries reveals that much of the published research deals with general websites (e.g. Cahill and Chalut, 2009; Rushton et al., 2008). The minimal research dealing with digital repositories sometimes concludes by suggesting that content be replicated outside the database in a static format in order to make it friendlier to search engines, a method that seems arcane and burdensome, but may have been the best option at the time. “Unless links are located on a static web page, crawlers won't find them, and many such links are not followed” (DeRidder, 2008). Page rank in search engines is another factor that plays into repository visibility. Malaga has shown that 62 percent of users click only on results that appear in the first search engine results page (Malaga, 2008). The high use of internet search engines as primary search mechanisms suggests that digital repositories created by libraries are likely to be nearly invisible to users if their contents are not indexed in these search engines.
Search engine and metadata optimization for institutional repositories are also addressed only minimally in the published literature, and the value and use of GS is sometimes questioned. McKay offers that “authors are quite right in perceiving [IRs] as ‘islands of information,’ […] a condition that can be addressed by search-engine harvesting […]” She goes on to say “Google Scholar is not usually the first information source” consulted by academics, though that may have been truer when the article was published in 2007, when GS was relatively new and contained much less content (McKay, 2007). Increased use of GS is demonstrated in a more recent University of Mississippi study in which use rose from 4 to 27 percent of major library databases over a four-year period (Herrera, 2010). A 2006 article on optimizing metadata for search engines acknowledges “the problem may not lie with the search engines but with the data providers,” and introduces the concept of “data shoogling” to offer more Google-friendly metadata in digital collections (Dawson and Hamilton, 2006). It does not, however, specifically address institutional repositories or GS. A survey of 540 librarians at 108 ARL libraries notes complaints of “inadequate use of metadata by Google Scholar” (Drewry, 2007), which may support this paper's hypothesis that GS does not find metadata supplied by libraries to be appropriately structured or unique.
Beel, Gipp, and Wilde offer related and significant strategies for optimizing academic papers themselves for better inclusion in search engines, and in GS in particular. Their advice includes optimizing graphics for indexing purposes, writing relevant document titles, and selecting appropriate keywords (Beel et al., 2010). While optimization of the academic papers themselves merits continued exploration and testing, it is beyond the scope of this article.
Background on SEO research for digital repositories at the University of Utah
Digital repositories of every type face a common challenge: having their content found by interested users in a crowded sea of information on the internet. Getting found means the repository items must be included in the indexes of major search engines, because that is where the vast majority of users start looking. Unfortunately, many digital repositories show poorly in the results from major search engines. In 2010 the authors conducted a survey of 650 known objects across the thirteen repositories of the Mountain West Digital Library (MWDL), and revealed a disturbing pattern: only 38 percent of digital objects searched by title were found in Google's index. Worse, this Google search engine results page (SERP) consisted mostly of links back to a search results screen in the local repository, rather than linking directly to the objects. Only 15 percent of the hits on the SERP provided users with direct links to the objects. The known-item title searching method employed by the survey probably produced the best results possible at the time; searching by keyword or subject term would likely have presented even fewer items from the repositories of the libraries and archives in the MWDL.
The technical challenges that contribute to poor search engine indexing of digital repositories include:
- conflicts between sitemaps and robots.txt files;
- slow server response time;
- dead links or failure to provide appropriate redirects;
- poor application of metadata, including re-use of the same metadata terms for multiple objects; and
- metadata schemas deemed unacceptable by the specific search engine.
Additional challenges with search engine optimization for digital repositories may be framed as administrative. These include:
- aligning the goals of the digital library with institutional goals;
- informing, training, motivating, and coordinating staff from various departments;
- establishing an environment of continuous monitoring and addressing crawler errors as they arise; and
- institutionalizing tools to analyze metrics, and using them to inform and convince stakeholders of the impact of the digital library.
Results at Utah
Figure 1 shows increases in average indexing ratios for all digital collections over a period of 15 months. It also shows improvements in the highest indexing ratio achieved for collections with more than 500 URLs.
Increased indexing ratios have thus far led to a 200 percent increase in referrals from Google, and an 80 percent increase in visits to all digital collections. Indexing ratios of USpace, the University of Utah's IR, have also increased from approximately 18 percent to 98 percent, but only in Google, not in Google Scholar (see Figure 2).
Open access and institutional repositories
The open access movement was launched to improve access to publicly funded research, and to help libraries deal with rampant inflation in journal subscription prices. According to Peter Suber the open access movement is dependent on internet technologies and the consent of the author or copyright holder (Suber, 2004). Institutional repositories were one product of this movement; they capture the intellectual output of the faculty, staff, and students of universities or academic disciplines, and assure perpetual and free access to that output (barring embargo periods or other publisher restrictions). IRs often include electronic theses and dissertations (ETD), and most are managed by academic libraries and some by scholarly societies. Over the past decade IRs have variously enjoyed advances and suffered setbacks, but through the consistent work of many individuals at numerous institutions they are achieving enough mass to become viable sources of research publications. They also hold the promise of contributing significantly to author citation rates. Recent research in the UK suggests that institutional repositories may play a crucial role in measuring research output, and in turn may affect university rankings (Key Perspectives and Brown, 2009). The Times Higher Education publishes an annual ranking of the top world universities, and research citations contribute 32.5 percent toward each university's score (The Times Higher Education, 2010).
Libraries have not developed a mechanism to aggregate and search IRs, and thus GS has become the best de facto search engine available for IR content. But just as institutional repositories are gaining enough mass to make them useful and credible sources of research output, the difficulties associated with SEO threaten to undermine their potential. Faculty and other authors who contribute publications to IRs may lose interest if their publications can't be located (and cited) in academically-oriented search engines like GS.
Surveys of IR indexing ratios in Google Scholar
In October and December 2011 the authors conducted two surveys of institutional and disciplinary repositories to arrive at a preliminary determination of how well GS was indexing them. The IRs were identified through the Directory of Open Access Repositories, also known as OpenDOAR (University of Nottingham, 2011).
Only institutional or disciplinary repositories housed in the USA were selected for these surveys. They were chosen for their academic content, and to represent an approximate real-world distribution of several repository software types: DSpace, Digital Commons, EPrints, IR+, CONTENTdm, DigiTool, and arXiv (see Table I). While there are a number of other software types in use, many of them are not found in the USA. Some repositories found in OpenDOAR were ruled out because it was immediately obvious that they included other types of non-IR digital collections, such as photographs. According to OpenDOAR the arXiv repository software is used only by arXiv, but it was included in survey 1 because of its size and importance to the scientific community (see Table I for a complete listing of the repositories selected for survey 1).
Search engine indexing is a dynamic environment. Crawlers return to repositories periodically to pick up new additions, sometimes discarding items if they run into errors, and the repositories themselves are (hopefully) continually growing. Therefore these surveys should be understood to be a snapshot from a specific moment in time.
OpenDOAR records list the number of items in most repositories, but those figures are usually outdated. The authors determined the current number of repository items from figures available on the sites themselves, and in one case by contacting the repository manager. DSpace repositories make it easy to browse by title to reveal all the items in the repository. In the case of Digital Commons, a dynamic script posts the current total items in the repository. EPrints repositories had a page that listed the number of items by type, the sum of which represented all the items in the repository. Other sites offered similar methods of determining the total number of items contained in the repository.
Survey 1 Methodology
In the first survey searches were conducted to determine the number of items indexed by GS from a given repository by using the “site” operator, i.e. search queries in GS were structured in the following manner: “site:repositoryURL.” This operator must be used with caution, because in GS it only searches the primary versions of academic papers. In other words, a paper that has been formally published in a journal will be considered the primary version. Additional versions of that paper, including those that appear in IRs, may be indexed by GS, but are considered other versions and will only be revealed by clicking the “versions” link (see Figure 3). Because the other versions do not appear on the initial search results page, it is incorrect to assume that the number of results of a search using the site operator shows all the items that GS has indexed from that repository.
The data from this survey confirm a low average primary publication indexing ratio of only 30 percent (see Table I and Figure 4). The authors do not mean to cast aspersions on other repositories; they are fully aware that their own IR (USpace) currently shows a near zero percent indexing ratio in GS.
Survey 2 Methodology
In the second survey the authors used a similar approach to the one they had employed in 2010 to survey the repositories of the Mountain West Digital Library, i.e. they searched in GS for known repository items by their titles. This method is, of course, slower and more laborious, but it is also more accurate, allowing articles to be counted whether they appear as the primary link in the initial list of results or are hidden behind the “versions” link.
The authors created a data set for seven repositories from survey 1 by using crawler software to harvest titles from each repository. This method mimicked the process used by internet search engine crawlers, and collected 500 to 1,400 article titles from each repository and saved them into Excel spreadsheets. In some cases, scholarly papers in the IRs were easy to identify and entire collections could be crawled. In other cases, it was difficult to isolate the publicly available scholarly papers for an automated crawler because the repositories do not follow the GS recommendations. These difficulties in crawling the IR resulted in less than optimal sampling of titles within the IR collection; in fact the sample may have been biased to favor a higher indexing ratio because the authors made efforts to harvest only academic papers. Using a sampling methodology developed for verifying database backups (LaRock, 2010) those titles were then randomized, and forty titles from each set were searched by copying article titles from the spreadsheets and pasting them into the GS search box. The authors used Zotero to create metadata records and snapshots for each search result, whether the article was found or not. “Versions” links were followed whenever found and the resulting screen was also captured as a snapshot attached to the same metadata record in Zotero.
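The randomization and selection step can be sketched in a few lines of Python. This is a plain simple random sample, offered only as an illustration; it is not claimed to reproduce the LaRock (2010) procedure, and the function name sample_titles and the seed parameter are assumptions introduced here.

```python
import random


def sample_titles(titles, sample_size=40, seed=None):
    """Draw a simple random sample of harvested titles for manual GS searches.

    Blank entries and exact duplicates are dropped first so that each
    sampled title corresponds to one distinct search.
    """
    cleaned = (t.strip() for t in titles)
    unique = list(dict.fromkeys(t for t in cleaned if t))
    if len(unique) <= sample_size:
        return unique
    # A seeded generator makes the sample reproducible for later audits.
    rng = random.Random(seed)
    return rng.sample(unique, sample_size)
```

Recording the seed alongside the saved search snapshots would let a later reviewer regenerate the same 40 titles from the same harvested list.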
Of the seven repositories that were sampled, three showed very high indexing ratios (88-98 percent), while the other four showed ratios below 50 percent (see Table II). A discussion about the likely reasons for these differences follows in the section titled “Survey Conclusions.”
The first survey had limitations in terms of calculating a complete index ratio for each IR. However, since use of the site operator in GS reveals only the primary versions of the articles, the average indexing ratio of 30 percent indicates that most IRs do not contain very many primary articles. This raises some interesting questions about the purpose of IRs. Specifically, how much value is really derived from having pre-prints in the IR, given the amount of labor required to put them there, particularly if the primary publisher is open access as well? On the other hand, IRs that largely contain grey literature that is not published elsewhere will likely see a higher indexing ratio with GS precisely because those are the primary articles.
Data from the second survey are much more interesting. Because the authors used crawler software to harvest article titles, they encountered many of the same problems that internet search engine crawlers face when trying to harvest institutional repositories. The crawling and indexing guidelines shown in Table II were drawn from stated requirements and recommendations from GS's Webmaster Inclusion Guidelines website (Google Scholar, 2010). In general, IRs that followed these guidelines had a much higher indexing ratio (88-98 percent) than sites that did not (38-48 percent). For the purposes of this paper, the most validating differences were found in the expression of publisher metadata schemas (Bepress, Highwire Press, PRISM, or Eprints) in the meta tags within the head element of the HTML display pages (see Figure 5). Those repositories that did not make their metadata available in one of the recommended publisher schemas within the HTML meta tags generally fared much more poorly than those that did. Further, the repositories that offered absolute URLs to the PDF files for their documents also had far higher indexing ratios than those that did not. Finally, improving crawler efficiency by providing chronological listings of papers, recently added papers, and a limited number of clicks to publicly available scholarly papers also seemed to positively affect indexing ratio.
GS makes specific recommendations for IR software on its Inclusion Guidelines for Webmasters site (quoted below), but the surveys in this paper demonstrate that software makes little or no difference; the problem cuts across institutions, repository focus, and repository software. Instead, indexing ratio success has much more to do with how carefully a repository follows the guidelines described above. The GS recommendation reads:
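A repository manager could check a display page for these conditions with a short script. The sketch below uses only the Python standard library; the tag-name prefixes and the citation_pdf_url tag name reflect the authors' reading of the GS guidelines as understood here and should be treated as assumptions to verify against the current guidelines, and the function names are hypothetical.

```python
from html.parser import HTMLParser

# Assumed name prefixes for the four publisher schemas (verify before use).
SCHEMA_PREFIXES = ("citation_", "bepress_citation_", "eprints.", "prism.")


class _MetaCollector(HTMLParser):
    """Collect (name, content) pairs from meta tags using a publisher schema."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        if name.startswith(SCHEMA_PREFIXES):
            self.tags.append((name, attributes.get("content") or ""))


def publisher_meta_tags(html):
    """Return publisher-schema meta tags found in an HTML page."""
    collector = _MetaCollector()
    collector.feed(html)
    return collector.tags


def has_absolute_pdf_url(tags):
    """True if a citation_pdf_url tag carries an absolute http(s) URL."""
    return any(
        name == "citation_pdf_url" and content.startswith(("http://", "https://"))
        for name, content in tags
    )
```

A page that exposes only Dublin Core tags would return an empty list from publisher_meta_tags, which corresponds to the group of repositories that fared poorly in survey 2.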
If you're a university repository, we recommend that you use the latest version of Eprints (eprints.org), Digital Commons (digitalcommons.bepress.com), or DSpace (dspace.org) software to host your papers. If you use a less common hosting product or service, or an older version of these, please read the rest of this document and make sure that your website meets our technical guidelines (Google Scholar, 2010).
Why Google Scholar has difficulty with institutional repositories
Librarians are great believers in standards, and while building digital repositories they have dutifully followed them for scanning, metadata creation, harvesting, and web services, among others. Search engines, however, are not required to honor standards. For example, in August 2008 Google announced that it was “Retiring support for OAI-PMH in Sitemaps” (Mueller, 2008), causing consternation across the library community. Two years later, GS made the following announcement on its Webmaster inclusion guidelines site: “Use Dublin Core tags (e.g. DC.title) as a last resort – they work poorly for journal papers […]” (Google Scholar, 2010).
Although Dublin Core is recognized to be a standard of the lowest common denominator, libraries have used it widely for most digital repositories, including IRs. The Dublin Core schema works “poorly for journal papers” because it does not include adequate fields for citation data and because it is interpreted inconsistently. Citation information such as journal name, volume and issue number, and page span of the article is usually entered into a single field, such as DC.Relation or DC.Source in simple Dublin Core, and there is no specified format or consistency. This makes it difficult for a search engine like GS to accurately parse and index the data as individual bibliographic components. The Dublin Core Metadata Initiative website (DCMI, 2005) does include guidelines for encoding bibliographic citation information using a qualification of the DC.Identifier field (called “bibliographicCitation”) but this is still only a single field. It is also unlikely that many repositories have updated to reflect the relatively recent development of DC Qualifiers. Dublin Core also does not facilitate various academic paper types: there is no specific field to distinguish a pre-print from a journal article, a book chapter from a book, a working paper from a conference proceeding, or a dissertation. In short, libraries are not focusing enough on making metadata machine-readable.
Instead of Dublin Core, GS recommends using one of the following schemas: Highwire Press, Eprints, Bepress, or PRISM. These schemas are more adept at structuring citation data appropriately. Highwire Press, a division of Stanford University, developed its schema for journal articles and GS extended the tags to cover additional academic paper types, such as working papers, dissertations, manuscripts, conference papers, books and book chapters. The authors used the extended Highwire Press tags in their pilot projects to test the hypothesis that transforming metadata would lead to an increase in indexing ratio in GS for an IR.
Because USpace had virtually no presence in GS, the authors began to devise methods to modify USpace metadata to fit the recommendations. GS explains how Highwire Press tags could map to Dublin Core fields (Google Scholar, 2010). Thus the first step was to begin aligning existing Dublin Core fields with those mappings (see Table III).
The indexing ratio for USpace at the University of Utah prior to the pilot (July 5, 2010) was poor, at best, and can be summarized as follows:
- Index ratio for the three primary USpace IR collections containing 6,482 papers:
- ranged between 4 percent and 23 percent within Google;
- average overall Google Index Ratio was 18.33 percent (1,188/6,482); and
- index ratio within GS was less than 0.1 percent.
The following steps were taken to address the poor indexing ratio:
- Sitemaps representing three IR collections were submitted through Google Webmaster Tools:
- A total of 6,482 URLs were submitted:
- – Each collection contained between 500 and 4,200 academic papers.
- Errors generated during Google crawls were analyzed using Webmaster Tools and improvements were made:
- improved server performance;
- implemented unique title and description tags containing the paper's name and abstract, respectively; and
- implemented “rel=canonical” tags, indicating the preferred URL of each digital object (there were often multiple URLs pointing to each paper).
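Two of the steps above, submitting sitemaps and declaring a canonical URL, lend themselves to a brief illustration. The following is a minimal sketch using the Python standard library and the public sitemap protocol namespace; the function names are hypothetical, and the script used at Utah is not described in this paper.

```python
import xml.etree.ElementTree as ET
from html import escape

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return a sitemap XML string listing one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = address
    return ET.tostring(urlset, encoding="unicode")


def canonical_link(preferred_url):
    """Return the link element that names the preferred URL of an item."""
    return f'<link rel="canonical" href="{escape(preferred_url, quote=True)}">'
```

When several URLs resolve to the same paper, each of those pages would carry the same canonical link, while the sitemap would list only the preferred URL.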
To address the metadata requirements per the Google Scholar inclusion guidelines the authors did the following:
- Mapped Dublin Core to Google-supported Highwire Press tags:
- Extended Dublin Core fields according to GS recommendations:
- – journal volume (DC.volume);
- – journal issue (DC.issue);
- – starting page number (DC.citation.spage); and
- – ending page number (DC.citation.epage).
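A crosswalk of this kind can be expressed as a simple lookup table. The sketch below is illustrative only and does not reproduce Table III: the four extended field names come from the list above, while `DC.title`, `DC.creator`, and `DC.issued` are assumed field names. The `citation_*` names follow the Highwire Press tag naming described in the GS inclusion guidelines.

```python
from html import escape

# Hypothetical crosswalk from Dublin Core fields to Highwire Press tag names.
DC_TO_HIGHWIRE = {
    "DC.title": "citation_title",               # assumed field name
    "DC.creator": "citation_author",            # assumed field name
    "DC.issued": "citation_publication_date",   # assumed field name
    "DC.volume": "citation_volume",
    "DC.issue": "citation_issue",
    "DC.citation.spage": "citation_firstpage",
    "DC.citation.epage": "citation_lastpage",
}


def highwire_meta_tags(record):
    """Render one HTML meta tag per value of each mapped Dublin Core field."""
    tags = []
    for dc_field, highwire_name in DC_TO_HIGHWIRE.items():
        values = record.get(dc_field, [])
        if isinstance(values, str):
            values = [values]  # allow a single value as well as a list
        for value in values:
            tags.append(
                f'<meta name="{highwire_name}" content="{escape(value)}">'
            )
    return tags


# Made-up record, used only to show the output shape.
record = {
    "DC.title": "An example paper",
    "DC.creator": ["Author, A.", "Author, B."],
    "DC.volume": "12",
    "DC.citation.spage": "101",
    "DC.citation.epage": "115",
}
print("\n".join(highwire_meta_tags(record)))
```

Repeating the author tag once per author, rather than joining names into one string, keeps each name separately parseable.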
A total of 20 papers were selected for a pilot:
- Verified metadata was accurate and mapped correctly to the HTML “meta name=” fields on display templates as understood from GS inclusion guidelines (see Table III and Figure 6).
- Ensured each of the 20 papers had a full-text PDF that met GS inclusion guideline requirements.
- Embedded the metadata schema directly into five of the PDF files of the papers.
- Provided a “landing page,” per GS inclusion guidelines, that was within a few clicks of the home page and contained links to the 20 IR pilot papers. For each paper, the landing page linked to both its HTML page and its full-text PDF.
The experiment delivered a significant increase in the Google index ratio for the IR collections (see Figure 2), and as of October 16, 2011 the Google index ratio for the IR collections was 97.82 percent (10,306/10,536). However, there was no effect on the IR's GS index ratio. In fact, not one of the 20 USpace papers that had been isolated and optimized was included in the GS index.
During the summer of 2011 the authors consulted with OCLC and Google Scholar with the aim of developing and testing a second pilot project. Nineteen papers from USpace were selected for the second pilot:
- Six of the seven GS paper types were represented, and a full-text PDF document was included for each paper. The book paper type was out of scope for this pilot (see Appendix for examples of each paper type):
- dissertation and thesis;
- conference article;
- working paper;
- manuscript and pre-print;
- journal article; and
- book chapter.
- CONTENTdm v6.0 display templates were augmented:
- embedded Highwire Press meta tags in the HTML page header of display templates using an automated script (see Figure 7);
- created a browse by year page that provided links to papers in chronological order of publishing date; and
- created a recently added page that listed papers added to the IR within the last 30 days.
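The two navigation pages described above give crawlers short, predictable paths to every paper. A minimal sketch of how such listings could be generated is shown below. It is not the CONTENTdm template code, and the record keys (`title`, `url`, `published`, `added`) are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date, timedelta
from html import escape


def browse_by_year(papers):
    """Return HTML listing papers grouped by publication year, oldest first."""
    by_year = defaultdict(list)
    for paper in papers:
        by_year[paper["published"].year].append(paper)
    lines = []
    for year in sorted(by_year):
        lines.append(f"<h2>{year}</h2>")
        lines.append("<ul>")
        for paper in sorted(by_year[year], key=lambda p: p["published"]):
            link = f'<a href="{escape(paper["url"])}">{escape(paper["title"])}</a>'
            lines.append(f"<li>{link}</li>")
        lines.append("</ul>")
    return "\n".join(lines)


def recently_added(papers, today, days=30):
    """Return the papers added to the repository within the last `days` days."""
    window = timedelta(days=days)
    return [p for p in papers if timedelta(0) <= today - p["added"] <= window]


# Made-up records, used only to show how the functions are called.
papers = [
    {"title": "Older paper", "url": "/paper/1",
     "published": date(2009, 5, 1), "added": date(2011, 9, 1)},
    {"title": "Newer paper", "url": "/paper/2",
     "published": date(2011, 1, 15), "added": date(2011, 10, 10)},
]
print(browse_by_year(papers))
print([p["title"] for p in recently_added(papers, today=date(2011, 10, 16))])
```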
The second pilot was a moderate success, with 62 percent of papers indexed on the first harvest. However, due to unexpected campus network and power outages that took down the test server for an extended period, the pilot was cut short and the results were dropped from GS's index.
For the third and final pilot project, the authors uploaded 56 papers with full-text PDF files, and transformed the Dublin Core metadata to Highwire Press tags as described earlier. The same six paper types were represented as before. This time more than 90 percent appeared in the GS index after four weeks. Continuing conversations with GS and OCLC will help address lingering issues, but the authors consider this success to be a significant breakthrough.
The thought of manually transforming metadata for an IR might induce nausea in repository managers. Fortunately, the IMLS NLG grant recently awarded to the University of Utah intends, as one of its deliverables, to help address this problem. OCLC is a partner in the grant and will develop formal crosswalks between Dublin Core and one or more of the publishing industry schemas recommended by GS. Automated transformation and linked data mechanisms will also be developed to minimize the work required to express citation data more effectively for indexing. The products of that grant will be published in a toolkit by 2014 or sooner.
Transforming metadata to GS-preferred metadata schemas is very likely to raise the indexing ratio of IRs. The second and third pilot projects described in this paper were successful, demonstrating that transforming Dublin Core metadata tags into more precise bibliographic Highwire Press tags increased the GS indexing ratio of the sample data set from 0 percent to 62 percent in the second pilot, and then to more than 90 percent in the third. The authors are cautiously optimistic that continuing discussions with GS and OCLC will eliminate most remaining indexing problems. Transforming metadata to EPrints, PRISM, and Bepress schemas is also likely to have a positive effect, though this assertion will require additional testing.
The low indexing ratio of IRs in GS cuts across institutions and repository software. Despite GS's endorsement of three software packages, the surveys conducted for this paper demonstrate that software is not a deciding factor for indexing ratio in GS. Each of the three recommended software packages showed good indexing ratios for some repositories and poor ratios for others. Rather, the major deciding factors seem to lie in:
- whether the IR has provided crawlers an efficient method to access its scholarly papers; and
- whether acceptable metadata schemas are provided that offer precise bibliographic information within the HTML page header tags.
While transforming metadata seems to be an effective route to getting indexed, individual IRs may have additional SEO-related problems that must be addressed as well. Slow or misconfigured servers, failure to submit viable sitemaps, crawler errors that remain unresolved, failure to provide appropriate server response codes, lack of communication across the organization, and a host of other potential problems must be considered for effective SEO that will raise repositories' visibility in all search engine indexes. Advanced methods for optimizing PDF files may also help to assure inclusion in the GS index. More research and testing is needed, but it is fair to say that a crawler-friendly repository will fare much better in GS than one that poses difficulties to crawlers. Upgrading to current repository software packages may help in this endeavor as product development teams become aware of and address SEO issues.
The growing use of GS by researchers underscores the need to address the problem of low IR indexing ratio. As the economic recession has tightened university budgets, more emphasis is being placed on assessment and measurement of outputs. IRs have the potential to raise author citation rates, and in turn to affect university rankings, but this potential may be seriously hampered if IR content is redundant or invisible to researchers who use GS.
Figure 1. Google index ratio improvement for general digital collections at Utah
Figure 2. Increase in USpace indexing ratios in Google
Figure 3. Google Scholar search result showing link to other versions of the paper
Figure 4. Survey 1 results showing indexing ratios of repository primary publications
Figure 5. Example of HTML meta tags using the Bepress schema
Figure 6. Converting bibliographic data
Figure 7. Highwire Press tags embedded in HTML headers
Table I. Survey 1 of IRs showing primary publication version indexing ratios
Table II. Survey 2 indexing ratios for seven institutional repositories
Table III. Map used in first GS pilot
Table AI. Highwire Press metadata mappings for seven paper types
Table AII. Highwire Press metadata mappings for seven paper types
Table AIII. Highwire Press metadata mappings for seven paper types
Table AIV. Highwire Press metadata mappings for seven paper types
Indexing ratio is defined here as the number of unique URLs from a given repository found in a search engine's index divided by the total number of URLs in the repository.
(Open Archives Initiative Protocol for Metadata Harvesting, a common standard for sharing metadata in the library community).
USpace added a second theses and dissertations collection after the first GS pilot was started in July, 2010.
Assisi, F.C. (2005), "Anurag Acharya helped Google's scholarly leap", INDOlink – Science & Technology, available at: www.indolink.com/SciTech/fr010305-075445.php (accessed 13 October 2011).
Beel, J., Gipp, B., Wilde, E. (2010), "Academic search engine optimization", Journal of Scholarly Publishing, Vol. 41 No.2, pp.176-90.
Cahill, K., Chalut, R. (2009), "Optimal results: what libraries need to know about Google and search engine optimization", The Reference Librarian, Vol. 50 No.3, pp.234-47.
Comscore (2011), "comScore releases September 2011 US search engine rankings", available at: www.comscore.com/Press_Events/Press_Releases/2011/10/comScore_Releases_September_2011_U.S._Search_Engine_Rankings (accessed 22 October 2011).
Dawson, A., Hamilton, V. (2006), "Optimising metadata to make high-value content more accessible to Google users", Journal of Documentation, Vol. 62 pp.307-27.
DCMI (2005), "Guidelines for encoding bibliographic citation information in Dublin Core metadata", Dublin Core Metadata Initiative, available at: http://dublincore.org/documents/dc-citation-guidelines/ (accessed 26 October 2011).
DeRidder, J.L. (2008), "Googlizing a digital library", The Code4Lib Journal, No.2, available at: http://journal.code4lib.org/articles/43 (accessed 5 October 2011).
DeRosa, C., OCLC (2005), Perceptions of Libraries and Information Resources: A Report to the OCLC Membership, OCLC Online Computer Library Center, Dublin, OH.
DeRosa, C. (2010), Perceptions of Libraries, 2010: Context and Community, OCLC Online Computer Library Center, Inc., Dublin, OH, available at: www.oclc.org/reports/2010perceptions.htm (accessed 4 October 2011).
Drewry, J.M. (2007), Google Scholar, Windows Live Academic Search and Beyond: A study of new tools and changing habits in ARL libraries, University of North Carolina at Chapel Hill, Chapel Hill, NC, available at: http://etd.ils.unc.edu/dspace/handle/1901/429 (accessed 21 October 2011).
Google (2011), "Google Webmaster Central", available at: www.google.com/webmasters/ (accessed 29 October 2011).
Google Scholar (2010), "Inclusion Guidelines for Webmasters", available at: http://scholar.google.com/intl/en/scholar/inclusion.html (accessed 4 October 2011).
Griffiths, J.R., Brophy, P. (2005), "Student searching behavior and the web: use of academic resources and Google", Library Trends, No.Spring, pp.539-54.
Hagans, A. (2005), "High accessibility is effective search engine optimization", A List Apart, available at: www.alistapart.com/articles/accessibilityseo (accessed 4 October 2011).
Haglund, L., Olsson, P. (2008), "The impact on university libraries of changes in information behavior among academic researchers: a multiple case study", The Journal of Academic Librarianship, Vol. 34 No.1, pp.52-9.
Herrera, G. (2010), "Google Scholar users and user behaviors: an exploratory study", College and Research Libraries, available at: http://crl.acrl.org/content/early/2010/07/23/crl-125rl.abstract (accessed 4 October 2011).
Key Perspectives, Brown, S. (2009), "A comparative review of research assessment regimes in five countries and the role of libraries in the research assessment process: a pilot study commissioned by OCLC Research", OCLC Research, Dublin, OH.
Kroll, S., Forsman, R. (2010), A Slice of Research Life: Information Support for Research in the United States, OCLC Research, Dublin, OH.
LaRock, T. (2010), "Statistical sampling for verifying database backups", simple-talk, available at: www.simple-talk.com/sql/database-administration/statistical-sampling-for-verifying-database-backups/ (accessed 10 December 2011).
McKay, D. (2007), "Institutional repositories and their ‘other’ users: usability beyond authors", Ariadne, No.52, available at: www.ariadne.ac.uk/issue52/mckay/ (accessed 15 October 2011).
Malaga, R.A. (2008), "Worst practices in search engine optimization", Communications of the ACM, Vol. 51 No.12, pp.147.
Mikki, S. (2009), "Google Scholar compared to Web of Science: a literature review", Nordic Journal of Information Literacy in Higher Education, Vol. 1 No.1, pp.41-51.
Mueller, J. (2008), "Retiring support for OAI-PMH in Sitemaps", Official Google Webmaster Central Blog, available at: http://googlewebmastercentral.blogspot.com/2008/04/retiring-support-for-oai-pmh-in.html (accessed 19 October 2011).
OCLC (2011), "CONTENTdm Digital Collection Management Software", CONTENTdm (OCLC – Digital Collection Services), available at: www.oclc.org/contentdm/default.htm (accessed 27 October 2011).
Rieger, O.Y. (2009), "Search engine use behavior of students and faculty: user perceptions and implications for future research", First Monday, Vol. 14 No.12, available at: http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2716/2385 (accessed 21 October 2011).
Rushton, E.E., Kelehan, M.D., Strong, M.A. (2008), "Searching for a new way to reach patrons: a search engine optimization pilot project at Binghamton University Libraries", Journal of Web Librarianship, Vol. 2 No.4, pp.525-47.
Suber, P. (2004), "Very brief introduction to open access", available at: www.earlham.edu/~peters/fos/brief.htm (accessed 15 October 2011).
(The) Times Higher Education (2010), "The Times Higher Education World University Rankings 2010-2011", available at: www.timeshighereducation.co.uk/world-university-rankings/ (accessed 4 October 2011).
University of Nottingham (2011), "OpenDOAR – Home Page – Directory of Open Access Repositories", available at: http://opendoar.org/ (accessed 12 October 2011).
About the authors
Kenning Arlitsch is Associate Dean for IT Services at the J. Willard Marriott Library, University of Utah. He recently completed a 12-month sabbatical, during which he conducted research with OCLC on search engine optimization and network level library technologies. Mr Arlitsch began building the Marriott's digital library program in 1999, and founded the multi-state Mountain West Digital Library, the Utah Digital Newspapers program, and co-founded the Western Soundscape Archive. His department is responsible for digitization, interface design and development, ILS, repository management, and server infrastructure for the library and its extended digital programs. He holds a BA in English from Alfred University in New York, and a Master's degree in Library and Information Science from the University of Wisconsin-Milwaukee. He is also a graduate of the Frye Leadership Institute (2005) and of the Research Libraries Leadership Fellows program (2009), sponsored by the Association of Research Libraries. Kenning Arlitsch is the corresponding author and can be contacted at: firstname.lastname@example.org
Patrick S. O'Brien is an expert in customer focused, data driven sales and marketing operations. He specializes in the use of new media channels and internet marketing to increase product visibility, acquire new customers, and improve customer satisfaction. He first began incorporating search engine optimization (SEO) into demand generation marketing programs in 1997. He is a former Accenture Strategy Consultant with over 15 years' experience working with business executives on converting marketing strategy into actionable results within the pharmaceutical, biotechnology, healthcare, financial services and telecommunications industries. Mr O'Brien holds a BA in Economics from UCLA and an MBA in Marketing and Finance from The University of Chicago, Booth School of Business.