About the Digital Pilot Project

The Digital Research Library (DRL) of the University of Pittsburgh's Library System supports the teaching and research mission of the University by serving users through the creation and maintenance of Web-accessible digital research collections. The DRL has mounted several digitized text collections to disseminate English language material of historical significance to a global audience. This pilot project enabled the department to expand its scope to include foreign language material.

Pilot Project Goals & Objectives

The DRL's main purpose for partaking in the pilot project was to test whether its current array of tools for converting and delivering texts in a digital environment could be successfully applied to a small set of non-Western print material (i.e., Chinese). To reach this goal, the DRL outlined several primary objectives:

  • Perform a detailed analysis of the inherent structure of the texts selected by the East Asian Library (EAL) to guide in the creation of structural metadata and pagination;
  • Modify existing production processes to accommodate non-Western material;
  • Romanize bibliographic and structural portions of the texts into Pinyin (ASCII);
  • Convert the texts into digital image files, adhering to internationally-recognized imaging standards;
  • Create a website for the global dissemination of the texts;
  • Provide visitors to the website with the ability to:
    • View and print the digital pages;
    • Browse the texts by subject and title;
    • Navigate a hyperlinked table of contents for each text; and
    • Search the bibliographic and structural elements of the texts.

The DRL did not attempt to convert the page images into full text (Unicode), either through a Chinese-character OCR engine or by re-keying the texts, because it is presently unable to offer full-text search and retrieval of such text via the middleware.

Metadata Capture and Review

In order to facilitate the browsing, viewing and searching of the bibliographic information of the texts, the DRL acquired and created appropriate metadata. This consisted of two key components: a MARC record, and structural and descriptive information for each text. This metadata, used for creating the SGML version of the texts, was originally planned to be Romanized into Pinyin for indexing by the search engine. As it turned out, the EAL decided not to transliterate the structural headings from vernacular Chinese into Pinyin, but rather translate the structural headings into English.

The EAL sent the DRL MARC records for each of the texts. If a MARC record did not exist for a particular text, a cataloger in EAL created or revised one. MARC records were used both as an access point and bibliographic identifier (title, author and publishing information in Pinyin and subject terms in English).

A Chinese student enrolled in the graduate School of Information Science completed a field placement in the DRL in the summer of 2004 to assist in the analysis and capture (translation) of the metadata and in the website design. The student intern worked with the EAL to identify and record the types of structural information present in the texts, such as front matter, title page, table of contents, list of illustrations, chapter or section headings, figures, index, and back matter. The student created detailed metadata guidelines to document the workflow and process.

The student intern systematically examined each text and entered its corresponding structural metadata (in English, not Pinyin) using a Web form to record the image filename, document type and corresponding titles, and any additional metadata. The Web form stored its entries in a MySQL database, which was later used in the automated creation of the SGML files.
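The step from captured metadata to SGML can be pictured with a minimal Python sketch. It is an illustration only, not the DRL's actual scripts: the PageRecord fields mirror the Web form fields named above, while the element and attribute names (TEXT, PAGE, HEAD, SEQ, IMG, TYPE) and the sample filenames are hypothetical, and the records are held in a plain list rather than read from MySQL.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass
class PageRecord:
    """One row of structural metadata, as captured through the Web form."""
    image_filename: str   # hypothetical naming scheme, e.g. "00000001.tif"
    doc_type: str         # e.g. "title page", "table of contents", "chapter"
    title: str = ""       # English translation of the heading, if any


def build_sgml(text_id: str, records: list[PageRecord]) -> str:
    """Render page records as simple SGML-style markup.

    Element and attribute names are illustrative assumptions, not the
    markup actually used by the DRL.
    """
    lines = [f"<TEXT ID={quoteattr(text_id)}>"]
    for seq, rec in enumerate(records, start=1):
        # One PAGE element per scanned image, numbered in capture order.
        lines.append(
            f"<PAGE SEQ={quoteattr(str(seq))} "
            f"IMG={quoteattr(rec.image_filename)} "
            f"TYPE={quoteattr(rec.doc_type)}>"
        )
        if rec.title:
            # Escape &, < and > so translated headings cannot break the markup.
            lines.append(f"<HEAD>{escape(rec.title)}</HEAD>")
        lines.append("</PAGE>")
    lines.append("</TEXT>")
    return "\n".join(lines)


if __name__ == "__main__":
    demo = [
        PageRecord("00000001.tif", "title page", "Title Page"),
        PageRecord("00000002.tif", "chapter", "Trade & Commerce"),
    ]
    print(build_sgml("demo01", demo))
```

Keeping the records in a database and generating the markup from them means that a correction made after proofreading only has to be entered once before the SGML is regenerated.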

To ensure a high level of accuracy in the translated structural metadata, several EAL librarians and staff proofed the metadata after its initial capture by the graduate student. Their review primarily focused on translation errors. Corrections were subsequently made by the student intern.

The capture of the structural metadata often proved to be particularly challenging. Many of the books did not incorporate traditional Western publishing practices. For example, some books did not contain printed page numbers or a table of contents. Others were divided into several parts with each part beginning its pagination sequence over again, or the first part would read from left to right while the second part would read from right to left. Some of the books contained complex foldouts (e.g., maps) without pagination clues, while other texts contained missing pages (obtained via Interlibrary Loan), duplicate pages, wrong page order (reorganized upon disbinding), or multiple photocopied "pages" on one page.
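One way to think about pagination that restarts, or is absent altogether, is to keep a continuous image sequence separate from whatever page number is printed on the page. The Python sketch below illustrates that idea under stated assumptions; the Part structure, its field names, and the sample part names are hypothetical and do not describe the DRL's actual data model.

```python
from typing import NamedTuple, Optional


class Part(NamedTuple):
    name: str         # hypothetical label, e.g. "Part 1" or "Foldout"
    page_count: int   # number of scanned page images in this part
    numbered: bool    # False when the book prints no page numbers here


def page_labels(parts: list[Part]) -> list[tuple[int, str, Optional[int]]]:
    """Return (image sequence, part name, printed page or None) per image."""
    labels = []
    seq = 0
    for part in parts:
        for local in range(1, part.page_count + 1):
            seq += 1  # the image sequence never restarts
            # The printed number restarts with each part, or is missing.
            printed = local if part.numbered else None
            labels.append((seq, part.name, printed))
    return labels


if __name__ == "__main__":
    book = [Part("Part 1", 2, True), Part("Part 2", 2, True), Part("Foldout", 1, False)]
    for row in page_labels(book):
        print(row)
```

With this separation, the third image of the sample book is identified unambiguously as sequence 3 even though its printed page number is 1.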

Digitization of the Texts

Rather than outsource the digitization of the texts, as is its normal custom, the DRL elected to scan the texts itself, and to do so from the original source (instead of the microfilmed versions), in order to better control and manage the digitization process. The DRL chose this method out of concern for properly coordinating the image pagination of the digital surrogates, since many of the original texts do not contain Roman or Arabic numerals. Moreover, many of the texts had to be scanned in "reverse" order (from right to left) since they are not read in the traditional Western practice of left to right.

The ULS Preservation Department disbound the texts for digitization by the DRL (and subsequent microfilming by Preservation Resources). The DRL digitized the texts (10,500 images) on an in-house flatbed scanner and created high-quality master images (600 dpi, 1-bit TIFF 6.0 format).

Serving the Texts Online

The DRL mounts its text collections by employing middleware from the University of Michigan’s Digital Library eXtension Service (DLXS). Access to this suite of tools, along with a licensed SGML-aware search engine (XPAT), enables the DRL to index and serve digital library content. The DRL modified its existing scripts that convert text files (in this case, only bibliographic and structural metadata) into SGML. GIF images are derived on the fly from the TIFF master images for viewing on the Web.
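The source does not describe how the middleware sizes its Web derivatives, but the arithmetic behind scaling a 600 dpi master down for screen display can be sketched in Python. The function name, the 100 dpi target, and the sample pixel dimensions are assumptions chosen for illustration only.

```python
def derivative_size(width_px: int, height_px: int,
                    master_dpi: int = 600, target_dpi: int = 100) -> tuple[int, int]:
    """Scale master pixel dimensions down to an assumed screen resolution."""
    if master_dpi <= 0 or target_dpi <= 0:
        raise ValueError("dpi values must be positive")
    # Multiply before dividing to limit floating-point error, then round to
    # whole pixels without letting either dimension collapse to zero.
    new_width = max(1, round(width_px * target_dpi / master_dpi))
    new_height = max(1, round(height_px * target_dpi / master_dpi))
    return new_width, new_height


if __name__ == "__main__":
    # A hypothetical 6 x 9 inch page scanned at 600 dpi is 3600 x 5400 pixels.
    print(derivative_size(3600, 5400))
```

Deriving the viewing images on request keeps a single master file per page as the authoritative copy.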

The DRL elected not to provide the ability to search the metadata as originally planned. Searching was curtailed due to the small quantity of texts and the potential confusion for the user when attempting to search the English translations of the metadata, rather than Pinyin. Instead, the DRL created a simple browse page to access the texts by category (Primary sources, Reference works) and topic/type.

The DRL also created links to each text's bibliographic record in the University of Pittsburgh’s local catalog (PITTCat) to enable the user to obtain the physical item, or to learn where it is shelved.

Website Design and Dissemination

The DRL worked closely with the student intern to create the general structure, components and navigation of the website. However, the student was responsible for the website's graphic design and layout.

The website was publicly released on 3 May 2005. The EAL created a cataloging record in PITTCat for the website as a whole. Further, a new MARC record was created for each digital text surrogate (i.e., computer file) and uploaded to PITTCat and WorldCat. These steps ensure multiple resource discovery points.