Full-Text Production
The Historic Pittsburgh system is based on the Making of America system at the University of Michigan. For an overview of how the University of Michigan implemented its system, see Making of America: Online Searching and Page Presentation at the University of Michigan.

Digitization Process

The first step of the digitization process is to create a unique document structure spreadsheet for each book. The purpose of the spreadsheet is to classify every page of text. Items such as chapter titles and image captions are recorded as the spreadsheet is prepared. Stray pencil marks are carefully erased and torn pages are mended during spreadsheet creation. The spreadsheet is proofread and then sent with the book to the vendor.

The vendor disbinds the book and scans each page using the spreadsheet as a guideline. The vendor then produces a facsimile reprint of each book and returns it, along with the original book and a CD-ROM containing the scanned images of each page. Quality control is performed by comparing the original book to the facsimile reprint and its corresponding scanned images.

Batch OCR (Optical Character Recognition) is performed on the scanned images in order to make the text fully searchable on the Web. OCR errors are manually corrected wherever the OCR program had difficulty recognizing the text.

Building the Database

An SGML-encoded document is created from the information collected when preparing the spreadsheet, the OCR output, and bibliographic information. The SGML encoding allows access to an automatically generated table of contents as well as full-text searching.

Getting the Information Ready for Searching

The page images and SGML encoding are then put on the server. Each SGML document created from a book is put into the same file as the documents created for the other books. That file is indexed using SGML-aware software.
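
As a rough illustration of what SGML-aware indexing involves, the sketch below builds a small inverted index that records, for every word, both its position and the element that encloses it. It is not the software used for Historic Pittsburgh: the element names (div1, head, p), the RegionIndexer class, and the find helper are hypothetical, and Python's HTML parser stands in for a real SGML parser.

```python
from collections import defaultdict
from html.parser import HTMLParser


class RegionIndexer(HTMLParser):
    """Inverted index that remembers the enclosing element of each word."""

    def __init__(self):
        super().__init__()
        self.open_elements = []         # stack of currently open element names
        self.position = 0               # running word offset in the document
        self.index = defaultdict(list)  # word -> [(position, element), ...]

    def handle_starttag(self, tag, attrs):
        self.open_elements.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open element, tolerating omitted end tags.
        if tag in self.open_elements:
            while self.open_elements.pop() != tag:
                pass

    def handle_data(self, data):
        region = self.open_elements[-1] if self.open_elements else None
        for raw in data.split():
            word = raw.lower().strip('.,;:"')
            if not word:
                continue  # token was punctuation only
            self.index[word].append((self.position, region))
            self.position += 1


def find(index, word, region=None):
    """Return word positions, optionally restricted to one element name."""
    return [pos for pos, reg in index.get(word.lower(), [])
            if region is None or reg == region]


# Hypothetical fragment; real documents would use the project's own DTD.
sample = ("<div1><head>Early Pittsburgh</head>"
          "<p>Pittsburgh grew along the rivers.</p></div1>")
indexer = RegionIndexer()
indexer.feed(sample)
indexer.close()

print(find(indexer.index, "Pittsburgh"))          # every occurrence
print(find(indexer.index, "Pittsburgh", "head"))  # only inside <head>
```

Keeping the element name alongside each position is what lets a search be limited to, say, chapter headings instead of the full text.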
The indexing program not only records the location of words, but also collects information about which region (or element) contains each word.

The original books contain many blank pages and rotated images. The pages have been captured in their native form so that facsimile reprints can be produced. When the images are mounted on the server, scripts are run to insert a message on blank pages: "This page in the original is blank." Scripts are also run to rotate images to the correct orientation for viewing.

Making the Material Accessible from the Web

Every time a user fills in a query box or clicks on a link in the Historic Pittsburgh site, a Common Gateway Interface (CGI) script takes the query from the Web form and translates it into the language of the search engine. The CGI script gathers the results from the search engine and returns them to the user in an understandable format. The CGI scripts used for the Making of America project at the University of Michigan have been modified to work with Historic Pittsburgh materials.

When the user wants to view a particular page, another CGI script retrieves the correct page and sends it through a tif2gif program that converts it from a 600 dpi TIFF image to a GIF image. The image and the page navigation tools are then returned to the user's browser.
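
The query-handling step described above can be sketched roughly as follows. This is only an illustration of the general pattern, not the actual Making of America or Historic Pittsburgh scripts: the form field names (q, where), the REGIONS mapping, and the `find "..." within ...` query syntax are all assumptions made up for the example.

```python
from urllib.parse import parse_qs

# Hypothetical mapping from form choices to SGML element names.
REGIONS = {"title": "head", "fulltext": "text"}


def translate_query(query_string):
    """Turn a Web form query string into a (made-up) search engine query."""
    fields = parse_qs(query_string)
    term = fields.get("q", [""])[0].strip()
    if not term:
        raise ValueError("empty search term")
    where = fields.get("where", ["fulltext"])[0]
    region = REGIONS.get(where, "text")
    escaped = term.replace('"', "")  # keep the quoted phrase well-formed
    return f'find "{escaped}" within {region}'


def format_results(hits):
    """Render (book, page) hits as plain text for the user."""
    lines = [f"{len(hits)} matching page(s):"]
    lines += [f"  {book}, page {page}" for book, page in hits]
    return "\n".join(lines)


print(translate_query("q=iron+works&where=title"))
print(format_results([("Example Book", 12)]))
```

A real script would pass the translated query to the search engine and wrap the formatted results in an HTML page; both steps are omitted here.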