Document Type

Presentation

Abstract

The Early Modern OCR Project (eMOP) is a Mellon Foundation grant-funded project whose goal is to improve optical character recognition (OCR) output for early modern printed English texts by utilizing and creating open-source tools and workflows. In addition to establishing an impressive OCR workflow infrastructure, eMOP has produced several open-source post-processing tools to evaluate and improve the text output of Google’s Tesseract OCR engine. Work on eMOP is expected to be completed this summer, and the team is now looking to apply its accrued proficiency to other projects. The Austin Fanzine Project (AFP) started as a relatively straightforward digitization and transcription project, but has blossomed into a sandbox for creative experimentation with digital archives and digital humanities methods and tools. To date, the project volunteers have digitized fanzines and posted the resulting downloadable files; researched digital archives best practices and crafted project policies; experimented with crowd-sourced transcription and indexing using new open-source software; and explored ways to virtually visualize the connections in real-life communities via maps, e-books, and audio tours. A current focus of the project is demonstrating ways zines can be used in digital scholarship projects to illuminate parts of the culture not documented by mainstream publications, thereby illustrating the value of investing in zine collections. At first blush, these two projects would seem to have little in common. However, many of the challenges faced by eMOP are mirrored in AFP, as are both projects’ commitments to innovation, openness, and crowd-sourced solutions. The printing process in the hand-press period (roughly 1475-1800) often produced texts with fluctuating baselines, mixed fonts, and varied concentrations of ink, among many other variables. Varying paper quality also commonly led to ink bleedthrough, inconsistent glyph shapes, and other problems.
Combining these factors with the poor quality of the images produced via digitization (in Early English Books Online (EEBO) and, to a lesser extent, Eighteenth Century Collections Online (ECCO)) creates significant challenges for OCR software attempting to recognize the text content of these images. Similarly, fanzines, often hand-made and featuring irregular layouts, can present the same kinds of problems for OCR engines. The authors were interested in exploring how the tools and workflows created for eMOP could be utilized in the Austin Fanzine Project. They were also eager to see how tools like From the Page could further innovate the workflow on projects like AFP. The authors will present on the challenges faced and lessons learned in modifying tools and workflows designed for early modern print documents to work with hand-made fanzines, and on the ways in which different tools can be used to create innovative and collaborative work and research spaces.

Disciplines

Arts and Humanities | Digital Humanities

Publication Date

4-10-2015

Language

English
