Automated Analysis of Historical Documents
Researchers in the humanities can spend years traveling to collections around the world to find the primary sources behind the key discoveries that help us understand our cultural heritage. Unfortunately, access to these documents can be limited by many external factors. Because of protective archivists, international travel restrictions, or a simple lack of resources, it is not always possible to reach them physically. For this reason, recent efforts have been made to digitize large collections of historical handwritten manuscripts and to make the scanned images available online. Transcribing handwritten historical documents into machine-encoded text has always been a difficult and time-consuming task; entire academic careers are built around transcribing individual codices and producing a definitive edition. The automatic transcription of handwritten text is known as Handwritten Text Recognition. It is a robust research area for both modern and historical documents, but historical documents pose unique challenges.
We look at how measuring human performance and incorporating that information into the loss function can improve handwritten text transcription on both medieval Latin manuscripts and modern English and French handwriting. We also summarize an interdisciplinary collaborative project in which my collaborators and I created an easy-to-use open-source tool that converts an image of a manuscript page written in the historical Ethiopic script of Ge'ez into a transcription.
Finally, we introduce automated handwriting identification tools whose results can be quickly understood and assessed visually, and which expert paleographers can use as one feature among many when attributing previously unknown scribal hands. We also demonstrate a use case for our software by analyzing several items believed to be written by Thomas Hoccleve, a highly productive clerk of the Privy Seal who was also an important fifteenth-century English poet.
History
Defense Date
2023-07-28
CIP Code
- 40.0501
Research Director(s)
Walter J. Scheirer
Committee Members
Adam Czajka, Kevin Bowyer, Gelila Tilahun
Degree
- Doctor of Philosophy
Degree Level
- Doctoral Dissertation
OCLC Number
1411842187
Additional Groups
- Computer Science and Engineering
Program Name
- Computer Science and Engineering