[Preprint] The Engineering History Project Database: Creating and Linking Datasets from Structured, Semi-Structured, and Unstructured Historical Sources
Posted on 2025-10-16, 16:45. Authored by Israel Solares, Edward Beatty
<p dir="ltr">This paper describes the methods used to collect, organize, clean, and validate data drawn from three different types of digitized historical sources and subsequently linked in a relational database. The constituent data are available as three separate datasets or in a linked, relational format.</p><p dir="ltr">The datasets can be located and cited as follows:</p><p dir="ltr">Israel G. Solares and Edward Beatty (2025). <i>Engineering History Project Dataset</i> (Version v.1) [Dataset]. CurateND. https://doi.org/10.7274/30108082.</p><p dir="ltr">Using three different types of digitized historical sources (one containing structured information, one semi-structured, and one unstructured), we construct a relational database that connects individuals, firms, and textual material related to them. The research project examines the emergence of professional engineering, 1870-1930, and uses the global mining sector as a case study. This paper explains the methods used to construct the three constituent datasets, including the techniques used to clean and validate each. It then explains the methods used to transform and link those datasets, creating a relational database that includes information on roughly 130,000 individuals, over 50,000 firms, and almost 400,000 journal articles. With the linked database, we can trace individuals, firms, and technologies over time and space and identify interconnected communities and networks in a globalized setting. This is a preprint version.</p>