Back in 2012, HRI Digital ran a project with the departments of English, History, and Sociological Studies, looking at participatory search design. The project took as its focus a subset of George Thomason’s 17th-century newsbooks, transcribing every issue of Mercurius Politicus plus the full selection of newsbooks published in 1649 (from the images available through ProQuest’s Early English Books Online). In building the interactive interface, the Newsbooks project focused on how researchers interact with (and want to interact with) such historical texts. Thus, for example, search results may feature other texts published at the same point in time. One problem not resolved in the original phase was variant spellings, and the humanities investigators retained concerns about (in)accuracy in the transcriptions.
The tools tried out for Linguistic DNA (LDNA) have provided a fresh mechanism to improve the Newsbooks’ searchability. Sheffield MA student Amy Jackson recently completed a 100-hour work placement investigating how a MorphAdorned version of the Newsbooks could inform questions about the accuracy of the transcriptions, and how a statistically organised representation of the language data (an early output of LDNA’s processor) affects understanding of the content and context of Thomason’s collection.
Amy reports:
My main task during my placement has been to find errors within the newsbooks, both printing and transcription errors, in order to improve the searchability of the newsbooks. I’ve been using methods such as checking hapax legomena (words that only occur once within a text or collection of texts) and Pointwise Mutual Information (PMI).
Note from the editors: PMI measures word associations by comparing observed co-occurrences with what might be expected in a random wordset (based on the same data). Expect a blog post on this soon!
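To make the editors' note concrete: in its standard form, PMI for a word pair is log2( p(x, y) / (p(x) · p(y)) ), so a score above zero means the pair turns up together more often than chance would predict. The sketch below is only an illustration under stated assumptions, not LDNA's processor: it treats each short segment of text as one co-occurrence window, the function name `pmi_scores` is hypothetical, and the toy segments are invented rather than taken from the newsbooks.

```python
import math
from collections import Counter
from itertools import combinations


def pmi_scores(segments):
    """Return {(w1, w2): PMI} for word pairs sharing a segment.

    `segments` is a list of token lists (assumption: one list per
    sentence or window). Pair keys are in alphabetical order.
    """
    n = len(segments)
    word_counts = Counter()
    pair_counts = Counter()
    for tokens in segments:
        vocab = sorted(set(tokens))  # count each word once per segment
        word_counts.update(vocab)
        pair_counts.update(combinations(vocab, 2))

    scores = {}
    for (w1, w2), observed in pair_counts.items():
        # Joint count we would expect if the two words were independent.
        expected = word_counts[w1] * word_counts[w2] / n
        scores[(w1, w2)] = math.log2(observed / expected)
    return scores


# Invented toy data, for illustration only.
toy_segments = [
    ["loyal", "crown"],
    ["loyal", "crown"],
    ["army", "march"],
    ["army", "crown"],
]
toy_scores = pmi_scores(toy_segments)
for pair, score in sorted(toy_scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

In this toy data the pair that occurs only once, with a word that occurs nowhere else, gets the highest score. That sensitivity to rare items is one reason very high PMI values are a good place to look for one-off spellings and errors.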
The hurried composition of the newsbooks causes problems for searchability. It seems those printing the newsbooks were less concerned with accuracy than those who were printing books. This can be seen in several examples that I have found while searching through the hapax legomena. For example, on one occasion ‘transmitted’ is printed as ‘trasmitte4’, with a ‘4’ being used as a substitute for a missing ‘d’ (see image above). Elsewhere the number ‘8’ is used as a substitute for a capital ‘S’, printing ‘Sea’ as ‘8ea’. Such printing decisions present a specialised problem for searches because they are unusual: a search for the standard spelling will not match them. Knowing this characteristic (replacing letters with numbers) means one can look at modifying search rules to improve the ‘success’ in finding relevant information.
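As a rough sketch of how this might be put to work (not the method used on the placement), the Python below lists hapax legomena, flags those that mix letters and digits, and builds a search pattern that also accepts a digit standing in for a letter. The substitution table contains only the two cases mentioned above; the function names, the tokenising rule, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed tokeniser: runs of letters/digits, allowing internal hyphens.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

# Hypothetical table built from the two examples in the report.
DIGIT_FOR_LETTER = {"d": "4", "s": "8"}


def hapax_legomena(text):
    """Return the words (lower-cased) that occur exactly once in `text`."""
    counts = Counter(t.lower() for t in TOKEN_RE.findall(text))
    return sorted(w for w, c in counts.items() if c == 1)


def mixed_digit_tokens(words):
    """Keep only words containing both a letter and a digit."""
    return [w for w in words if re.search(r"[a-z]", w) and re.search(r"\d", w)]


def search_pattern(query):
    """Compile a regex matching `query` or a digit-for-letter variant."""
    parts = []
    for ch in query.lower():
        digit = DIGIT_FOR_LETTER.get(ch)
        parts.append(f"[{ch}{digit}]" if digit else re.escape(ch))
    return re.compile(r"\b" + "".join(parts) + r"\b", re.IGNORECASE)


# Invented sentence containing the two printed forms quoted above.
sample = "The letters were trasmitte4 by the fleet, and the fleet put to 8ea."
print(mixed_digit_tokens(hapax_legomena(sample)))
print(bool(search_pattern("sea").search(sample)))
```

A rule like this would find ‘8ea’ from a search for ‘Sea’, but a search for ‘transmitted’ would still miss ‘trasmitte4’, because that form also lacks the ‘n’. Catching it would need some form of approximate matching on top of the substitution rule.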
High PMI values can also be used to find unusual words or word pairs that aren’t errors. While I was searching through the high PMI values I came across the word ‘King-chopper’, used as an insult to refer to Colonel John ‘Tinker’ Fox, who was falsely rumoured to be one of King Charles I’s executioners in 1649. The Man in the Moon, the newsbook in which the reference appears, was printed by John Crouch. Crouch was a Royalist journalist who was arrested and imprisoned for printing The Man in the Moon after the King’s death.
Mid-range PMI values are useful for understanding how language was used in the newsbooks. ‘Loyal’ often co-occurs with words such as ‘crown’, ‘royalist’, ‘sovereign’, ‘majesty’, ‘Charles’, ‘usurp’, and ‘treason’. This suggests that the word ‘loyal’ is mainly being used by Royalist newsbooks in 1649 rather than Parliamentarian newsbooks. If I had more time, I would look more closely at the differences in the language used by Royalist and Parliamentarian newsbooks.
PMI and hapax legomena have been useful for finding errors within the newsbooks, but they have mainly provided an interesting way for me to interact with the texts. The PMI data often encouraged me to research the newsbooks and the people who printed them further, and hapax legomena have provided useful insights into how the newsbooks were printed in 1649.