HRI Digital | Linguistic DNA

Text Analytics at Sheffield DH Congress

Earlier in the year (2016), we issued a special call for papers, inviting others to join LDNA panel sessions at the Sheffield Digital Humanities Congress. We were delighted by the responses, and further delighted that the full DHC programme includes plenty of other material relevant to our text analytics interests, as well as a noticeable body of book-historical input.

As a special privilege for those who follow the LDNA blog, here are two bonus abstracts outlining our conception of each LDNA panel:

TA 1: Between numbers and words

Session 4, Friday 9 September
ft. Hine, Shute, Siirtola et al.

Digitisation of texts facilitates kinds of statistical analysis that were previously difficult and perhaps impossible for humans to carry out. This series of papers explores the interface between statistics and close reading, teasing out how these modes of textual analysis can be applied jointly to explore and analyse the material, lexical and semantic form of constituent texts. We discuss the use of quantitative analysis to reassess hypotheses about the work of compositors in fifteenth-century printing. We scrutinise a blueprint for moving between statistical data and words-in-context within collections too big for human reading (with special attention to concept formation). Lastly, we demonstrate how one newly enhanced visualisation tool assists exploratory analysis to generate insights about genre and social variables in digital text collections including early modern correspondence and international Englishes.

TA 2: Identifying complex meanings in historical texts

Session 7, Friday 9 September
ft. Mehl, Recchia, Makela, et al.

With recent advances in computational tools and techniques, researchers are moving closer to the goal of identifying and describing complex meanings—semantic, discursive, social, and otherwise—in historical texts. This session approaches that goal from multiple angles. We discuss semantic meaning in terms of distributional semantic techniques, which connect the study of meaning in the humanities with the quantitative study of language in computational linguistics. We discuss discursive meaning via topic modelling techniques, and also explore the theoretical space between distributional semantics and topic modelling. Finally, we discuss social and historical meanings by looking at possibilities for analysing extra-linguistic contexts alongside linguistic data, within carefully annotated, structured data sets.

If that has whetted your appetite, you will find full abstracts for each paper (and for every paper in the Congress) on the main DHC site.

Last registration date is 7 September.

Errors, searchability, and experiments with Thomason’s Newsbooks

Back in 2012, HRI Digital ran a project, with the departments of English, History, and Sociological Studies, looking at participatory search design. The project took as its focus a subset of George Thomason’s 17th-century newsbooks, transcribing every issue of Mercurius Politicus plus the full selection of newsbooks published in 1649 (from the images available through ProQuest’s Early English Books Online). Building the interactive interface, the Newsbooks project focused on how researchers interact with (and want to interact with) such historical texts. Thus, for example, search results may feature texts published at the same point in time. A problem not resolved in the original phase was variant spellings, and the humanities investigators held onto concerns about (in)accuracy in the transcriptions.

The tools tried out for Linguistic DNA have provided a fresh mechanism to improve the Newsbooks’ searchability. Sheffield MA student Amy Jackson recently completed a 100-hour work placement investigating how a MorphAdorned version of the Newsbooks could inform questions about the accuracy of transcriptions, and how a statistically-organised representation of the language data (an early output of LDNA’s processor) affects understanding of the content and context of Thomason’s collection.

Amy reports:

My main task during my placement has been to find errors within the newsbooks, both printing and transcription errors, in order to improve the searchability of the newsbooks. I’ve been using methods such as checking hapax legomena (words that only occur once within a text or collection of texts) and Pointwise Mutual Information (PMI).

Note from the editors: PMI measures word associations by comparing observed co-occurrences with what might be expected in a random wordset (based on the same data).
Expect a blog post on this soon!
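
For readers who want to see the arithmetic, the usual definition is PMI(x, y) = log2( p(x, y) / (p(x) p(y)) ): the observed probability of the pair divided by the probability expected if the two words were independent. The following is a minimal Python sketch of that calculation, not LDNA’s processor. It assumes that "co-occurrence" means two words standing next to each other; the function name `pmi_scores` and the toy sentences are invented for illustration.

```python
import math
from collections import Counter


def pmi_scores(sentences, min_count=1):
    """Return {(w1, w2): PMI} for adjacent word pairs, using log base 2.

    `sentences` is a list of token lists. Treating adjacent pairs as the
    unit of co-occurrence is an assumption made for this sketch.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in sentences:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_pair = count / n_bi                    # observed pair probability
        p_w1 = unigrams[w1] / n_uni              # probability of each word alone
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


# Invented toy data, not taken from the newsbooks.
toy_sentences = [
    ["king", "charles"],
    ["king", "charles"],
    ["the", "king"],
    ["the", "army"],
]
toy_scores = pmi_scores(toy_sentences)
for pair, score in sorted(toy_scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 3))
```

In this toy data the pair that occurs only once and contains the rarest word scores highest, which is why very high PMI values tend to surface both errors and genuine rarities. Raising `min_count` is one simple way to damp that effect.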

The hurried composition of the newsbooks causes problems for searchability. It seems those printing the newsbooks were less concerned with accuracy than those who were printing books. This can be seen in several examples that I have found while searching through the hapax legomena. For example, on one occasion ‘transmitted’ is printed as ‘trasmitte4’, with a ‘4’ being used as a substitute for a missing ‘d’. Elsewhere the number ‘8’ is used as a substitute for a capital ‘S’, printing ‘Sea’ as ‘8ea’. Such printing decisions present a specialised problem for searches because they are unusual. Knowing this characteristic (replacing letters with numbers) means one can look at modifying search rules so that searches are more likely to find the relevant passages.
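
As a rough illustration of how such a check might be automated, the Python sketch below lists hapax legomena and flags those that mix letters with digits. It is an assumption-laden example rather than the method used on the placement: the function names are invented, and the substitution table contains only the two cases mentioned above (‘4’ for ‘d’, ‘8’ for ‘S’).

```python
import re
from collections import Counter

# Hypothetical substitution table, limited to the two examples in the post.
DIGIT_TO_LETTER = {"4": "d", "8": "s"}


def hapax_legomena(tokens):
    """Return the words that occur exactly once, lower-cased and sorted."""
    counts = Counter(t.lower() for t in tokens)
    return sorted(word for word, count in counts.items() if count == 1)


def suspicious_hapaxes(tokens):
    """Return hapaxes that contain both a letter and a digit."""
    return [
        word
        for word in hapax_legomena(tokens)
        if re.search(r"[a-z]", word) and re.search(r"\d", word)
    ]


def normalise(word):
    """Swap known digit substitutions back to letters for search purposes."""
    return "".join(DIGIT_TO_LETTER.get(ch, ch) for ch in word.lower())


# Invented sentence that reuses the two misprints quoted in the post.
sample_tokens = "the letters were trasmitte4 by 8ea and the ships".split()
flagged = suspicious_hapaxes(sample_tokens)
print(flagged)
print([normalise(word) for word in flagged])
```

Normalising ‘trasmitte4’ gives ‘trasmitted’, which still lacks its ‘n’, so a search rule of this kind would need to be combined with some tolerance for variant spellings. Pure numbers such as ‘1649’ are not flagged, but legitimate mixed forms (for example prices or ordinals) would be, so the output is a list for a human to check.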

High PMI values can also be used to find unusual words or word pairs that aren’t errors. While I was searching through the high PMI values I came across the word ‘King-chopper’, used as an insult to refer to Colonel John ‘Tinker’ Fox, who was falsely rumoured to be one of King Charles I’s executioners in 1649. The Man in the Moon, the newsbook in which the reference appears, was printed by John Crouch. Crouch was a Royalist journalist who was arrested and imprisoned for printing The Man in the Moon after the King’s death.

Mid-range PMI values are useful for understanding how language was used in the newsbooks. ‘Loyal’ often co-occurs with words such as ‘crown’, ‘royalist’, ‘sovereign’, ‘majesty’, ‘Charles’, ‘usurp’, and ‘treason’. This implies that the word ‘loyal’ is mainly being used by Royalist newsbooks in 1649 rather than Parliamentarian newsbooks. If I had more time I would look more closely at the differences in the language used by Royalist and Parliamentarian newsbooks.
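
One simple way to begin that comparison would be to measure how often a word appears in each group of newsbooks, relative to the size of the group. The Python sketch below assumes the texts have already been tokenised and labelled by allegiance; the function names and the two tiny token lists are invented and say nothing about the real newsbooks.

```python
from collections import Counter


def rate_per_thousand(word, tokens):
    """Occurrences of `word` per 1,000 tokens (case-insensitive)."""
    if not tokens:
        return 0.0
    counts = Counter(t.lower() for t in tokens)
    return 1000 * counts[word.lower()] / len(tokens)


def compare_groups(word, groups):
    """Return {group label: rate per 1,000 tokens} for one word."""
    return {label: rate_per_thousand(word, tokens) for label, tokens in groups.items()}


# Invented token lists standing in for two labelled sets of newsbooks.
toy_groups = {
    "royalist": ["loyal", "subjects", "of", "the", "crown"],
    "parliamentarian": ["the", "army", "of", "the", "commonwealth"],
}
loyal_rates = compare_groups("loyal", toy_groups)
print(loyal_rates)
```

Using a rate rather than a raw count matters because the two groups are unlikely to be the same size; with real data one would also want to check whether any difference is larger than chance variation would explain.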

PMI and hapax legomena have been useful for finding errors within the newsbooks, but they have mainly provided an interesting way for me to interact with the texts. The PMI data often encouraged me to research the newsbooks and the people who printed them further, and hapax legomena have provided useful insights into how the newsbooks were printed in 1649.
