Under the surface: SHARP, LDNA and sundry sources

This blog post excerpts material Iona wrote reflecting back on her contribution to the SHARP conference in Paris in July 2016, building on the work of her PhD thesis and incorporating material and processes that have formed part of the Linguistic DNA project. The full post can be found on Iona’s personal blog.

In preparation for the paper, I dedicated time to manually extract, compile and refine measurements for some of the early outputs from the LDNA processor. To fit in with the pledges of my abstract, I targeted the associations of valour and valiant in subsets of EEBO-TCP.

During my PhD, I used EEBO-TCP to provide context for my work with early modern bibles. Valour entered the equation as I examined trends in the translation of a Hebrew collocation gibbor chayil. In the King James Version (publ. 1611) most gibbor chayil men are “mighty . . . of valour”. The repetition of this phrase across the translation means that English bible readers could form associations between the group of characters referred to, in a similar manner to those who encounter the Hebrew narrative directly. For this to happen in translation shows that the translators recognised and (sometimes) prioritised the transmission of this connection; in this respect “mighty of valour” is a partial example of a larger trend in favour of a more technical approach to translation, a move likely influenced by the increasing use of precise cross-referencing in bible reading (facilitated by the introduction of verse numbers throughout the Bible, an innovation of the 1550s). Yet the phrase is intrinsically interesting because before that “valour” was not part of the English biblical lexicon.

Collating instances of gibbor chayil demonstrates that the lexically related “valiant” was used in earlier translations, but in a piecemeal manner (illustrated by the changing distribution of black square bullets in the diagram below).


This diagram, extracted from my SHARP presentation, is one of a series colour-coded to highlight consistency within individual versions with a focus on the characterisation of Boaz. The black square bullets are added to highlight where a form of ‘valiant’ (or for KJ ‘valour’) was used.

By exploring the words valiant and valour with the LDNA tools, I was able to corroborate the impression I had formed during my earlier quantitative and qualitative analysis which was conducted via a standard EEBO-TCP interface.

The PhD bit

Searching hits in the population for the first century of English print (to 1570) and comparing that with the next half century (a collection of documents three times the size) I had observed that the frequency of both valiant and valour increased markedly above expectation.


Comparison of word frequency (hits) and distribution (records, hits per record) in EEBO-TCP for 1473-1570 (P1) and 1571-1620 (P2) expressed in ratios.
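
The logic of that comparison can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the thesis: the function name `ratio_vs_expectation` and the hit counts in the usage line are hypothetical, and the only figure taken from the text is that the later collection is roughly three times the size of the earlier one.

```python
def ratio_vs_expectation(hits_p1: int, hits_p2: int, size_ratio: float = 3.0) -> float:
    """Return the P2/P1 hit ratio relative to the growth of the collection.

    size_ratio is the size of the P2 collection relative to P1 (about three,
    according to the post). A result of 1.0 means the word grew in line with
    the collection; a result above 1.0 means it grew faster than expected.
    """
    if hits_p1 <= 0:
        raise ValueError("hits_p1 must be positive to form a ratio")
    observed_ratio = hits_p2 / hits_p1
    return observed_ratio / size_ratio


# Hypothetical counts, for illustration only (not figures from the thesis).
print(ratio_vs_expectation(hits_p1=200, hits_p2=1800))
```

Dividing by the size ratio is what turns a raw increase into an increase "above expectation": a word whose hits merely triple across the two periods has only kept pace with the larger collection.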

Scrutinising the data by decade exposed some significant textual influences. To quote from my thesis:

87 per cent of occurrences of “valiant” in the corpus for 1520-1529 (316 of a total 363) appear in a two-volume translation of the French chronicles of Froissart, while two other translated works account for a further 9 per cent; just 4 per cent of hits occur in ‘indigenous’ texts.

For “valour”,

a jump in the decade 1570-1579 is significantly related to the publication in 1579 of a translation from Italian: 403 of the decade’s 501 hits appear in a one-volume translation of The historie of Guicciardin conteining the vvarres of Italie and other partes (London, 1579). Once such scrutiny is imposed, it becomes evident that translation had a significant role in the increased currency of these two Latinate terms. It is also evident that the words normally appear in certain genres: conduct books concerned with warfare and chivalric behaviour; and chronicles of past history. This contributes to the recognisable sense of valour as “The quality of mind which enables a person to face danger with boldness or firmness; courage or bravery, esp. as shown in warfare or conflict; valiancy, prowess.” [OED s.v. “valour|valor, n.”, §1c.] This sense, cultivated through translation in the course of the sixteenth century, fits the context in which King James’ translators employ the word.
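
The concentration figures quoted from the thesis can be checked with simple arithmetic. The snippet below is only a worked example; `share_percent` is a name chosen here for illustration, and the counts are those given in the quotations above.

```python
def share_percent(hits_in_text: int, total_hits: int) -> float:
    """Percentage of a decade's hits that fall within a single text."""
    return 100 * hits_in_text / total_hits


# Counts as quoted from the thesis.
froissart = share_percent(316, 363)     # "valiant", 1520-1529
guicciardini = share_percent(403, 501)  # "valour", 1570-1579
print(f"{froissart:.0f}% {guicciardini:.0f}%")  # prints "87% 80%"
```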

The LDNA bit

The subsets of EEBO-TCP sent through the LDNA processor earlier in the year were intentionally compatible with the periodisation of my thesis, providing windows onto English discourse that could be cross-referenced with the publication of particular bibles. The subsets thus incorporate all transcribed material from EEBO (TCP update 2015) known to have been printed during the following spans:

  • 1520-1539 (cf. Coverdale Bible 1535, Matthew Bible 1537, Great Bible 1539)
  • 1550-1569 (Geneva Bible 1560, Bishops Bible 1568); and
  • 1610-1611 (Douai Old Testament 1609-10, King James Version 1611).

Taking the first and last of these, measuring PMI in windows of discourse around the word “valour”, we find marked change in the prominent associations. Our approach yields plentiful data, and we are still thinking through the challenges of visualisation. In the slide shown, I have coloured associated terms according to the innermost window in which the cooccurring lemma rises to prominence. Thus red terms occur frequently in the narrowest window around valour (+/-1 words), orange terms in the expanded window (+/-10 words) that might approximate the surrounding sentence, green for +/-50 words (which now forms the default window size in our public interface) and blue for the wide discursive window of +/-100 words. (Many lemmas appear in more than one window, and the list shown for the later period does not reach to some relevant low frequency items such as “prowess”.)
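
For readers unfamiliar with PMI (pointwise mutual information), the following Python sketch shows one simple way to score co-occurring lemmas within a window around a node word. It is an illustration under stated assumptions, not the LDNA processor: the function name `window_pmi`, the toy token list, and the choice of estimator (comparing a lemma's share of in-window tokens with its share of all tokens) are all assumptions made here. Only the window sizes come from the post.

```python
import math
from collections import Counter


def window_pmi(tokens, node, window):
    """Score each lemma found within +/- `window` tokens of `node`.

    PMI is estimated as log2( P(lemma | inside a window) / P(lemma) ),
    i.e. how much more often the lemma turns up near the node word than
    its overall frequency would predict. This is one simple convention,
    assumed here for illustration.
    """
    n = len(tokens)
    freq = Counter(tokens)
    node_positions = [i for i, tok in enumerate(tokens) if tok == node]

    # Gather every non-node position lying inside at least one window, so a
    # token close to two occurrences of the node is counted only once.
    in_window = set()
    for pos in node_positions:
        start = max(0, pos - window)
        end = min(n, pos + window + 1)
        in_window.update(i for i in range(start, end) if tokens[i] != node)

    observed = Counter(tokens[i] for i in in_window)

    scores = {}
    for lemma, count in observed.items():
        p_in_window = count / len(in_window)
        p_overall = freq[lemma] / n
        scores[lemma] = math.log2(p_in_window / p_overall)
    return scores


# Toy, invented token sequence (already lemmatised), for illustration only.
toy = "the mighty man of valour fight in battle and the price of land be high".split()
for size in (1, 10, 50, 100):  # window sizes mentioned in the post
    top = sorted(window_pmi(toy, "valour", size).items(), key=lambda kv: -kv[1])
    print(size, top[:3])
```

Running the same function at several window sizes mirrors the colour-coding described above: a lemma can be recorded against the narrowest window in which its score first becomes prominent.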


What should be visible is a distinction between the use of “valour” as a synonym of value or worth (prominent in the 1520-1539 subset), and the association with conduct in conflict (dominant in the 1610-1611 dataset). Both senses were part of the Latin root “valeo” and, had King James’ translators ventured it, both could have been played upon to make even more “mighty men of valour” in 1611. (One of the exceptions comes at 2 Kings 15:20, where Menachem taxes all gibbor chayil men, “mighty men of wealth” in the KJV.)

Inevitably, the set of observations I could draw from this investigation is not part of the bottom-up process that LDNA strives to achieve. But the exercise has helped me to think through some different ways we will want to be able to interrogate our data and to study the effects of some different baselines for our expectation calculations. And it demonstrates, I think, the valour of conducting semantic enquiries through discursive windows.



Thesis quotations are from: I. C. Hine, “Englishing the Bible in early modern Europe: The case of Ruth”, PhD thesis (University of Sheffield, 2014), p. 163. These numbers reflect searches conducted through the Chadwyck EEBO interface using its variant spelling option.

The datasets employed in my thesis are not quite identical to those used by the project: LDNA uses a slightly expanded version of the EEBO-TCP collection (last updated early 2015) with its spelling regularised and tokens lemmatised locally using MorphAdorner.

What does EEBO represent? Part I: sixteenth-century English

Ahead of the 2016 Sixteenth Century Conference, Linguistic DNA Research Associate Iona Hine reflected on the limits of what probing EEBO can teach us about sixteenth century English. This is the first of two posts addressing the common theme “What does EEBO represent?”

The 55 000 transcriptions that form EEBO-TCP are central to LDNA’s endeavour to study concepts and semantic change in early modern English. But do they really represent the “universe of English printed discourse”?

The easy answer is “no”. For several reasons:

As is well documented elsewhere, EEBO is not restricted to English-language texts (cf. e.g. Gadd).  Significant bodies of Latin and French documents printed in Britain have been transcribed, and one can browse through a list of other languages identified using ProQuest’s advanced search functionality. To this extent, EEBO represents more than the “universe of English printed discourse”.

But it also represents a limited “universe”. EEBO can only represent what survived to be catalogued. Its full image records represent individual copies. And its transcriptions represent a further subset of the survivals. As the RA currently occupied with reviewing Lost Books (eds. Bruni & Pettegree),* I have a keen awareness of the complex patterns of survival and loss. A prestigious reference work, the must-buy for ambitious libraries, might have a limited print run and yet be almost guaranteed survival–however much it was actively consulted. A popular textbook, priced for individual ownership, would have much higher rates of attrition: dog-eared, out-of-date, disposable. Survival favours genres, and there will be gaps in the English EEBO can represent.

The best function of the “universe” tagline is its emphasis on print. We have limited access to the oral cultures of the past, though as Cathy Shrank’s current project and the Corpus of English Dialogues demonstrate, there are constructions of orality within EEBO. Equally, where correspondence was set in print, correspondence forms a part of EEBO-TCP. There is diversity within EEBO, but it is an artefact that relies on the prior act of printing (and bibliography, microfilm, digitisation, transcription, to be sure). It will never represent what was not printed (and this will mean particular underprivileged Englishes are minimally visible).

There is another dimension of representativeness that matters for LDNA. Drawing on techniques from corpus linguistics makes us aware that, in the past, corpora (collections of texts produced in order to control the analysis of language-in-use) were compiled with considerable attention to the sampling and weighting of different text types. Those using them could be confident about what was in there (journalism? speech? novels?). Do we need that kind of familiarity to work confidently with EEBO-TCP? The question is great enough to warrant a separate post!

The points raised so far have focused on the whole of EEBO. There is an additional challenge when we consider how well EEBO can represent the sixteenth century. Of the ca. 55 000 texts in EEBO-TCP, only 4826 (less than 10 per cent) represent works printed between 1500 and 1599. If we operate with a broader definition, the ‘long sixteenth century’, and impose the limits of the Short Title Catalogue, the period 1470-1640 constitutes less than 25 per cent of EEBO-TCP (12 537 works). And some of those will be in Latin and French!

Of course, some sixteenth century items may be long texts–and the bulging document count of the 1640s is down to the transcription of several thousand short pamphlets and tracts–so that the true weighting of long-sixteenth-century-TCP may be more than the document figures indicate. Yet the statistics are sufficient to suggest we proceed with caution. While one could legitimately posit that the universe of English discourse was itself smaller in the sixteenth century–given the presence of Latin as scholarly lingua franca–it is equally the case that the evidence has had longer to go missing.

As a first post on the theme, this only touches the surface of the discussion about representativeness and limits. Other observations jostle for attention. (For example, diachronic analysis of EEBO material is often dependent on metadata that privileges the printing date, though that may be quite different from the date of composition. A sample investigation of translate’s associations immediately uncovered a fourteenth-century bible preface printed in the 1550s, exposed by the recurrence of Middle English forms “shulen” and “hadden”.) Articulating and exploring what EEBO represents is a task of some complexity. Thank goodness we’ve another 20 months to achieve it!

* Read the full Linguistic DNA review here. The e-edition of Bruni & Pettegree’s volume became open access in 2018.

Digital Humanities 2016, Kraków

Conference reflections jointly written with Justyna Robinson

Four members of the LDNA team—Marc Alexander, Justyna Robinson, Brian Aitken, and Fraser Dallachy—attended this year’s Digital Humanities (DH) conference in Kraków, Poland. With over 800 attendees, the conference is an excellent opportunity to exchange ideas, learn of new areas of potential interest, and network with academics from around the world. The team presented a version of the project’s poster at the event (attached to this post), giving an overview of the project, the technical steps which have been taken so far, and introducing the research themes.

Digital methods of textual analysis are an important subject for the DH attendees, and there were several papers outlining approaches and results from such research. One of the most relevant of these for us was the paper by Glenn Roe et al. on identification of re-used text in Eighteenth Century Collections Online (ECCO). After eliminating re-printings of texts, this project used a specially developed tool which found repeated passages, indicating where an author had re-used their own or another’s words. The results are available and searchable on their website. In the same session, a team led by Monica Berti at Leipzig described a method of identifying and labelling fragments of text quoted from ancient Greek authors. These projects represent something like a parallel research track to ours, tracing the history of ideas through replication of passages rather than through more abstract word clusters. Early English Books Online (EEBO) also received some attention, with Daniel James Powell giving an overview of its history and importance to digital research on historical texts.

Discussion with other attendees at the poster session was especially productive, and resulted in several strong leads for the team to follow up. A subject which was mentioned to us repeatedly was that of topic modelling. Multiple panels were dedicated to the use of these methods to extract information about the contents of texts, an approach which LDNA has considered employing. The team at Saarland studying the Royal Society Corpus (with whom LDNA is already in contact) use topic modelling to study the development of scientific concepts and terminology. Their results were encouraging, allowing them to identify word groupings which represent scientific disciplines such as physiology, mechanical engineering, and metallurgy. Following these topics through time showed that the number of topics increases whilst their vocabulary becomes more specialised. Although LDNA has reservations about how useful topic modelling is for our purposes, the work being conducted at Saarland refines and implements its methodology in a way which we would seek to learn from if we do choose to pursue it further.


At the poster session

Visualising big data is of central interest to the LDNA project, especially in the context of the upcoming LDNA Visualisation Workshop. With this in mind, we paid particular attention to projects that presented new and interesting ways of seeing large data. A number of presentations focused on network visualisations. These often link metadata, e.g. around social networks of royal societies or academies as based on letter correspondence. An interesting visualisation of unstructured linguistic data was presented by the EPFL team. Vincent Buntinx, Cyril Bornet, and Frédéric Kaplan visualised lexical usage in 200 years of newspapers on a circle with the radial dimension representing the number of years a word has been in use, and the circumferential dimension showing a period of use of words. [1]

Stylometrics, with its interest in being able to identify and measure aspects of language which contribute to the impression of authorial style, produced some interesting papers. One of the common themes for stylometrics and other DH strands of research is the way concepts are operationalised. The varied approaches to concepts taken by DH researchers were noticeable, for example, whether each noun can be considered to be a concept, or a concept can be defined as “a functional thing”. This suggests that the work on concept identification undertaken by the LDNA team will be of interest to the wider DH community. Also amongst the stylometric papers was a look at historical language change by Maciej Eder and Rafal Górski which used bootstrap consensus network analysis on part of speech (POS) tagged texts to contrast syntax and sentence structure between time periods. The paper used multidimensional scaling (MDS) to reduce POS tagged texts to a single value which could then be plotted against time, allowing them to show that a gradual change in the MDS results can be discerned between the earliest and latest texts. The paper highlighted both how useful a visualisation can be for identifying a change, and how difficult it can be to quantify exactly what the visualisation shows.

On a different but very important note, a strong theme of the conference was that of diversity, with a thread of panels discussing the different ways in which this subject is applicable to the digital humanities. From a personal point of view, I think LDNA has a strong awareness of both the scope and the limitations of our interests and approaches (although we can never afford to be complacent). We’ve considered what our textual resources represent, and the RAs are soon to explore this subject from different angles in blog posts. EEBO and other text collections are more expansive, inclusive, and diverse than prior research has been able to access, and this feels like a part of an enormously positive movement in academia to open up more and more data for new kinds of study. As extensive as our resources are, however, they still have limitations reflecting the (mostly Western, mostly white, mostly male, mostly middle-to-upper class) societal groups who were able to read, write, and print the words which ended up in these collections. The resources open to academia are continually growing, and hopefully this expanding diversity will open up ever more of the world’s knowledge to ever more of its population. Whilst the discussions at this conference have made clear that there is a long way to go in fully embracing diversity in the digital humanities, there are indications that the situation is improving, and it is incumbent upon us all to ensure that this continues.

For another view of the conference, Brian Aitken, Digital Humanities Research Officer at Glasgow, has written about his own experience on his blog.


1. Studying Linguistic Changes on 200 Years of Newspapers, Vincent Buntinx, Cyril Bornet, Frédéric Kaplan (EPFL (École polytechnique fédérale de Lausanne), Switzerland)

Text Analytics at Sheffield DH Congress

Earlier in the year (2016), we issued a special call for papers, inviting others to join LDNA panel sessions at the Sheffield Digital Humanities Congress. We were delighted by the responses, and further delighted that the full DHC programme includes plenty of other material relevant to our text analytics interests–and a noticeable body of book historical input too.

As a special privilege for those who follow the LDNA blog, here are two bonus abstracts outlining our conception of each LDNA panel:

TA 1: Between numbers and words

Session 4, Friday 9 September
ft. Hine, Shute, Siirtola et al.

Digitisation of texts facilitates kinds of statistical analysis that were previously difficult and perhaps impossible for humans to carry out. This series of papers explores the interface between statistics and close reading, teasing out how these modes of textual analysis can be applied jointly to explore and analyse the material, lexical and semantic form of constituent texts. We discuss the use of quantitative analysis to reassess hypotheses about the work of compositors in fifteenth-century printing. We scrutinise a blueprint for moving between statistical data and words-in-context within collections too big for human reading (with special attention to concept formation). Lastly, we demonstrate how one newly-enhanced visualisation tool assists exploratory analysis to generate insights about genre and social variables in digital text collections including early modern correspondence and international Englishes.

TA 2: Identifying complex meanings in historical texts

Session 7, Friday 9 September
ft. Mehl, Recchia, Makela, et al.

With recent advances in computational tools and techniques, researchers are moving closer to the goal of identifying and describing complex meanings—semantic, discursive, social, and otherwise—in historical texts. This session approaches that goal from multiple angles. We discuss semantic meaning in terms of distributional semantic techniques, which connect the study of meaning in the humanities with the quantitative study of language in computational linguistics. We discuss discursive meaning via topic modelling techniques, and also explore the theoretical space between distributional semantics and topic modelling. Finally, we discuss social and historical meanings by looking at possibilities for analysing extra-linguistic contexts alongside linguistic data, within carefully annotated, structured data sets.


If that’s whetted your appetite, you will find full abstracts for each paper–and for every paper in the Congress–on the main DHC site.

Last registration date is 7 September.