Linguistic DNA at SRS 2018: Abstracts

Knowledge, truth and expertise: experiments with Early English Books Online

Wondering what Linguistic DNA is bringing to the Society for Renaissance Studies? Here are the abstracts for two panels of papers, and information about our hands-on demonstration session (drop in).

United by a common interest in data-driven approaches to meaning and a focus on the transcribed portions of Early English Books Online (EEBO-TCP), this interdisciplinary panel brings together new research from the Linguistic DNA project and the Cambridge Concept Lab. 

What is EEBO anyway? Contextual study of a universe in print
Iona Hine and Susan Fitzmaurice (University of Sheffield)

Since 2015, the Linguistic DNA team has been developing methods for mapping meaning and change-in-meaning in Early Modern English. Our work begins with the hypothesis that meanings are not equivalent to words, and can be invoked in many different ways. For example, when Early Modern writers discuss processes of democracy, there is no guarantee they will also employ a keyword such as democracy. We adopt a data-driven approach, using measures of frequency and proximity to track associations between words in texts over time. Strong patterns of co-occurrence between words allow us to build groups of words that collectively represent meanings-in-context (textual and historical). We term these groups “discursive concepts”.

The task of modelling discursive concepts in textual data has been absorbing and challenging, both theoretically and practically. Our main dataset, transcriptions of texts from Early English Books Online (EEBO-TCP), contains more than 50 000 texts. These include 9000 single-page broadsheets and 162 volumes that span more than 1000 pages. There are 127 items printed pre-1500, and nearly 7000 from the 1690s. The process of analysis therefore requires us to think carefully about how best to control and report on this variation in data distribution.

One particular question that has arisen affects all who attempt to use EEBO: what is in it? To what extent is its material from pre-1500 similar in kind (genre, immediacy, etc.) to that of the messy 1550s (as the English throne shifted speedily between Edward VI and his siblings), the 1610s (era of Shakespeare and the King James Version), or the 1640s (when Civil War raged)? This paper is a sustained reflection on attempts to find out “What’s in EEBO?”

In the beginning was the word?
EEBO-TCP and another universe of meaning

Seth Mehl (University of Sheffield)

When a new idea is conceived, how does it find expression in language? Between 1450 and 1750, the English lexicon expanded dramatically, and literary scholars, philologists, linguists, and historians have sought to document and demonstrate the paths taken by key social and cultural vocabulary, charting the history of what would become key social and cultural ideas, discourses, and concepts. In such cases, the topic and language for investigation has been intuited on the basis of extended qualitative reading, and the objects of investigation tend to be individual words. With the advent of a searchable database of early modern texts, such intuitions can be tested at scale, and the initial object of inquiry can shift from individual words to relationships between sets of words.

What happens when we invert the traditional process, taking the thousands of texts digitised in EEBO-TCP and applying computational techniques to model language change independent of human intuition? Can such techniques indicate meaningful relationships between key words that human researchers had not intuited or observed? To what extent do observations founded on over 1 billion words of early modern English correspond to and diverge from what scholarly readers have already inferred? Is it possible to identify discourses around key ideas even when the apparently related key words are absent? Combining insights from the Keywords Project with tools developed by the Linguistic DNA project, this paper will explore how concept modelling can be applied to re-examine meaning in early modern texts.

Beyond Power Steering:
re-constituting structures of knowledge in 17th-century texts

John Regan (University of Cambridge)

One of the axioms of the Cambridge Concept Lab is that digital means of enquiry should provide qualitatively new kinds of knowledge, if we are to realise their full value. This is to say, that computation should not merely provide ‘power steering for the humanities’, but allow one to discover something different in kind about how knowledge was structured in the past.

Making good on this axiom necessitates judgements on the part of the user of digital technology about how to design one’s modes of address to (for example) natural language data sets such as EEBO-TCP, in order that one is not only adding ‘power steering’ to existing, familiar types of enquiry. It also necessitates making decisions about when to come to rest at results (that is, when to cease enquiry); judgements of where digital data can be said to be producing discrete and unfamiliar forms of knowledge.

This paper will present tentative first signs of what the Cambridge Concept Lab believe are historically-discrete conceptual structures, based on data from the early seventeenth-century portion of EEBO-TCP. Two such structures will be described, one entitled ‘Mutual Dependence’, the other ‘Self-Consistency’. As will be shown, familiar forms of knowledge that are held and expressed in sentences and paragraphs, organised by grammar and understood by readers largely as explicit sense, may be contrasted with this evidence of qualitatively different conceptual structures in the textual record. While this paper does not set out to debunk existing theories of the structuration of knowledge and its transmission in the seventeenth century as have become established through centuries of close reading, it does seek to enrich our understanding of these traditions by attending to conceptual, and not exclusively semantic, thematic or rhetorical, structures.

It appears uncontroversial to assert that concepts are determining with regard to features of language use such as explicit and implicit semantic fields, theme, word order, and syntactic relations at the level of the sentence. Nevertheless, recognising that concepts have lexical and semantic extension is not the same as accepting that the two are identical in kind. This paper’s claims about conceptual structure will be based upon evidence from the early decades of seventeenth-century data from EEBO-TCP.

Our afternoon panel is a little depleted (by ill-health) but features Jose M. Cree (Sheffield) on Neologisms and the English reformation, Lucas van der Deijl (Amsterdam) on The collaborative Dutch translations of Descartes by Jan Hendrik Glazemaker (1620-1682), and a little extra time for discussion.


All SRS delegates are very welcome to drop in to our demo workshop, where we will be providing a 10-15-minute introduction to our tools (3:30pm, repeated at 4:30pm) and the opportunity for hands-on experimentation. This is in the Hicks Building, Floor G, room 29. (About 2 minutes’ walk from Jessop West, across the main road and a little uphill.)

Snapshot from campus map, featuring the Hicks Building.

Translation, Gender, Sexuality: a report from Genealogies of Knowledge 2017

In December 2017, Sheffield MA student Nathaniel Dziura attended part of the Genealogies of Knowledge conference in Manchester. While the LDNA team were exchanging conceptual insights with other data-driven scholars, Nathaniel participated in sessions connected to a different field of interest. He writes:

As a member of the LGBTQ+ community, I am keen to contribute to research on how social factors impact language use, particularly gender and sexuality. As a second-generation Polish immigrant, raised with influence from both Polish and English culture, I am also very interested in the effect cultural background can have on the production of linguistic features.

Next year, I hope to start a PhD focused on this interplay between social and linguistic elements. Schumann (1986) suggested that the degree of ‘acculturation’ influences use of non-standard variants in second language learners. In other words, if the speaker is more immersed in the culture of their second language, they will be more likely to acquire native speaker-like linguistic variation. However, previous studies have not considered how other social factors such as sexuality might affect which features are acquired. This is despite earlier work having shown certain linguistic features to be cross-culturally associated with LGBTQ+ membership. These features include fronted /s/ (Levon, 2006; Pharao et al., 2014) – colloquially stereotyped as the ‘gay lisp’ – and creaky voice (Zimman, 2013: 3) – speaking with a low elongated ‘creak’, like a stereotypical ‘valley girl’. LGBTQ+ people do not inherently use these features, but they can play an important part in interaction (Barrett, 2017: 9).

I want to help fill this gap in the research by investigating how sexuality might affect the linguistic variants acquired in English by second language speakers (specifically, Polish migrants to England). I will examine whether the use of these features differs depending on two variables: the level of integration into British culture, and the level of involvement with the LGBTQ+ community.

This was the project I had in mind as I headed to Manchester for the conference. I was rewarded by an excellent thematic session on ‘Translation, Gender, Sexuality’.

I found Przemysław Uściński and Agnieszka Pantuchowicz’s presentations to be pertinent and insightful. Uściński’s talk focused on the pitfalls of approaching Queer Theory in Poland from a ‘Western perspective’. The political environments in Poland and England have differed historically, and continue to do so. Uściński argues that ‘LGBT emancipation’ has not yet occurred in Poland. Critical theorisations of gender are intentionally scarce in Polish academic discourse. The reception of Queer Theory in academia has been comparatively belated, and has sometimes discredited the LGBTQ+ movement. British society has its share of problems with LGBTQ+-phobia, yet Poland has seen much far-right and religious rejection of the LGBTQ+ community. These groups have dismissed LGBTQ+ identities as ‘Western secular propaganda’ and ‘gender ideology’. So, English translations of concepts within Queer Theory, which are gradually being introduced to Polish academic works, reflect English notions and societal progress. Even when concepts from Queer Theory enter Polish, there is no possibility of their dissemination within Polish society. Queer Theory tends to be viewed as a ‘foreign’ and subversive concept: a theoretical importation into Polish from English, and not one congruous with Polish culture.

In another paper, Pauline Henry-Tierney noted that misinterpretations in translation of Beauvoir’s ‘Mauvaise Foi’ have slowed academic progress on the subject. Taking this into account, perhaps misinterpretations of Queer Theory as a ‘foreign’ concept to Poland are hindering the normalisation of LGBTQ+ concepts and perpetuating their perception as something radical and provocative.

This thematic session highlighted that introducing concepts into a language through translation can be a step towards spreading those ideas within another culture. However, this alone might not be enough to achieve society’s understanding and acceptance of those concepts. The translation of Queer Theory between cultures was not an issue I had previously considered. This thematic session reinforced that the political and social environments in Polish and English culture exhibit stark differences. This is significant within the framework of acculturation: LGBTQ+ community membership is arguably more accepted in British culture, and consequently so are associated non-standard language features. So one might predict that LGBTQ+ Polish migrants to England who become more British-acculturated are more likely to produce non-standard features associated with LGBTQ+-community membership than those who are less British-acculturated.

Overall, I was able to interact with academics from areas such as translation studies and politics with whom I would not otherwise be able to network. I am very grateful to the Linguistic DNA team for inviting me to attend the conference. The insights it has given me will be useful in my academic pursuits!

Featured image:
Jaap Verheul (Utrecht) presents an example from ShiCo research at the Genealogies of Knowledge conference, 8 December. Photo (c) I.C. Hine.


Barrett, R. (2017) From Drag Queens to Leathermen: Language, Gender, and Gay Male Subcultures (Studies in Language Gender and Sexuality) Oxford: Oxford University Press

Henry-Tierney, P. (2017) ‘Translating in ‘Bad Faith’? Articulations of Beauvoir’s ‘Mauvaise Foi’ in English’, Genealogies of Knowledge I: Translating Political and Scientific Thought across Time and Space, Manchester: University of Manchester


Pantuchowicz, A. (2017) ‘Translation and the Failure of Gender Mainstreaming in Poland’ Genealogies of Knowledge I: Translating Political and Scientific Thought across Time and Space, Manchester: University of Manchester

Pharao, N., M. Maegaard, J. S. Møller & T. Kristiansen (2014) ‘Indexical meanings of [s] among Copenhagen youth: Social perception of a phonetic variant in different prosodic contexts’ Language in Society 43, 1–31

Schumann, J. H. (1986) ‘Research on the acculturation model for second language acquisition’ Journal of Multilingual and Multicultural Development 7, 379–392

Uściński, P. (2017) ‘Thinking Sexuality/Translating Politics: Queerness in(to) Polish’ Genealogies of Knowledge I: Translating Political and Scientific Thought across Time and Space, Manchester: University of Manchester

Zimman, L. (2013) ‘Hegemonic masculinity and the variability of gay-sounding speech: The perceived sexuality of transgender men’ Journal of Language & Sexuality 2 (1): 1-39

Seth and Iona present a joint paper with LDNA data at Genealogies of Knowledge. Photos (c) Jaap Verheul.

The Edge

LDNA at Digital Humanities Congress 2016, Sheffield

LDNA organised two panels at the 2016 Digital Humanities Congress (DHC; Sheffield, 8th-10th September). Both focused on text analytics, with the first adopting the theme ‘Between numbers and words’, and the second ‘Identifying complex meanings in historical texts’. Fraser reports:

Under the surface: SHARP, LDNA and sundry sources

This blog post excerpts material Iona wrote reflecting back on her contribution to the SHARP conference in Paris in July 2016, building on the work of her PhD thesis and incorporating material and processes that have formed part of the Linguistic DNA project. The full post can be found on Iona’s personal blog.

In preparation for the paper, I dedicated time to manually extract, compile and refine measurements for some of the early outputs from the LDNA processor. To fit in with the pledges of my abstract, I targeted the associations of valour and valiant in subsets of EEBO-TCP.

During my PhD, I used EEBO-TCP to provide context for my work with early modern bibles. Valour entered the equation as I examined trends in the translation of a Hebrew collocation gibbor chayil. In the King James Version (publ. 1611) most gibbor chayil men are “mighty . . . of valour”. The repetition of this phrase across the translation means that English bible readers could form associations between the group of characters referred to, in a similar manner to those who encounter the Hebrew narrative directly. For this to happen in translation shows that the translators recognised and (sometimes) prioritised the transmission of this connection; in this respect “mighty of valour” is a partial example of a larger trend in favour of a more technical approach to translation, a move likely influenced by the increasing use of precise cross-referencing in bible reading (facilitated by the introduction of verse numbers throughout the Bible, an innovation of the 1550s). Yet the phrase is intrinsically interesting because before that “valour” was not part of the English biblical lexicon.

Collating instances of gibbor chayil demonstrates that the lexically related “valiant” was used in earlier translations, but in a piecemeal manner (illustrated by the changing distribution of black square bullets in the diagram below).


This diagram, extracted from my SHARP presentation, is one of a series colour-coded to highlight consistency within individual versions with a focus on the characterisation of Boaz. The black square bullets are added to highlight where a form of ‘valiant’ (or for KJ ‘valour’) was used.

By exploring the words valiant and valour with the LDNA tools, I was able to corroborate the impression I had formed during my earlier quantitative and qualitative analysis, which was conducted via a standard EEBO-TCP interface.

The PhD bit

Searching hits in the population for the first century of English print (to 1570) and comparing that with the next half century (a collection of documents three times the size) I had observed that the frequency of both valiant and valour increased markedly above expectation.


Comparison of word frequency (hits) and distribution (records, hits per record) in EEBO-TCP for 1473-1570 (P1) and 1571-1620 (P2) expressed in ratios.
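The logic of comparing frequency "above expectation" can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the procedure used in the thesis: the function name and all the figures below are invented, and "size" is assumed to be whatever measure of collection size is in use (records or words). If a word were used at a constant rate, its hits would grow in step with the size of the collection, so the interesting number is the observed growth divided by that expected growth.

```python
def growth_vs_expectation(hits_p1, hits_p2, size_p1, size_p2):
    """Compare growth in hits with growth in collection size.

    Returns (observed ratio, expected ratio, observed / expected).
    A final value above 1 means the word grew faster than the collection.
    """
    observed = hits_p2 / hits_p1      # how much the hit count grew
    expected = size_p2 / size_p1      # growth expected from size alone
    return observed, expected, observed / expected


# Invented figures for illustration only (not EEBO-TCP counts):
# the second collection is three times the size of the first.
observed, expected, excess = growth_vs_expectation(
    hits_p1=200, hits_p2=1800, size_p1=1_000_000, size_p2=3_000_000
)
print(f"observed x{observed:.1f}, expected x{expected:.1f}, excess x{excess:.1f}")
```

The same ratio can be computed for records or for hits per record, which helps separate a word that appears in more texts from one that appears more often within a few texts.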

Scrutinising the data by decade exposed some significant textual influences. To quote from my thesis:

87 per cent of occurrences of “valiant” in the corpus for 1520-1529 (316 of a total 363) appear in a two-volume translation of the French chronicles of Froissart, while two other translated works account for a further 9 per cent; just 4 per cent of hits occur in ‘indigenous’ texts.

For “valour”,

a jump in the decade 1570-1579 is significantly related to the publication in 1579 of a translation from Italian: 403 of the decade’s 501 hits appear in a one-volume translation of The historie of Guicciardin conteining the vvarres of Italie and other partes (London, 1579). Once such scrutiny is imposed, it becomes evident that translation had a significant role in the increased currency of these two Latinate terms. It is also evident that the words normally appear in certain genres: conduct books concerned with warfare and chivalric behaviour; and chronicles of past history. This contributes to the recognisable sense of valour as “The quality of mind which enables a person to face danger with boldness or firmness; courage or bravery, esp. as shown in warfare or conflict; valiancy, prowess.” [OED s.v. “valour|valor, n.”, §1c.] This sense, cultivated through translation in the course of the sixteenth-century, fits the context in which King James’ translators employ the word.

The LDNA bit

The subsets of EEBO-TCP sent through the LDNA processor earlier in the year were intentionally compatible with the periodisation of my thesis, providing windows onto English discourse that could be cross-referenced with the publication of particular bibles. The subsets thus incorporate all transcribed material from EEBO (TCP update 2015) known to have been printed during the following spans:

  • 1520-1539 (cf. Coverdale Bible 1535, Matthew Bible 1537, Great Bible 1539)
  • 1550-1559 (Geneva Bible 1560, Bishops Bible 1568); and
  • 1610-1611 (Douai Old Testament 1609-10, King James Version 1611).

Taking the first and last of these, measuring pointwise mutual information (PMI) in windows of discourse around the word “valour”, we find marked change in the prominent associations. Our approach yields plentiful data, and we are still thinking through the challenges of visualisation. In the slide shown, I have coloured associated terms according to the innermost window in which the co-occurring lemma rises to prominence. Thus red terms occur frequently in the narrowest window around valour (+/-1 words), orange terms in the expanded window (+/-10 words) that might approximate the surrounding sentence, green for +/-50 words (which now form the default window size in our public interface) and blue for the wide discursive window of +/-100 words. (Many lemmas appear in more than one window, and the list shown for the later period does not reach to some relevant low-frequency items such as “prowess”.)
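For readers unfamiliar with window-based PMI, here is a minimal Python sketch of the general idea. It is not the LDNA processor: the function name, the toy sentence and the way the expected count is estimated (node frequency × word frequency × window size ÷ corpus size) are all assumptions made for illustration, and it works on raw tokens rather than lemmas. PMI is the base-2 logarithm of the observed co-occurrence count divided by the count expected if the two words were distributed independently.

```python
import math
from collections import Counter


def window_pmi(tokens, node, span):
    """PMI for every word found within +/- `span` tokens of `node`.

    Expected co-occurrence assumes independence and a full window of
    2 * span positions; windows cut short at the edges of the text are
    not corrected for in this sketch.
    """
    n = len(tokens)
    freq = Counter(tokens)
    cooc = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start, end = max(0, i - span), min(n, i + span + 1)
        for j in range(start, end):
            if j != i:
                cooc[tokens[j]] += 1
    scores = {}
    for word, observed in cooc.items():
        expected = freq[node] * freq[word] * 2 * span / n
        scores[word] = math.log2(observed / expected)
    return scores


# Toy text, invented for illustration.
tokens = "mighty men of valour and men of war shewed valour in battle".split()
narrow = window_pmi(tokens, "valour", span=1)   # immediate neighbours only
wide = window_pmi(tokens, "valour", span=10)    # roughly sentence-sized window
print(sorted(narrow, key=narrow.get, reverse=True))
```

Running the same function with several values of `span` gives one ranked list per window, which is the kind of output the colour-coding described above is trying to summarise. Changing how `expected` is calculated is one way to experiment with different baselines.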


What should be visible is a distinction between the use of “valour” as a synonym of value or worth (prominent in the 1520-1539 subset), and the association with conduct in conflict (dominant in the 1610-1611 dataset). Both senses were part of the Latin root “valeo” and, had King James’ translators ventured it, both could have been played upon to make even more “mighty men of valour” in 1611. (One of the exceptions comes at 2 Kings 15:20, where Menachem taxes all gibbor chayil men, “mighty men of wealth” in the KJV.)

Inevitably, the set of observations I could draw from this investigation is not part of the bottom-up process that LDNA strives to achieve. But the exercise has helped me to think through some different ways we will want to be able to interrogate our data and to study the effects of some different baselines for our expectation calculations. And it demonstrates, I think, the valour of conducting semantic enquiries through discursive windows.



Thesis quotations are from: I. C. Hine, “Englishing the Bible in early modern Europe: The case of Ruth”, PhD thesis (University of Sheffield, 2014), p. 163. These numbers reflect searches conducted through the Chadwyck EEBO interface using its variant spelling option.

The datasets employed in my thesis are not quite identical to those used by the project: LDNA uses a slightly expanded version of the EEBO-TCP collection (last updated early 2015) with its spelling regularised and tokens lemmatised locally using MorphAdorner.

Digital Humanities 2016, Kraków

Conference reflections jointly written with Justyna Robinson

Four members of the LDNA team—Marc Alexander, Justyna Robinson, Brian Aitken, and Fraser Dallachy—attended this year’s Digital Humanities (DH) conference in Kraków, Poland. With over 800 attendees, the conference is an excellent opportunity to exchange ideas, learn of new areas of potential interest, and network with academics from around the world. The team presented a version of the project’s poster at the event (attached to this post), giving an overview of the project, the technical steps which have been taken so far, and introducing the research themes.

Digital methods of textual analysis are an important subject for the DH attendees, and there were several papers outlining approaches and results from such research. One of the most relevant of these for us was the paper by Glenn Roe et al. on identification of re-used text in Eighteenth Century Collections Online (ECCO). After eliminating re-printings of texts, this project used a specially developed tool which found repeated passages, indicating where an author had re-used their own or another’s words. The results are available and searchable on their website. In the same session, a team led by Monica Berti at Leipzig described a method of identifying and labelling fragments of text quoted from ancient Greek authors. These projects represent something like a parallel research track to ours, tracing the history of ideas through replication of passages rather than through more abstract word clusters. Early English Books Online (EEBO) also received some attention, with Daniel James Powell giving an overview of its history and importance to digital research on historical texts.

Discussion with other attendees at the poster session was especially productive, and resulted in several strong leads for the team to follow up. A subject which was mentioned to us repeatedly was that of topic modelling. Multiple panels were dedicated to the use of these methods to extract information about the contents of texts, an approach which LDNA has considered employing. The team at Saarland studying the Royal Society Corpus (with whom LDNA is already in contact) use topic modelling to study the development of scientific concepts and terminology. Their results were encouraging, allowing them to identify word groupings which represent scientific disciplines such as physiology, mechanical engineering, and metallurgy. Following these topics through time showed that the number of topics increases whilst their vocabulary becomes more specialised. Although LDNA has reservations about how useful topic modelling is for our purposes, the work being conducted at Saarland refines and implements its methodology in a way which we would seek to learn from if we do choose to pursue it further.


At the poster session

Visualising big data is of central interest to the LDNA project, especially in the context of the upcoming LDNA Visualisation Workshop. With this in mind, we paid particular attention to projects that presented new and interesting ways of seeing large data. A number of presentations focused on network visualisations. These often link metadata, e.g. around social networks of royal societies or academies as based on letter correspondence. An interesting visualisation that presents unstructured linguistic data came from the EPFL team. Vincent Buntinx, Cyril Bornet, and Frédéric Kaplan visualised lexical usage in 200 years of newspapers on a circle, with the radial dimension representing the number of years a word has been in use, and the circumferential dimension showing the period in which the word was used. [1]

Stylometrics, with its interest in being able to identify and measure aspects of language which contribute to the impression of authorial style, produced some interesting papers. One of the common themes for stylometrics and other DH strands of research is the way concepts are operationalised. The varied approaches to concepts taken by DH researchers were noticeable, for example, whether each noun can be considered to be a concept, or a concept can be defined as “a functional thing”. This suggests that the work on concept identification undertaken by the LDNA team will be of interest to the wider DH community. Also amongst the stylometric papers was a look at historical language change by Maciej Eder and Rafal Górski which used bootstrap consensus network analysis on part-of-speech (POS) tagged texts to contrast syntax and sentence structure between time periods. The paper used multidimensional scaling (MDS) to reduce POS-tagged texts to a single value which could then be plotted against time, allowing them to show that a gradual change in the MDS results can be discerned between the earliest and latest texts. The paper highlighted both how useful a visualisation can be for identifying a change, and how difficult it can be to quantify exactly what the visualisation shows.

On a different but very important note, a strong theme of the conference was that of diversity, with a thread of panels discussing the different ways in which this subject is applicable to the digital humanities. From a personal point of view, I think LDNA has a strong awareness of both the scope and the limitations of our interests and approaches (although we can never afford to be complacent). We’ve considered what our textual resources represent, and the RAs are soon to explore this subject from different angles in future blog posts. EEBO and other text collections are more expansive, inclusive, and diverse than prior research has been able to access, and this feels like a part of an enormously positive movement in academia to open up more and more data for new kinds of study. As extensive as our resources are, however, they still have limitations reflecting the (mostly Western, mostly white, mostly male, mostly middle-to-upper class) societal groups who were able to read, write, and print the words which ended up in these collections. The resources open to academia are continually growing, and hopefully this expanding diversity will open up ever more of the world’s knowledge to ever more of its population. Whilst the discussions at this conference have made clear that there is a long way to go in fully embracing diversity in the digital humanities, there are indications that the situation is improving, and it is incumbent upon us all to ensure that this continues.

For another view of the conference, Brian Aitken, Digital Humanities Research Officer at Glasgow, has written about his own experience on his blog.


1. Studying Linguistic Changes on 200 Years of Newspapers, Vincent Buntinx, Cyril Bornet, Frédéric Kaplan (EPFL (École polytechnique fédérale de Lausanne), Switzerland)

Text Analytics at Sheffield DH Congress

Earlier in the year (2016), we issued a special call for papers, inviting others to join LDNA panel sessions at the Sheffield Digital Humanities Congress. We were delighted by the responses, and further delighted that the full DHC programme includes plenty of other material relevant to our text-analytics interests – and a noticeable body of book-historical input too.

As a special privilege for those who follow the LDNA blog, here are two bonus abstracts outlining our conception of each LDNA panel:

TA 1: Between numbers and words

Session 4, Friday 9 September
ft. Hine, Shute, Siirtola et al.

Digitisation of texts facilitates kinds of statistical analysis that were previously difficult and perhaps impossible for humans to carry out. This series of papers explores the interface between statistics and close reading, teasing out how these modes of textual analysis can be applied jointly to explore and analyse the material, lexical and semantic form of constituent texts. We discuss the use of quantitative analysis to reassess hypotheses about the work of compositors in fifteenth-century printing. We scrutinise a blueprint for moving between statistical data and words-in-context within collections too big for human reading (with special attention to concept formation). Lastly, we demonstrate how one newly-enhanced visualisation tool assists exploratory analysis to generate insights about genre and social variables in digital text collections including early modern correspondence and international Englishes.

TA 2: Identifying complex meanings in historical texts

Session 7, Friday 9 September
ft. Mehl, Recchia, Makela, et al.

With recent advances in computational tools and techniques, researchers are moving closer to the goal of identifying and describing complex meanings—semantic, discursive, social, and otherwise—in historical texts. This session approaches that goal from multiple angles. We discuss semantic meaning in terms of distributional semantic techniques, which connect the study of meaning in the humanities with the quantitative study of language in computational linguistics. We discuss discursive meaning via topic modelling techniques, and also explore the theoretical space between distributional semantics and topic modelling. Finally, we discuss social and historical meanings by looking at possibilities for analysing extra-linguistic contexts alongside linguistic data, within carefully annotated, structured data sets.


If that’s whetted your appetite, you will find full abstracts for each paper – and for every paper in the Congress – on the main DHC site.

Last registration date is 7 September.

QPL Semantic Spaces Workshop, University of Strathclyde

The workshop day, titled ‘Semantic Spaces at the Intersection of NLP, Physics, and Cognitive Science’, was part of a larger Quantum Physics and Logic (QPL) conference held at the University of Strathclyde. The workshop focussed on computational approaches to modelling semantics and semantic relations in language. The day was divided into three parts: the first session was concerned with the application of principles derived from physics and formal logic to the expression of linguistic phenomena; the middle section carried this over into consideration of Natural Language Processing (NLP); whilst the final section covered the views of semantics taken in cognitive science and cognitive linguistics. My interest in attending was to get an idea of the approach which ‘hard science’ is taking to aspects of semantics which overlap with the research of the Linguistic DNA project, as well as to see whether there was anything that we might be able to apply to our own work.

A subject which struck a chord was the discussion of vector space modelling, which is near the top of our list of topics to be implemented as we approach the point where we move from identifying word pairs to establishing clusters of related words. The subject was touched on in several of the papers, with particular relevance to the final paper of the day, in which Stephen McGregor described work done by himself and colleagues to locate ‘subspaces’ within vector space models which delineate an analogical relationship between different words. Beginning with an SAT-style statement that ‘dog is to cat as puppy is to kitten’, the paper used PMI (pointwise mutual information) measurements as a basis on which to plot these words in vector space, and then examined the geometrical relationship of the points to demonstrate how it might be possible to define a subspace within the vector space and thus automatically identify the positions of analogical partner words or concepts.
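
To make the idea concrete, here is a minimal sketch in Python. It is not McGregor's subspace method: it swaps in the simpler, well-known vector-offset approach to analogies, and it assumes sentence-level co-occurrence and positive PMI weighting. The toy sentences and the function names (`ppmi_vectors`, `complete_analogy`) are invented purely for illustration.

```python
import math
from collections import Counter
from itertools import permutations


def ppmi_vectors(sentences):
    """Build positive-PMI vectors from sentence-level co-occurrence (an assumption)."""
    n = len(sentences)
    word_counts = Counter()
    pair_counts = Counter()
    for sentence in sentences:
        tokens = sorted(set(sentence.split()))
        word_counts.update(tokens)
        # Count every ordered pair of distinct words sharing a sentence.
        pair_counts.update(permutations(tokens, 2))
    vocab = sorted(word_counts)
    vectors = {}
    for word in vocab:
        row = []
        for context in vocab:
            joint = pair_counts[(word, context)]
            if joint == 0:
                row.append(0.0)
                continue
            pmi = math.log2(joint * n / (word_counts[word] * word_counts[context]))
            row.append(max(pmi, 0.0))  # keep only positive associations
        vectors[word] = row
    return vocab, vectors


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def complete_analogy(vectors, a, b, c):
    """'a is to b as c is to ?': return the word nearest to b - a + c."""
    target = [vb - va + vc for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


# Invented toy corpus: one two-word 'sentence' per line.
TOY_SENTENCES = [
    "dog barks", "puppy barks", "cat purrs", "kitten purrs",
    "puppy young", "kitten young", "dog adult", "cat adult",
]

vocab, vectors = ppmi_vectors(TOY_SENTENCES)
print(complete_analogy(vectors, "dog", "cat", "puppy"))  # kitten
```

In this deliberately tidy toy corpus the offset lands exactly on ‘kitten’; with real historical data the geometry would be far noisier, which is part of what makes the subspace idea interesting.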

The NLP section of the workshop was dominated by Categorical Compositional Distributional semantics and the ways in which researchers using this approach are mapping the emergence of meaning from syntactic structure. The morning’s physics papers had discussed in some detail the application of formal logic expressions to sentence semantics, describing, for example, the way in which a transitive verb combines with subject and object nouns to ‘output’ the meaning of the sentence. These papers applied this theoretical approach to specific sentence elements, such as Dimitri Kartsaklis’ analysis of coordination and Mehrnoosh Sadrzadeh’s study of quantifiers. To me, these papers chimed with work Seth has been doing, considering the importance of handling different parts of speech in different ways during processing; they made clear the flaws in the so-called ‘bag-of-words’ approach to computational linguistics and highlighted that, in the long run, consideration of syntax should be an important part of the kind of computational semantics we’re undertaking.
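
As a rough illustration of that last point, the sketch below treats a transitive verb as a three-way array that is contracted with a subject vector and an object vector to give a sentence vector, which is the general shape of the compositional approach as I understood it. The two-dimensional noun space, the one-dimensional ‘plausibility’ sentence space, and all the numbers are invented for the example and are not taken from any of the papers.

```python
def transitive_sentence(subject, verb, obj):
    """Contract verb[i][k][j] with subject[i] and obj[j]; return the sentence vector over k."""
    sentence_dim = len(verb[0])
    return [
        sum(
            subject[i] * verb[i][k][j] * obj[j]
            for i in range(len(subject))
            for j in range(len(obj))
        )
        for k in range(sentence_dim)
    ]


# Hypothetical noun space with two dimensions: (animal, food).
dogs = [1.0, 0.0]
bones = [0.0, 1.0]

# Hypothetical verb 'eat': plausible only with an animal subject and a food object.
# Indexing is eat[subject_dim][sentence_dim][object_dim].
eat = [
    [[0.0, 1.0]],  # animal subject: implausible with animal object, plausible with food
    [[0.0, 0.0]],  # food subject: implausible whatever the object
]

print(transitive_sentence(dogs, eat, bones))  # [1.0]
print(transitive_sentence(bones, eat, dogs))  # [0.0]
```

A bag-of-words model would score ‘dogs eat bones’ and ‘bones eat dogs’ identically, since both contain the same three words; here the order of the arguments changes the result.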

Also of special interest was Peter Gärdenfors’ consideration of domains as components of word meanings. In the main, the point was illustrated through consideration of nouns (although touching on other parts of speech), asking whether it might be helpful to think of words as fundamentally dependent on domains such as place, shape, and temperature (so that, for example, ‘round’ maintains some connection to the sense of a curve in physical space even when not used as a noun, whilst most ‘verbed’ nouns retain important connections to their parent’s referent). Whilst bearing mostly indirect applicability to current LDNA work, this discussion is important food for thought, especially for its potential impact on the encyclopedic aspect of a word’s semantics in context.

The workshop provided a thought-provoking day to a relative outsider, offering an important viewpoint on the other approaches to semantics which are being pioneered outside of arts faculties, an awareness which can only strengthen our own work. I’d like to thank the organisers and contributors to the workshop for a hugely interesting and intellectually engaging day.

From Spring to Summer: LDNA on the road

June 2016:
For the past couple of months, our rolling horizon has looked increasingly full of activity. This new blogpost provides a brief update on where we’ve been and where we’re going. We’ll be aiming to give more thorough reports on some of these activities after the events.

Where we’ve been

Entrance to University Museum, Utrecht

In May, Susan, Iona and Mike travelled to Utrecht, at the invitation of Joris van Eijnatten and Jaap Verheul. Together with colleagues from Sheffield’s History Department, we presented the different strands of Digital Humanities work ongoing at Sheffield. We learned much from our exchanges with Utrecht’s AsymEnc and Translantis research programs, and enjoyed shared intellectual probing of visualisations of change across time. We look forward to continued engagement with each other’s work.

A week later, Seth and Justyna participated in This&THATCamp at the University of Sussex (pictured), with LDNA emerging second in a popular poll of topics for discussion at this un-conference-style event. Productive conversations across the two days covered data visualisation, data manipulation, text analytics, digital humanities and even data sonification. We hope to hear more from Julie Weeds and others when the LDNA team return to Brighton in September.

Next week, we’ll be calling on colleagues at the HRI to talk us through their experience visualising complex humanities data. Richard Ward (Digital Panopticon) and Dirk Rohman (Migration of Faith) have agreed to walk us through their decision-making processes, and talk through the role of different visualisations in exploring, analysing, and explaining current findings.

Where we’re going

The LDNA team are also gearing up for a summer of presentations:

  • Justyna Robinson will be representing LDNA at Sociolinguistics Symposium (Murcia, 15-18 June), as well as sharing the latest analysis from her longitudinal study of semantic variation focused on polysemous adjectives in South Yorkshire speech. Catch LDNA in the general poster session on Friday (17th), and Justyna’s paper at 3pm on Thursday. #SS21
  • Susan Fitzmaurice is in Saarland, as first guest speaker at the Historical Corpus Linguistics event hosted by the IDeaL research centre, also on Thursday (16th June) at 2:15pm. Her paper is subtitled “Discursive semantics and the quest for the automatic identification of concepts and conceptual change in English 1500-1800”. #IDeaL
  • In July, the Glasgow LDNA team are Krakow-bound for DH2016 (11-16 July). The LDNA poster, part of the Semantic Interpretations group, is currently allocated to Booth 58 during the Wednesday evening poster session.
  • Later in July, Iona heads to SHARP 2016 in Paris (18-22). This year, the bi-lingual Society are focusing on “Languages of the Book”, with Iona’s contribution drawing on her doctoral research (subtitle: European Borrowings in 16th and 17th Century English Translations of “the Book of Books”) and giving attention to the role of other languages in concept formation in early modern English (a special concern for LDNA’s work with EEBO-TCP).
  • In August, Iona is one of several Sheffield early modernists bound for the Sixteenth Century Society Conference in Bruges. In addition to a paper in panel 241, “The Vagaries of Translation in the Early Modern World” (Saturday 20th, 10:30am), Iona will also be hosting a unique LDNA poster session at the book exhibit. (Details to follow)
  • The following week (22-26 August), Seth, Justyna and Susan will be at ICEHL 19 in Essen. Seth and Susan will be talking LDNA semantics from 2pm on Tuesday 23rd.

Back in the UK, on 5 September, LDNA (and the University of Sussex) host our second methodological workshop, focused on data visualisation and linguistic change. Invitations to a select group of speakers have gone out, and we’re looking forward to a hands-on workshop using project data. Members of our network who would like to participate are invited to get in touch.

And back in Sheffield, LDNA is playing a key role in the 2016 Digital Humanities Congress, 8-10 September, hosting two panel sessions dedicated to textual analytics. Our co-speakers include contacts from Varieng and CRASSH.  Early bird registration ends 30th June.

Operationalising concepts (Manifesto pt. 3 of 3)

Concepts Slide

Properties of concepts, from Susan Fitzmaurice’s presentation

This blog post completes our series of three extracts from Susan Fitzmaurice’s paper on “Concepts and Conceptual Change in Linguistic DNA”. (See parts 1 and 2.)

The supra-lexical approach to the process of concept recognition that I’ve described depends upon an encyclopaedic perspective on semantics (e.g. cf. Geeraerts, 2010: 222-3). This is fitting as ‘encyclopaedic semantics is an implicit precursor to or foundation of most distributional semantics or collocation studies’ (Mehl, p.c.). However, such studies do not typically pause to model or theorise before conducting analysis of concepts and semantics as expressed lexically. In other words, semasiological (and onomasiological) studies work on the premise of ready-made or at least ready lexicalised concepts, and proceed from there. This means that although they depend upon the prior application of encyclopaedic semantics, they themselves do not need to model or theorise this semantics because it belongs to the cultural messiness that yields the lexical expressions that they then proceed to analyse.

For LDNA, concepts are not discrete or componential lexical semantic meanings; neither are they abstract or ideal. Instead, they consist of associations of lexical/phrasal/constructional semantic and pragmatic meanings in use.
This encyclopaedic perspective suggests the following operationalisation of a concept for LDNA:

  1. Concepts resemble encyclopaedic meanings (which are temporally and culturally situated chunks of knowledge about the world expressed in a distributed way) rather than discrete or componential meanings. [This coincides with non-modular theories of mind, which adopt a psychological approach to concepts.]
  2. Concepts can be expressed in texts by (typically a combination of) words, phrases, constructions, or even by implicatures or invited inferences (and possibly by textual absences).
  3. Concepts are traceable in texts primarily via significant syntagmatic (associative) relations (of words/phrases/constructions/meanings) and secondarily via significant paradigmatic (alternate) relations (of words/phrases/constructions/meanings).
  4. A concept in a given historical moment might not be encapsulated in any observed word, phrase, or construction, but might instead only be observable via a complete set of words, phrases, or constructions in syntagmatic or paradigmatic relation to each other.
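
Points 3 and 4 can be sketched in code. The Python below is only a toy under stated assumptions, not the LDNA processing pipeline: raw co-occurrence counts within a small window stand in for a proper significance measure, connected groups of strongly linked words stand in for candidate concepts, and the texts, function names, and thresholds are all invented for illustration.

```python
from collections import Counter, defaultdict


def window_pairs(tokens, window=3):
    """Count unordered pairs of distinct words within `window` tokens of each other."""
    counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                counts[frozenset((left, right))] += 1
    return counts


def candidate_groups(texts, window=3, min_count=2):
    """Link words that co-occur at least `min_count` times; return connected groups."""
    totals = Counter()
    for text in texts:
        totals.update(window_pairs(text.lower().split(), window))
    graph = defaultdict(set)
    for pair, count in totals.items():
        if count >= min_count:
            a, b = pair
            graph[a].add(b)
            graph[b].add(a)
    groups, seen = [], set()
    for start in sorted(graph):
        if start in seen:
            continue
        # Depth-first walk over the syntagmatic links from this word.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(graph[node] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Invented toy 'texts', already reduced to content words.
TOY_TEXTS = [
    "people voice rulers",
    "voice people commons",
    "rulers people voice",
    "bread wine altar",
    "altar bread wine",
]

print(candidate_groups(TOY_TEXTS, window=2, min_count=2))
# [['altar', 'bread', 'wine'], ['people', 'rulers', 'voice']]
```

Neither group contains a single word that names it: each is observable only as a set of words in relation to one another, which is the situation point 4 describes.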

It is worth noting, however, that concept recognition is particularly difficult (for the automatic processes built into LDNA) because it ordinarily depends upon the level of cultural literacy possessed by a reader. While we cannot incorporate this quality as a process, we can take it into account by testing distant reading through close reading.

As well as being encyclopaedic, our approach is also experiential, in that the conceptual structure of early modern discourse is a reflection of the way early modern people experienced the world around them. That discourse presents a particular subjective view of the world with the hierarchical network of preferences which emerges as a network of concepts in discourse. In this way we also assume a perspectival nature of concept organisation.

Concluding remarks: Testing and tracking conceptual change across time and style

All being well, if we succeed in visualising the results of an iterative and developing set of procedures to inspect the data from these large corpora, we hope to be able to discern and locate the emergence of concepts in the universe of early modern English print. A number of questions arise about where and how these will show up.

For instance, following our hypothesis, will we see the cementation of a concept in the persistent co-occurrence in particular contexts of candidate conjuncts (both binomials and alternates), bigrams, and ultimately, ‘keywords’? (e.g. ‘man of business’ → ‘businessman’ in late Modern English newspapers)

And, as part of the notion of context, it is worth considering the role of discourse genre in the emergence of a concept and in conceptual change. For instance, if it is the case that a concept emerges, not as a keyword, but in the form of an association of expressions that functions as a loose paraphrase, is this kind of process more likely to occur in a specific discourse genre than in general discourse? In other words, is it possible that technical or specialist discourses will be the locus of new concepts, concepts which might diffuse gradually into public and more general ones? (e.g. dogma, law, science → newspapers, narrative, etc.)

What we hope to do is to make our approach manifest and our results visual. For instance, the emergence of a concept might be envisaged as clusters of texts rising up on a terrain that represents a certain feature. We should also remember that these clusters might not simply rise and fall gradually across the terrain over time; there might instead be islands of certain features that appear in distant time periods, disparate genres, or sub-genres. All of that can be identified by the computer, but we have to make sense of it as close readers afterwards.


Geeraerts, Dirk. 2010. Theories of Lexical Semantics. Oxford: OUP.