
Looking back, looking forward: Linguistic DNA in 2016 and 2017

As we move into 2017, we’ve been looking back at achievements in 2016, and ahead to what we aim to achieve in the coming year.

2016 was an outwardly busy year as we travelled to Bruges, Essen, Krakow, Lausanne, Leeds, Brighton, Murcia, Nottingham, Paris, Saarbrucken, and Utrecht, sharing more of our thinking and early data with different audiences. Closer to “home”, we benefitted from the exchange of ideas with LDNA-hosted panels at Sheffield DH Congress and our second methodological workshop in Sussex. In 2017, we will be focusing back on our interface development and some more in-depth research, though we intend to be present at DH, SHEL, ICAME and SHARP, in order to continue some fruitful conversations.

On the blog, we have been reflecting on representativeness and the nature of EEBO-TCP. We’ve also documented our decision not to use ECCO’s OCR data to analyse eighteenth-century print. You can expect to hear about the alternative eighteenth-century datasets we’re choosing to work with later in 2017.

During the Autumn, the LDNA researchers collaborated on two articles about the project, its theory and praxis, both (hopefully) to be published this year following peer review. Generating examples from each research theme based on our early data and tying these together effectively was an enjoyable challenge, and we have already used the draft of one piece as part of our briefing materials for upcoming MA placements at The Digital Humanities Institute | Sheffield (formerly known as HRI Digital).

In the past six months, the Sheffield team have captured funding for two additional applications of the Linguistic DNA “concept modelling” tools:

  • The ESRC project Ways of Being in a Digital Age combines our quantitative insights with a qualitative literature survey of academic publications. Scheduled to inform the ESRC’s next programme of digital society funding, this impact-full study has compelled us toward rapid prototype development. The interface being put together to serve ‘WoBDA’ colleagues will also form the kernel of the subsequent LDNA workbench.
  • From next month, we are involved in another funded impact-related project, collaborating with the University of Leeds to explore the conceptual structure of millions of YouTube video comments on the theme of militarisation, as part of a larger project funded by the Swedish Research Council. This is a six-month commitment, bringing in a further research associate to theorise what’s involved in applying our measures to some very different data.

We also have three significant applications in place for other pots of funding, including Horizon 2020 collaborations, attesting to our confidence in our nascent processes and in the multifarious opportunities for their application and impact.

Meanwhile, Glasgow has been using the current word co-occurrence data to develop its methodology for investigating processor data from the perspective of key Historical Thesaurus categories. We have continued to develop our analysis of Thesaurus categories, looking for those which show abnormal growth or decline; a provisional methodology for establishing statistical ‘baselines’ has been mapped out and is now being implemented and refined. Further possibilities are being tested, such as amalgamating data across whole layers of the HT hierarchy rather than by individual category, and the effects of separating out parts of speech within categories or layers.
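
To make the ‘baseline’ idea concrete, here is a minimal sketch of the kind of test involved. The per-decade counts and the two-standard-deviation threshold are invented for illustration; this is not the project’s methodology.

```python
# Minimal sketch of a statistical 'baseline' test for abnormal growth or
# decline in a Historical Thesaurus category. Counts and threshold are
# invented for illustration only.
from statistics import mean, stdev

# Hypothetical counts of new words first attested in one HT category, per decade.
category_counts = {1500: 4, 1510: 5, 1520: 3, 1530: 6, 1540: 5,
                   1550: 4, 1560: 18, 1570: 5, 1580: 6, 1590: 1}

baseline = mean(category_counts.values())
spread = stdev(category_counts.values())

for decade, count in sorted(category_counts.items()):
    z = (count - baseline) / spread
    if abs(z) > 2:  # flag decades more than two standard deviations from the baseline
        label = "growth" if z > 0 else "decline"
        print(f"{decade}s: {count} new words (z = {z:+.1f}) -- abnormal {label}")
```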

The Historical Thesaurus of English and its Related Projects

One of the resources which the Linguistic DNA project is drawing on is the Historical Thesaurus of English. Organising every word in the language, present and past, into a hierarchical structure based on word-meaning, the Historical Thesaurus is an invaluable tool for historical semantic research. The data from the Thesaurus will be involved in the internal workings of the parser program being developed at the Sheffield Humanities Research Institute (HRI), and will be present in the annotated EEBO and ECCO corpora with which the parser is working.


Structure of the Thesaurus


At its top level, the Historical Thesaurus breaks the vocabulary of English into three main categories – ‘The external world’, ‘The mental world’, and ‘The social world’. These are further subdivided so that, for example, ‘The external world’ contains within it the categories ‘The earth’, ‘Life’, ‘Health and disease’, ‘People’, and ‘Animals’, amongst others. This subdivision continues to a maximum depth of seven levels, with a category number being assigned at each level. As a result, the category ‘Daily record/journal’, for instance, has the category number 03.09.06.01.02 (noun), comprised of the following steps:

03                Society
03.09             Communication
03.09.06          Record
03.09.06.01       Written record
03.09.06.01.02    Daily record/journal

This category, then, contains all the recorded words for a journal (journal, day-book, diary, memorial, ephemeris, diurnal, journal-book, diet-book), accompanied by the date ranges in which those words are known to have been used.
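
Because each level of the hierarchy contributes one segment to the category number, a category’s full ancestry can be read straight off its code. A minimal illustrative sketch follows; the labels are those of the example above, but the dictionary and function are our own, not Thesaurus data.

```python
# Reading a category's ancestry straight off its Historical Thesaurus number.
# Labels come from the 'Daily record/journal' example above; the code itself
# is an illustration, not part of the Thesaurus data model.
LABELS = {
    "03": "Society",
    "03.09": "Communication",
    "03.09.06": "Record",
    "03.09.06.01": "Written record",
    "03.09.06.01.02": "Daily record/journal",
}

def ancestry(code):
    """Return the chain of category codes from the top level down to `code`."""
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

for step in ancestry("03.09.06.01.02"):
    print(f"{step:<18}{LABELS.get(step, '?')}")
```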

Each of the seven main category levels also contains subcategories where required, allowing an exceptionally fine-grained organisation of the semantic content of the language. The Thesaurus provides, therefore, a rich seam to be mined for information on lexical and conceptual development in the English-speaking world for the past two millennia.


Related Projects


Since the completion of the first edition of the Thesaurus in 2009, projects have begun to drill down into its data. A major project, Mapping Metaphor with the Historical Thesaurus, investigated every word in the Thesaurus in order to identify systematic metaphorical links between categories. Its primary output is a complete ‘metaphor map’ of the language, which provides fascinating insights into the ways in which certain concepts are discussed in terms of others. It also demonstrates strikingly just how prevalent metaphor is in the language at the level of individual words.

The SAMUELS project utilised the Thesaurus in an entirely different way, aiming to create semantic tagging software capable of labelling every word in a text with the code of the category in which that word sense can be found in the Thesaurus. This is no easy feat, given that some words have several hundred potential meanings – ‘set’, for example, has 345 entries (not including those where it is part of a multi-word phrase), whilst ‘run’ has 302. The semantic tagging tool was created and tested by a consortium of researchers based at the Universities of Lancaster, Glasgow, and Huddersfield, and the University of Central Lancashire. It is currently the only software capable of assigning word meanings based on dating information, and this diachronic tagging ability allows it to be used on texts such as those contained in EEBO and ECCO with a high degree of accuracy. The tagged Hansard corpus, comprising all the speeches made in the British Houses of Parliament between 1803 and 2005, is publicly available via Mark Davies’ corpus website at Brigham Young University, with the tagged EEBO corpus to follow.
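
The tagger itself is considerably more sophisticated, but the core advantage of dating information can be shown with a toy sketch: restrict a word’s candidate senses to those attested at the text’s date. The senses, category codes, and attestation ranges below are invented for the example; this is not the SAMUELS tagger.

```python
# Toy illustration of date-aware sense selection (not the SAMUELS tagger).
# Senses, category codes, and attestation ranges are invented.
SENSES = {
    "mouse": [
        ("rodent",          "01.05.xx.xx", (800, 2005)),
        ("pointing device", "03.10.xx.xx", (1965, 2005)),
    ],
}

def candidate_senses(word, text_year):
    """Keep only senses whose attestation range covers the text's date."""
    return [(gloss, code) for gloss, code, (start, end) in SENSES.get(word, [])
            if start <= text_year <= end]

print(candidate_senses("mouse", 1600))  # only the rodent sense is available
print(candidate_senses("mouse", 1990))  # both senses remain; context must decide
```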


The Historical Thesaurus and Linguistic DNA


The output of the SAMUELS project forms a large part of the input to Linguistic DNA, in the form of the semantically-tagged EEBO corpus. It is hoped that the annotation of every word in the EEBO corpus with a Historical Thesaurus meaning code will allow more accurate automatic evaluation of word groupings which might constitute the kinds of concept that LDNA is looking to identify. The exact method in which this might be employed in the development of the HRI’s parser software is something the project will explore in the coming months.

One of the most interesting ways in which the Historical Thesaurus can be used in the Linguistic DNA project is to contrast the concepts constructed by the parser with the categories of the Thesaurus. This will allow the team to identify where the automatically identified concepts and their associated lexis accord with the decisions made manually by a team of lexicographers. Differences will, therefore, potentially provide areas for further research, perhaps indicating facets of the parser which can be improved or instigating evaluation of lexicographical data in light of the findings from the EEBO and ECCO textual corpora.
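
As a very rough illustration of what such a comparison might involve, one could measure the lexical overlap between a parser-derived grouping and a Thesaurus category. Both word sets below are illustrative (the category words are borrowed from the ‘Daily record/journal’ example earlier in this post, the parser ‘concept’ is invented), and set overlap is only one of many possible measures.

```python
# Rough sketch of comparing a parser-derived word grouping with a Historical
# Thesaurus category via lexical overlap. Both word sets are illustrative.
ht_category = {"journal", "day-book", "diary", "memorial",
               "ephemeris", "diurnal", "journal-book", "diet-book"}
parser_concept = {"journal", "diary", "diurnal", "register", "chronicle"}

overlap = ht_category & parser_concept
jaccard = len(overlap) / len(ht_category | parser_concept)

print(f"shared lexis: {sorted(overlap)}")
print(f"Jaccard similarity: {jaccard:.2f}")
# Words in the parser's concept but not in the category (e.g. 'chronicle')
# are the kind of differences that would prompt further scrutiny.
```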

The Glasgow-based team working on Research Theme 3 is also interested in approaching the Thesaurus categories from a more statistical viewpoint, looking for significant shifts in the size of the vocabulary associated with Thesaurus semantic categories, investigating the words involved in the context of their use in EEBO.

These are strong starting points for the use of the Historical Thesaurus as part of the investigation procedures of the Linguistic DNA project, though they are not exhaustive and the project team is always open to new angles of employing EEBO data and Thesaurus data in combination.

Word-cloud for this blog post (generated with Wordle)

Liest thou, or hast a Rewme? Getting the best from VARD and EEBO

This post from August 2015 continues the comparison of VARD and MorphAdorner, tools for tackling spelling variation in early modern English. (See earlier posts here and here.) As of 2018, data on our public interface was prepared with an updated version of MorphAdorner and some additional curation from Martin Mueller at Northwestern.


This week, we’ve replaced the default VARD set-up with a version designed to optimise the tool for EEBO. In essence, this includes a lengthier set of rules to guide the changing of letters, and lists of words and variants that are more suited to the early modern corpus.

It is important to bear in mind that the best use of VARD involves someone ‘training’ it, supervising and to a large extent determining the correct substitutions. But because Linguistic DNA is tackling the whole EEBO-TCP corpus, and the mass of documents within it is far from homogeneous, it is difficult to carry out that optimisation effectively.

Doth VARD recognise the second-person singular?

A first step with the EEBO set-up was to revisit what we had established about how VARD handles verb conjugations for the second and third person singular. A custom text was written to test the output (using the 50% threshold for auto-normalisation, as previously):

If he lieth thou liest. When she believeth thou leavest. 
If thou believest not, he leaveth. Where hast thou been? 
When hadst thou gone? Where hath he walked? 
Where goest thou? Where goeth he?
What doth he? What doeth he? What dost thou? 
What doest thou? What ist? What arte doing?

Most of the forms were modernised just as described in the previous post. However, some of the output gave cause for concern. In the first sentence, “liest” became “least”. Further on “goest” became “goosed”, “doest” was accepted as a non-variant, while both “hast” and “dost” were highlighted as unresolved variants. This output can be explained by looking at the variant and word lists and the statistical measures VARD uses.

VARD’s use of variants and word frequencies

Scrutinising the word and variant lists within the EEBO set-up showed that although the variant list recorded “doest” as an instance of “dost”, “doest” and not “dost” appeared in the word list, overriding that variant. Similarly, “ha’st” appears in the variant list as a form of “hast”, but “hast” is not in the word list. It is not difficult to add items to the word list, but the discrepancies in the list contents are surprising. In fact, it might be more appropriate for VARD to record “doest” as a variant of “do”, and “ha’st” of “have”.
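
Our reading of that behaviour, in simplified form: a token found in the word list is accepted as it stands, pre-empting any mapping recorded for it in the variant list. The sketch below is our interpretation, not VARD’s code; the entries are limited to the examples just discussed, and candidate scoring (covered next) is ignored.

```python
# Simplified reading of the word-list/variant-list interaction described above
# (our interpretation, not VARD's actual code; candidate scoring is ignored).
WORD_LIST = {"doest", "room", "lie"}                 # forms accepted as-is
VARIANT_LIST = {"doest": "dost", "ha'st": "hast"}    # recorded normalisations

def lookup(token):
    if token in WORD_LIST:
        return f"{token}: accepted as non-variant (word list overrides variant list)"
    if token in VARIANT_LIST:
        return f"{token}: known variant of '{VARIANT_LIST[token]}'"
    return f"{token}: unresolved variant"

for t in ("doest", "ha'st", "hast"):
    print(lookup(t))
```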

For “liest”, the correct variant and word entries are present so that “liest” can be amended to “lie”, giving a known variant [KV] recall score of 100% (indicating this is not a known variant form of any other word). However, the default parameters (regardless of the F-score) favour “least” because that amendment strongly satisfies the other three criteria: letter replacement [LR] (the rules), phonetic matching [PM], and edit distance [ED]. Until human judgment intervenes with the weighting, “least” has the better statistical case. (Much the same applies to “goest” and “goosed”.)
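
Schematically, the competition can be thought of as a weighted combination of the four scores. The figures and equal weights below are invented purely to illustrate the shape of the problem; they are not VARD’s internal values.

```python
# Schematic view of the candidate competition for 'liest'. Scores and equal
# weights are invented for illustration; they are not VARD's internal values.
candidates = {
    #         known-variant  letter-rule  phonetic  edit-distance
    "lie":   {"KV": 1.0, "LR": 0.2, "PM": 0.3, "ED": 0.3},
    "least": {"KV": 0.0, "LR": 0.9, "PM": 0.8, "ED": 0.7},
}
weights = {"KV": 1.0, "LR": 1.0, "PM": 1.0, "ED": 1.0}  # equal until training intervenes

def combined(scores):
    return sum(weights[k] * v for k, v in scores.items()) / sum(weights.values())

for word, scores in candidates.items():
    print(f"{word}: {combined(scores):.2f}")
# With equal weights 'least' wins (0.60 vs 0.45); increasing the weight on the
# known-variant score -- the effect of human training -- tips the balance to 'lie'.
```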

In VARD’s defence, one need only intervene with any of the “-st” verb endings in the text once (before triggering the auto-normalisation process) for the weighting to shift in favour of “liest”. VARD learns well.

Rewme: space, cold, or dominion?

One of the ‘authentic’ EEBO extracts we’ve been testing with is taken from a medical text, A rich store-house or treasury for the diseased, 1596 (TCP A13300). As mentioned in a previous post, employing VARD’s automated normalisation with the default 50% threshold, references to “Rewme” become “Room”. Looking again at what is happening beneath the surface, the first surprise is that there is an entry for “rewme” in the variant list, specifying it as a known variant of “room”. This is unsatisfactory with regard to EEBO-TCP: a search of the corpus shows that the word form “rewme” appears in 89 texts. Viewing each instance through Anupam Basu’s keyword-in-context interface shows that in 84 texts, “rewme” is used with the meaning “rheum”. Of the other five texts, one is Middle English biblical exegesis (attributed to John Purvey); committed to print as late as 1550, the text repeatedly uses “rewme” with the sense “realm” or “kingdom” (both earthly and divine). The remaining four were printed after 1650 and are either political or historical in intent, similarly using “rewme” as a spelling of “realm”. Nowhere in EEBO-TCP does “rewme” appear with the sense “room”. However, removing it from the known variants (by setting its value to zero) and adding new variant entries for realm and rheum does not result in the desired auto-normalisation: the fact that both realm and rheum are candidates means their KV recall score is halved (50%). At the same time, the preset frequencies strengthen room’s position (309) compared with realm (80) and rheum (50). In fact, the word list accompanying the EEBO set-up seems still to be based on the BNC corpus, featuring robotic (OED 1928) and pulsar (OED 1968) with the same preset frequency as rheum.

So what does this mean for Linguistic DNA?

Again, it is possible to intervene with instances like rewme, whether through the training interface or by manipulating the frequencies. But it is evident that the scale of intervention required is considerable, and it is not obvious that telling VARD that rewme is rheum about 90% of the time it occurs in EEBO-TCP, and realm 10% of the time, would do anything to help the auto-normalisation process decide which form belongs where in practice.

The frustrating thing is that the distribution is predictable: in a political text, it is normally “realm”; and in a medical text, it is “rheum”. But VARD seems to have no mechanism to recognise or respond to the contexts of discourse that would so quickly become visible with topic modelling. (Consider the clustering of the four humours in early modern medicine, for example.) I have a feeling this would be where SAMUELS and the Historical Thesaurus come in… if only SAMUELS didn’t rely on VARD’s prior intervention!
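
That kind of context-sensitivity is easy to sketch, even though VARD offers no hook for it. The toy example below chooses between ‘rheum’ and ‘realm’ by counting domain cue words in a window around each occurrence; the cue lists, window size, and sample sentence are all our own invention.

```python
# Rough sketch of context-based disambiguation for 'rewme' -- not a VARD
# feature. Cue-word lists, window size, and sample sentence are invented.
MEDICAL_CUES = {"humour", "phlegm", "choler", "melancholy", "disease", "physic"}
POLITICAL_CUES = {"king", "parliament", "crown", "law", "kingdom", "prince"}

def disambiguate(tokens, index, window=10):
    """Guess 'rheum' or 'realm' for tokens[index] from nearby cue words."""
    context = set(tokens[max(0, index - window): index + window + 1])
    medical = len(context & MEDICAL_CUES)
    political = len(context & POLITICAL_CUES)
    return "rheum" if medical >= political else "realm"

sample = "this rewme proceedeth of cold humour and phlegm in the head".split()
print(disambiguate(sample, sample.index("rewme")))  # -> rheum
```

In practice one would want something richer, such as topic models or the SAMUELS tagger itself, but even a crude signal of this sort captures the discourse-level distinction described above.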


Wordcloud image created with Wordle.