One of the resources on which the Linguistic DNA project draws is the Historical Thesaurus of English. The Historical Thesaurus organises every word in the language, present and past, into a hierarchical structure based on word meaning, which makes it an invaluable tool for historical semantic research. Data from the Thesaurus will be used in the internal workings of the parser programme being developed at the Sheffield Humanities Research Institute (HRI), and will also appear in the annotated EEBO and ECCO corpora with which the parser is working.
Structure of the Thesaurus
At its top level, the Historical Thesaurus breaks the vocabulary of English into three main categories – ‘The external world’, ‘The mental world’, and ‘The social world’. These are further subdivided so that, for example, ‘The external world’ contains within it the categories ‘The earth’, ‘Life’, ‘Health and disease’, ‘People’, and ‘Animals’, amongst others. This subdivision continues to a maximum depth of seven levels, with a category number being assigned at each level. As a result, the category ‘Daily record/journal’, for instance, has the category number 03.09.06.01.02 (noun), which ends with the following steps:
03.09.06.01 Written record
03.09.06.01.02 Daily record/journal
This category, then, contains all the recorded words for a journal (journal, day-book, diary, memorial, ephemeris, diurnal, journal-book, diet-book), accompanied by the date ranges in which those words are known to have been used.
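Because each level adds one dot-separated component to the category number, the chain of parent categories can be read straight off the number itself. The following is a minimal sketch of that idea; the function name `ancestor_codes` is an assumption made for illustration and is not part of any Thesaurus software.

```python
def ancestor_codes(code: str) -> list[str]:
    """Return every category number on the path from the top level to `code`.

    Assumes the dotted format described above, e.g. "03.09.06.01.02".
    """
    parts = code.split(".")
    # Rebuild each prefix: one component, then two, and so on.
    return [".".join(parts[:depth]) for depth in range(1, len(parts) + 1)]


path = ancestor_codes("03.09.06.01.02")
for level, category_number in enumerate(path, start=1):
    print(level, category_number)
```

For ‘Daily record/journal’ this lists five category numbers, the fourth being 03.09.06.01 (‘Written record’) and the fifth the category itself.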
Each of the seven main category levels also contains subcategories where required, allowing an exceptionally fine-grained organisation of the semantic content of the language. The Thesaurus provides, therefore, a rich seam to be mined for information on lexical and conceptual development in the English-speaking world for the past two millennia.
Since the completion of the first edition of the Thesaurus in 2009, projects have begun to drill down into its data. A major project, Mapping Metaphor with the Historical Thesaurus, investigated every word in the Thesaurus in order to identify systematic metaphorical links between categories. Its primary output is a complete ‘metaphor map’ of the language, which provides fascinating insights into the ways in which certain concepts are discussed in terms of others. It also demonstrates strikingly just how prevalent metaphor is in the language at the level of individual words.
The SAMUELS project used the Thesaurus in an entirely different way, aiming to create semantic tagging software capable of labelling every word in a text with the code of the Thesaurus category in which that word sense can be found. This is no easy feat, given that some words have several hundred potential meanings – ‘set’, for example, has 345 entries (not including those where it is part of a multi-word phrase), whilst ‘run’ has 302. The semantic tagging tool was created and tested by a consortium of researchers based at Lancaster University, the University of Glasgow, the University of Huddersfield, and the University of Central Lancashire. It is currently the only software capable of assigning word meanings based on dating information, and this diachronic tagging ability allows it to be used with a high degree of accuracy on texts such as those contained in EEBO and ECCO. The tagged Hansard corpus, comprising all the speeches made in the British Houses of Parliament between 1803 and 2005, is publicly available via Mark Davies’ corpus website at Brigham Young University, with the tagged EEBO corpus to follow.
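To illustrate the principle of using dating information, the sketch below narrows a word's candidate senses to those attested at the date of the text being tagged. It is not the SAMUELS tagger, which must also choose between the senses that remain; the `Sense` class, the category codes, and the dates are all invented placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sense:
    category: str              # placeholder category code, not real Thesaurus data
    first_year: int            # first recorded use (invented)
    last_year: Optional[int]   # last recorded use; None means still current


def candidate_senses(senses: list[Sense], text_year: int) -> list[Sense]:
    """Keep only the senses whose date range covers the year of the text."""
    return [
        s for s in senses
        if s.first_year <= text_year
        and (s.last_year is None or text_year <= s.last_year)
    ]


# Three invented senses of one hypothetical word.
senses = [
    Sense("AA.01", 1300, 1550),
    Sense("BB.02", 1600, None),
    Sense("CC.03", 1500, 1700),
]

# For a text printed in 1650, the obsolete sense AA.01 is ruled out.
print([s.category for s in candidate_senses(senses, 1650)])
```

Even this simple filter shows why dating matters for early printed texts: a sense that had fallen out of use, or had not yet arisen, need not be considered at all.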
The Historical Thesaurus and Linguistic DNA
The output of the SAMUELS project forms a large part of the input to Linguistic DNA, in the form of the semantically-tagged EEBO corpus. It is hoped that annotating every word in the EEBO corpus with a Historical Thesaurus meaning code will allow more accurate automatic evaluation of the word groupings which might constitute the kinds of concept that LDNA is looking to identify. Exactly how this might be employed in the development of the HRI’s parser software is something the project will explore in the coming months.
One of the most interesting ways in which the Historical Thesaurus can be used in the Linguistic DNA project is to contrast the concepts constructed by the parser with the categories of the Thesaurus. This will allow the team to identify where the automatically identified concepts and their associated lexis accord with the decisions made manually by a team of lexicographers. Where the two differ, the differences may point to areas for further research, perhaps indicating facets of the parser which can be improved, or prompting re-evaluation of the lexicographical data in light of the findings from the EEBO and ECCO textual corpora.
The Glasgow-based team working on Research Theme 3 is also interested in approaching the Thesaurus categories from a more statistical viewpoint. This involves looking for significant shifts in the size of the vocabulary associated with particular Thesaurus semantic categories, and then investigating the words involved in the context of their use in EEBO.
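As a rough illustration of what measuring vocabulary size over time could involve, the sketch below counts how many words in a single category have a recorded date range overlapping each period. It is not the team's method: the words, the dates, and the 50-year periods are invented placeholders, and no test of statistical significance is attempted.

```python
# Each entry is (word, first recorded year, last recorded year) for one
# hypothetical Thesaurus category. All values are invented placeholders.
entries = [
    ("word_a", 1400, 1650),
    ("word_b", 1580, 1700),
    ("word_c", 1610, 1900),
]


def vocabulary_size(entries, start, end):
    """Count words whose date range overlaps the period start..end (inclusive)."""
    return sum(1 for _, first, last in entries if first <= end and last >= start)


def size_by_period(entries, periods):
    """Map each (start, end) period to the category's vocabulary size in it."""
    return {period: vocabulary_size(entries, *period) for period in periods}


periods = [(1500, 1549), (1550, 1599), (1600, 1649), (1650, 1699), (1700, 1749)]
for (start, end), size in size_by_period(entries, periods).items():
    print(f"{start}-{end}: {size}")
```

Periods in which the count rises or falls sharply would be the natural places to begin examining how the words concerned are actually used in the corpus.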
These are strong starting points for using the Historical Thesaurus in the investigations of the Linguistic DNA project. They are not exhaustive, however, and the project team is always open to new ways of combining EEBO data and Thesaurus data.