Author Archives: Fraser Dallachy


LDNA at Digital Humanities Congress 2016, Sheffield

LDNA organised two panels at the 2016 Digital Humanities Congress (DHC; Sheffield, 8th–10th September). Both focused on text analytics, with the first adopting the theme ‘Between numbers and words’ and the second ‘Identifying complex meanings in historical texts’. Fraser reports.



Digital Humanities 2016, Kraków

Conference reflections jointly written with Justyna Robinson

Four members of the LDNA team—Marc Alexander, Justyna Robinson, Brian Aitken, and Fraser Dallachy—attended this year’s Digital Humanities (DH) conference in Kraków, Poland. With over 800 attendees, the conference is an excellent opportunity to exchange ideas, learn of new areas of potential interest, and network with academics from around the world. The team presented a version of the project’s poster at the event (attached to this post), giving an overview of the project and the technical steps taken so far, and introducing the research themes.

Digital methods of textual analysis are an important subject for the DH attendees, and there were several papers outlining approaches and results from such research. One of the most relevant of these for us was the paper by Glenn Roe et al. on identification of re-used text in Eighteenth Century Collections Online (ECCO). After eliminating re-printings of texts, this project used a specially developed tool which found repeated passages, indicating where an author had re-used their own or another’s words. The results are available and searchable on their website. In the same session, a team led by Monica Berti at Leipzig described a method of identifying and labelling fragments of text quoted from ancient Greek authors. These projects represent something like a parallel research track to ours, tracing the history of ideas through replication of passages rather than through more abstract word clusters. Early English Books Online (EEBO) also received some attention, with Daniel James Powell giving an overview of its history and importance to digital research on historical texts.

Discussion with other attendees at the poster session was especially productive, and resulted in several strong leads for the team to follow up. A subject which was mentioned to us repeatedly was that of topic modelling. Multiple panels were dedicated to the use of these methods to extract information about the contents of texts, an approach which LDNA has considered employing. The team at Saarland studying the Royal Society Corpus (with whom LDNA is already in contact) use topic modelling to study the development of scientific concepts and terminology. Their results were encouraging, allowing them to identify word groupings which represent scientific disciplines such as physiology, mechanical engineering, and metallurgy. Following these topics through time showed that the number of topics increases whilst their vocabulary becomes more specialised. Although LDNA has reservations about how useful topic modelling is for our purposes, the work being conducted at Saarland refines and implements its methodology in a way which we would seek to learn from if we do choose to pursue it further.
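For readers unfamiliar with the technique, the sketch below shows the general shape of topic modelling using Gensim’s LDA implementation; the toy corpus and parameter values are our own invention, not the Saarland team’s pipeline.

```python
# Minimal topic-modelling sketch using Gensim's LDA implementation.
# The toy corpus and parameter values are illustrative only.
from gensim import corpora
from gensim.models import LdaModel

# Toy corpus: each document is a list of (already normalised) tokens.
documents = [
    ["muscle", "nerve", "blood", "organ", "tissue"],
    ["engine", "pressure", "steam", "piston", "valve"],
    ["ore", "furnace", "iron", "smelting", "alloy"],
    ["nerve", "tissue", "circulation", "blood"],
    ["steam", "engine", "boiler", "pressure"],
]

dictionary = corpora.Dictionary(documents)
bow_corpus = [dictionary.doc2bow(doc) for doc in documents]

# Fit a small LDA model; num_topics would normally be tuned.
lda = LdaModel(bow_corpus, id2word=dictionary, num_topics=3,
               passes=50, random_state=1)

# Inspect the word groupings: ideally they resemble 'disciplines'
# such as physiology, engineering, and metallurgy.
for topic_id, words in lda.print_topics(num_words=4):
    print(topic_id, words)
```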


At the poster session

Visualising big data is of central interest to the LDNA project, especially in the context of the upcoming LDNA Visualisation Workshop. With this in mind, we paid particular attention to projects that presented new and interesting ways of seeing large data. A number of presentations focused on network visualisations, often linking metadata, e.g. the social networks of royal societies or academies as reconstructed from letter correspondence. An interesting visualisation of unstructured linguistic data was presented by the EPFL team: Vincent Buntinx, Cyril Bornet, and Frédéric Kaplan visualised lexical usage in 200 years of newspapers on a circle, with the radial dimension representing the number of years a word has been in use and the circumferential dimension showing the period during which it was used. [1]
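To make the geometry of that layout concrete, here is a rough matplotlib sketch of the same idea, using invented words and dates; the EPFL visualisation itself is considerably more sophisticated.

```python
# Rough sketch of the circular layout described above, with invented
# data: radius = number of years a word was in use, angle = midpoint
# of its period of use mapped onto the circle (1800-2000 here).
import math
import matplotlib.pyplot as plt

# (word, first_year, last_year) -- entirely made-up examples
words = [("telegraph", 1840, 1950), ("wireless", 1900, 1970),
         ("omnibus", 1830, 1910), ("television", 1930, 2000)]

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for word, first, last in words:
    span = last - first                       # radial: years in use
    mid = (first + last) / 2
    theta = 2 * math.pi * (mid - 1800) / 200  # circumferential: period
    ax.scatter(theta, span)
    ax.annotate(word, (theta, span))
plt.show()
```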

Stylometry, with its interest in identifying and measuring the aspects of language which contribute to the impression of authorial style, produced some interesting papers. One of the themes common to stylometry and other strands of DH research is the way concepts are operationalised. The varied approaches to concepts taken by DH researchers were noticeable: for example, whether each noun can be considered a concept, or whether a concept should be defined as “a functional thing”. This suggests that the work on concept identification undertaken by the LDNA team will be of interest to the wider DH community. Also amongst the stylometric papers was a look at historical language change by Maciej Eder and Rafal Górski, which used bootstrap consensus network analysis of part-of-speech (POS) tagged texts to contrast syntax and sentence structure between time periods. The paper used multidimensional scaling (MDS) to reduce POS-tagged texts to a single value which could then be plotted against time, allowing the authors to show a gradual change in the MDS results between the earliest and latest texts. The paper highlighted both how useful a visualisation can be for identifying a change and how difficult it can be to quantify exactly what that visualisation shows.
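As a rough illustration of the MDS step (a generic sketch using scipy and scikit-learn, not Eder and Górski’s actual pipeline), the following reduces a handful of invented POS-trigram frequency profiles to a single dimension that can be plotted against date:

```python
# Schematic illustration of the MDS step: each text is represented by
# a frequency profile (here an invented POS-trigram profile), reduced
# to one dimension that can be plotted against composition date.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS

years = np.array([1500, 1600, 1700, 1800])
profiles = np.array([            # rows: texts; columns: trigram freqs
    [0.30, 0.10, 0.20, 0.40],
    [0.28, 0.12, 0.22, 0.38],
    [0.22, 0.18, 0.28, 0.32],
    [0.18, 0.22, 0.32, 0.28],
])

# Pairwise distances between texts, then 1-D MDS on that matrix.
distances = squareform(pdist(profiles, metric="cityblock"))
mds = MDS(n_components=1, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(distances).ravel()

for year, value in zip(years, coords):
    print(year, round(float(value), 3))  # gradual drift over time
```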

However, on a different but very important note, a strong theme of the conference was diversity, with a thread of panels discussing the different ways in which this subject is applicable to the digital humanities. From a personal point of view, I think LDNA has a strong awareness of both the scope and the limitations of our interests and approaches (although we can never afford to be complacent). We’ve considered what our textual resources represent, and the RAs will soon explore this subject from different angles in future blog posts. EEBO and other text collections are more expansive, inclusive, and diverse than prior research has been able to access, and this feels like part of an enormously positive movement in academia to open up more and more data for new kinds of study. As extensive as our resources are, however, they still have limitations reflecting the (mostly Western, mostly white, mostly male, mostly middle-to-upper class) societal groups who were able to read, write, and print the words which ended up in these collections. The resources open to academia are continually growing, and hopefully this expanding diversity will open up ever more of the world’s knowledge to ever more of its population. Whilst the discussions at this conference made clear that there is a long way to go in fully embracing diversity in the digital humanities, there are indications that the situation is improving, and it is incumbent upon us all to ensure that this continues.

For another view of the conference, Brian Aitken, Digital Humanities Research Officer at Glasgow, has written about his own experience on his blog.

———

1. Vincent Buntinx, Cyril Bornet, and Frédéric Kaplan (EPFL, École polytechnique fédérale de Lausanne, Switzerland), ‘Studying Linguistic Changes on 200 Years of Newspapers’.

QPL Semantic Spaces Workshop, University of Strathclyde

The workshop day, titled ‘Semantic Spaces at the Intersection of NLP, Physics, and Cognitive Science’, was part of the larger Quantum Physics and Logic (QPL) conference held at the University of Strathclyde. The workshop focused on computational approaches to modelling semantics and semantic relations in language. The day was divided into three parts: the first session concerned the application of principles derived from physics and formal logic to the expression of linguistic phenomena; the middle session segued into Natural Language Processing (NLP); and the final session covered cognitive science and cognitive linguistics’ views of semantics. My interest in attending was to get an idea of the approach which ‘hard science’ is taking to aspects of semantics that overlap with the research of the Linguistic DNA project, and to see whether there was anything we might be able to apply to our own work.

A subject which struck a chord was the discussion of vector space modelling, which is near the top of our list of topics to implement as we approach the point where we move from identifying word pairs to establishing clusters of related words. The subject was touched on in several of the papers, and was particularly relevant to the final paper of the day, in which Stephen McGregor described work done by himself and colleagues to locate ‘subspaces’ within vector space models which delineate an analogical relationship between different words. Beginning with the SAT-style statement that ‘dog is to cat as puppy is to kitten’, the paper used pointwise mutual information (PMI) measurements as a basis on which to plot these words in vector space, and then examined the geometrical relationship of the points to demonstrate how it might be possible to define a subspace within the vector space and thus automatically identify the positions of analogical partner words or concepts.
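The vector-space view of analogy underlying this kind of work can be sketched in a few lines; the co-occurrence counts below are invented, and this is the classic vector-offset comparison rather than McGregor’s subspace method:

```python
# Toy sketch of the general technique: build positive PMI vectors from
# co-occurrence counts, then compare vector offsets for the analogy
# 'dog is to cat as puppy is to kitten'. Counts are invented.
import numpy as np

words = ["dog", "cat", "puppy", "kitten"]
contexts = ["bark", "meow", "young", "pet"]
counts = np.array([                # invented co-occurrence counts
    [8, 0, 1, 6],   # dog
    [0, 8, 1, 6],   # cat
    [6, 0, 7, 4],   # puppy
    [0, 6, 7, 4],   # kitten
], dtype=float)

total = counts.sum()
p_w = counts.sum(axis=1, keepdims=True) / total
p_c = counts.sum(axis=0, keepdims=True) / total
p_wc = counts / total
with np.errstate(divide="ignore"):
    pmi = np.log2(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0)          # one PPMI vector per word (rows)

# If the analogy holds geometrically, the offsets should be similar.
offset_a = ppmi[0] - ppmi[1]       # dog - cat
offset_b = ppmi[2] - ppmi[3]       # puppy - kitten
cos = offset_a @ offset_b / (
    np.linalg.norm(offset_a) * np.linalg.norm(offset_b))
print(round(float(cos), 3))
```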

The NLP section of the workshop was dominated by Categorical Compositional Distributional semantics and the ways in which researchers using this approach are mapping the emergence of meaning from syntactic structure. The morning’s physics papers had discussed in some detail the application of formal logic expressions to sentence semantics, describing, for example, the way in which a transitive verb combines with subject and object nouns to ‘output’ the meaning of the sentence. These papers applied this theoretical approach to specific sentence elements, such as Dimitri Kartsaklis’ analysis of coordination and Mehrnoosh Sadrzadeh’s study of quantifiers. To me, these papers chimed with work Seth has been doing, considering the importance of handling different parts of speech in different ways during processing; they made clear the flaws in the so-called ‘bag-of-words’ approach to computational linguistics and highlighted that, in the long run, consideration of syntax should be an important part of the kind of computational semantics we’re undertaking.
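A toy example may help convey the compositional idea: in the categorical framework a transitive verb can be modelled as a tensor which consumes subject and object noun vectors and ‘outputs’ a sentence vector. The dimensions and values below are invented for illustration; real models of this kind are learnt from corpora.

```python
# Toy illustration of categorical composition: a transitive verb as a
# tensor in N (x) S (x) N, contracted with subject and object vectors
# to yield a sentence-space vector. All values here are invented.
import numpy as np

n_dim, s_dim = 3, 2                        # noun and sentence spaces
rng = np.random.default_rng(0)

dogs = rng.random(n_dim)                   # subject noun vector
cats = rng.random(n_dim)                   # object noun vector
chase = rng.random((n_dim, s_dim, n_dim))  # verb tensor

# Sentence meaning: contract the verb tensor with subject and object.
sentence = np.einsum("i,isj,j->s", dogs, chase, cats)
print(sentence)                            # vector in sentence space
```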

Also of special interest was Peter Gärdenfors’ consideration of domains as components of word meanings. In the main, the point was illustrated through consideration of nouns (although touching on other parts of speech), asking whether it might be helpful to think of words as fundamentally dependent on domains such as place, shape, and temperature (so that, for example, ‘round’ maintains some connection to the sense of a curve in physical space even when not used as a noun, whilst most ‘verbed’ nouns retain important connections to their parent’s referent). Whilst its applicability to current LDNA work is mostly indirect, this discussion is important food for thought, especially for its potential bearing on the encyclopedic aspect of a word’s semantics in context.

The workshop provided a thought-provoking day for a relative outsider, offering an important viewpoint on the approaches to semantics being pioneered outside of arts faculties, an awareness which can only strengthen our own work. I’d like to thank the organisers and contributors for a hugely interesting and intellectually engaging day.


Workshop Reflections

University of Sussex--view of campus from above

A fortnight ago, our first methodology workshop was held at the University of Sussex. The programme was full and productive for the project team, with plenty of opportunities to test our thinking about how we move forward, and it has given us much to consider. We can perhaps best summarise the overarching themes by starting with the questions we went in with, together with some more that were raised during the event.

Top of our minds going in were questions such as: What is a concept? How will we recognise one when we find it? How exactly do we (or should we) go about finding concepts in the first place? Our thinking on these matters has taken a step forward thanks to the discussions, and the next couple of blog posts are already in preparation to explore what we’ve learned and the directions it suggests for the coming months. Suggestions raised included investigating synonymous terms and the relationships between onomasiological conceptual fields. Our ideas are still taking shape as we consider these suggestions afresh.

Another major question concerned the importance of marking up and pre-processing the data before we run our own processes. The issue of spelling regularisation has formed a large part of our initial work on the EEBO data, with our comparison of the VARD and MorphAdorner tools documented in several earlier posts. Nor is spelling the only issue: pre-processing texts with MorphAdorner and the Historical Thesaurus Semantic Tagger also offers further layers of annotation. Because our new processes can be designed to take in multiple types of input (e.g. lemma, part of speech), or combinations of these, we were curious to learn which of them workshop participants thought we should prioritise.

There was extensive discussion about the extent to which the text should be pre-processed before being loaded into the parser, and some disagreement over whether spelling regularisation is a necessary step or whether it ought not to be used because it skews word-frequency counts. Whether or not an individual method of pre-processing proves fruitful – or, indeed, whether it is better to process the raw text itself – it is ultimately to our benefit to have explored these avenues and to be able to say with authority what has been successful and what has not.

A final core point was the question of the technology we plan to use and how we can build on the most effective tools already produced for linguistic research. As the Digital team at Sheffield (@HRIDigital) begins work on building the parser, we wanted to consider which parts of that process should be created from scratch and which can be accomplished effectively with existing software.

In the course of presentations and discussions, participants drew our attention to a variety of tools, which we have prioritised for investigation, including those for identifying synonymy and polysemy, word sense disambiguation, novel sense detection, and topic identification. The result is fresh ideas for technologies to explore, and the research associates have got to work learning about tools such as Gensim, HiDEx (High Dimensional Explorer), and BlackLab.
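As a flavour of the sort of quick experiment this involves, the sketch below trains a tiny word2vec model with Gensim (version 4 API) and queries its nearest neighbours as a crude stand-in for synonym candidates; the corpus is a placeholder for EEBO-scale text.

```python
# Tiny word2vec experiment with Gensim: words that appear in similar
# contexts end up close together in vector space. The four-sentence
# corpus is a placeholder, far too small for meaningful results.
from gensim.models import Word2Vec

sentences = [
    ["the", "kings", "armie", "marched", "north"],
    ["the", "queens", "armie", "marched", "south"],
    ["the", "host", "marched", "north"],
    ["the", "host", "marched", "south"],
]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1,
                 epochs=200, seed=1)

# Nearest neighbours as crude synonym candidates.
print(model.wv.most_similar("armie", topn=2))
```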

From the very start, we have been clear that we want to be able to understand and explain as much as possible how our processes work, rather than create something which acts as a ‘black box’, devouring input and producing results in a manner that cannot be evaluated or understood. Conducting these discussions while we’re still in the design phase has helped reinforce the value of that ideal for the team.

We firmly believe that drawing on the expertise and experience of the academic community in our field will make Linguistic DNA a stronger project. The workshop helped to progress our thinking, and we’d like to thank again everyone who attended the event—your input is hugely appreciated, and we look forward to sharing with you where it takes us!

The Historical Thesaurus of English and its Related Projects

One of the resources which the Linguistic DNA project is drawing on is the Historical Thesaurus of English. Organising every word in the language, present and past, into a hierarchical structure based on word meaning, the Historical Thesaurus is an invaluable tool for historical semantic research. The Thesaurus data will be involved in the internal workings of the parser programme being developed at the Sheffield Humanities Research Institute (HRI), and will be present in the annotated EEBO and ECCO corpora on which the parser works.


Structure of the Thesaurus


At its top level, the Historical Thesaurus breaks the vocabulary of English into three main categories – ‘The external world’, ‘The mental world’, and ‘The social world’. These are further subdivided so that, for example, ‘The external world’ contains within it the categories ‘The earth’, ‘Life’, ‘Health and disease’, ‘People’, and ‘Animals’, amongst others. This subdivision continues to a maximum depth of seven levels, with a category number being assigned at each level. As a result, the category ‘Daily record/journal’, for instance, has the category number 03.09.06.01.02 (noun), comprising the following levels:

03                Society
03.09             Communication
03.09.06          Record
03.09.06.01       Written record
03.09.06.01.02    Daily record/journal

This category, then, contains all the recorded words for a journal (journal, day-book, diary, memorial, ephemeris, diurnal, journal-book, diet-book), accompanied by the date ranges in which those words are known to have been used.

Each of the seven main category levels also contains subcategories where required, allowing an exceptionally fine-grained organisation of the semantic content of the language. The Thesaurus provides, therefore, a rich seam to be mined for information on lexical and conceptual development in the English-speaking world over the past two millennia.
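To illustrate how the category numbers encode the hierarchy, a hypothetical helper might unpack a code into its chain of ancestors (the labels are those of the worked example above):

```python
# Hypothetical helper illustrating the category-number scheme: split a
# Historical Thesaurus code such as '03.09.06.01.02' into the chain of
# ancestor codes, one per level of the hierarchy.
def ancestor_codes(code: str) -> list:
    parts = code.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

labels = {                         # labels from the worked example
    "03": "Society",
    "03.09": "Communication",
    "03.09.06": "Record",
    "03.09.06.01": "Written record",
    "03.09.06.01.02": "Daily record/journal",
}

for step in ancestor_codes("03.09.06.01.02"):
    print(f"{step:<18} {labels[step]}")
```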


Related Projects


Since the completion of the first edition of the Thesaurus in 2009, projects have begun to drill down into its data. A major project, Mapping Metaphor with the Historical Thesaurus, investigated every word in the Thesaurus in order to identify systematic metaphorical links between categories. Its primary output is a complete ‘metaphor map’ of the language, which provides fascinating insights into the ways in which certain concepts are discussed in terms of others. It also demonstrates strikingly just how prevalent metaphor is in the language at the level of individual words.

The SAMUELS project utilised the Thesaurus in an entirely different way, aiming to create semantic tagging software capable of labelling every word in a text with the code of the category in which that word sense can be found in the Thesaurus. This is no easy feat, given that some words have several hundred potential meanings – ‘set’, for example, has 345 entries (not including those where it is part of a multi-word phrase), whilst ‘run’ has 302. The semantic tagging tool was created and tested by a consortium of researchers based at the Universities of Lancaster, Glasgow, and Huddersfield, and the University of Central Lancashire. It is currently the only software capable of assigning word meanings based on dating information, and this diachronic tagging ability allows it to be used with a high degree of accuracy on texts such as those contained in EEBO and ECCO. The tagged Hansard corpus, comprising all the speeches made in the British Houses of Parliament between 1803 and 2005, is publicly available via Mark Davies’ corpus website at Brigham Young University, with the tagged EEBO corpus to follow.
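The value of dating information for tagging can be shown with a deliberately simplified sketch, which is our own toy example rather than the SAMUELS tagger’s implementation: given candidate senses with attested date ranges, only those current when the text was printed remain plausible readings.

```python
# Toy illustration of why dating information matters for tagging:
# keep only the candidate senses attested when the text was printed.
# Invented data; not the SAMUELS tagger's implementation.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sense:
    category: str           # Historical Thesaurus category code
    first: int              # first attested year
    last: Optional[int]     # last attested year (None = still current)

# Invented candidate senses for a single word form.
candidates = [
    Sense("01.02.03", 1400, 1550),
    Sense("03.09.06", 1580, None),
    Sense("02.04.01", 1850, None),
]

def senses_current_in(year, senses):
    return [s for s in senses
            if s.first <= year and (s.last is None or year <= s.last)]

# For a 1620 EEBO text, only the second sense is a plausible reading.
print(senses_current_in(1620, candidates))
```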


The Historical Thesaurus and Linguistic DNA


The output of the SAMUELS project forms a large part of the input to Linguistic DNA, in the form of the semantically tagged EEBO corpus. It is hoped that the annotation of every word in the EEBO corpus with a Historical Thesaurus meaning code will allow more accurate automatic evaluation of the word groupings which might constitute the kinds of concept that LDNA is looking to identify. Exactly how this might be employed in the development of the HRI’s parser software is something the project will explore in the coming months.

One of the most interesting ways in which the Historical Thesaurus can be used in the Linguistic DNA project is to contrast the concepts constructed by the parser with the categories of the Thesaurus. This will allow the team to identify where the automatically identified concepts and their associated lexis accord with the decisions made manually by a team of lexicographers. Differences will therefore provide potential areas for further research, perhaps indicating facets of the parser which can be improved, or prompting re-evaluation of lexicographical data in light of findings from the EEBO and ECCO textual corpora.

The Glasgow-based team working on Research Theme 3 is also interested in approaching the Thesaurus categories from a more statistical viewpoint: looking for significant shifts in the size of the vocabulary associated with particular semantic categories, and investigating the words involved in the context of their use in EEBO.
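A hypothetical sketch of what such a statistical pass might look like: count how many words are attested in each category in each fifty-year slice, and look for categories whose counts change sharply (the records below are invented).

```python
# Hypothetical sketch of the 'statistical viewpoint': count words in
# use per Thesaurus category per 50-year slice, to surface categories
# whose vocabulary grows or shrinks sharply. Records are invented.
from collections import defaultdict

# (word, category code, first year, last year)
records = [
    ("journal", "03.09.06.01.02", 1610, 2000),
    ("diurnal", "03.09.06.01.02", 1600, 1800),
    ("day-book", "03.09.06.01.02", 1580, 1900),
    ("ephemeris", "03.09.06.01.02", 1550, 1700),
]

sizes = defaultdict(int)  # (category, period start) -> active words
for word, cat, first, last in records:
    for period in range(1500, 2000, 50):
        # A word is active in [period, period+49] if its attested
        # range overlaps that slice.
        if first <= period + 49 and last >= period:
            sizes[(cat, period)] += 1

for (cat, period), n in sorted(sizes.items()):
    print(cat, period, n)
```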

These are strong starting points for the use of the Historical Thesaurus in the Linguistic DNA project’s investigations, though they are not exhaustive, and the project team is always open to new ways of employing EEBO and Thesaurus data in combination.