Conference report: Diachronic corpora and genre in Nottingham

On Friday 8 April 2016, Susan Fitzmaurice and Seth Mehl attended 'Diachronic corpora, genre, and language change' at the University of Nottingham, where Seth gave a paper entitled 'Automatic genre identification in EEBO-TCP: A multidisciplinary perspective on problems and prospects'. The event featured researchers from around the globe, exploring issues in historical data sets, the nature of genre and text types, and the modelling of diachronic change.

The day’s plenary speeches were engaging and insightful: Bethany Gray spoke about academic writing as a locus of linguistic change, in contrast to the common expectation that change originates in spoken language. This is particularly relevant for those of us working with older historical data, where written language is our only evidence for change. Thomas Gloning described the Deutsche Textarchiv, and in particular the recent addition to that corpus of the Dingler Corpus, an essential record of written scientific German covering 1820 to 1932. Gloning presented the useful definition of text types or genres as ‘traditions of communicative action’. In analysing such text types, or traditions, it is possible to map syntax and lexis to text functions and topics, though Gloning cautioned that some of the most important elements of such mapping are not currently achievable by machines. This careful, valuable perspective and approach relates to our own (as discussed below).

Other research papers included a presentation by Fabrizio Esposito, who, like the Linguistic DNA project, is using distributional semantic methods; his work looks at recent change in White House Press Briefings. Bryan Jurish presented DiaCollo, a powerful tool for analysing and visualising collocation patterns as they change over time in very large data sets. Vaclav Brezina analysed lexical meaning in EEBO-TCP by measuring differences in collocation patterns across overlapping, sliding diachronic windows.
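
To make the sliding-window idea concrete, here is a minimal Python sketch of comparing a word’s collocates across overlapping diachronic windows. It is not Brezina’s method or the DiaCollo implementation; the input format (year-dated token lists), the collocation span, and the window width and step are all assumptions made for illustration.

```python
from collections import Counter

def collocates(docs, target, span=5):
    """Count words co-occurring with `target` within +/- `span` tokens."""
    counts = Counter()
    for tokens in docs:
        for i, tok in enumerate(tokens):
            if tok == target:
                window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
                counts.update(window)
    return counts

def collocate_profiles(dated_docs, target, start, end, width=20, step=10):
    """Collocate counts for `target` in overlapping diachronic windows,
    e.g. 1500-1519, 1510-1529, ... (width and step are illustrative)."""
    profiles = {}
    year = start
    while year + width <= end:
        docs = [toks for (y, toks) in dated_docs if year <= y < year + width]
        profiles[(year, year + width)] = collocates(docs, target)
        year += step
    return profiles
```

Comparing the resulting profiles between adjacent windows, by rank or by an association measure, is then a matter of choosing a suitable statistic.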

What did LDNA contribute?

LDNA is asking whether specific concepts emerge uniquely in particular genres, and whether and how those concepts are then adopted and adapted in other genres. Genre is a fuzzy concept, representing categories of texts. Such categories are characterised by formal features such as print layout, phonetics, morphosyntax, lexis, and semantics, and by functional features such as purpose of composition, reader expectations, and social and cultural contexts. It is productive to distinguish approaches to genre in different contexts: in Early Modern Studies, categories may be inherited in the canon, and questioned and explored in relation to literature, history, or philosophical or cultural studies; corpus linguistics often seeks a scientifically reproducible approach to genre and aims to learn about language and variation; and Natural Language Processing (NLP) often aims to engineer tools for solving specific tasks. At the Nottingham conference, Seth illustrated his remarks by reflecting on Ted Underwood’s work automatically identifying genres in HathiTrust texts via supervised machine learning. He then laid out the project’s plan of investigating genre (or text types) by categorising Early Modern texts using the outputs of the LDNA processor, alongside other formal text features. This relates to Gloning’s aforementioned assertion that text topic and function might be mapped onto syntax and lexis; in our case, it is a combined mapping of discursive topics or conceptual fields, lexis, morphosyntax, and additional formal features such as the presence of foreign words or the density of punctuation or parts of speech that will allow us to group texts into categories in a relatively data-driven way.
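
As a rough illustration of what grouping texts by surface features alone might look like, here is a hedged sketch: the features (punctuation density, average word length, a crude foreign-word proxy) and the use of k-means clustering are arbitrary choices for the example, not the LDNA processor’s actual design.

```python
import string
import numpy as np
from sklearn.cluster import KMeans

def formal_features(text, foreign_markers=("ae", "oe")):
    """Very rough surface features: punctuation density, average word length,
    and a crude proxy for foreign-word presence (all illustrative choices)."""
    tokens = text.split()
    n = max(len(tokens), 1)
    punct = sum(ch in string.punctuation for ch in text) / max(len(text), 1)
    avg_len = sum(len(t.strip(string.punctuation)) for t in tokens) / n
    foreign = sum(any(m in t.lower() for m in foreign_markers) for t in tokens) / n
    return [punct, avg_len, foreign]

def group_texts(texts, k=3):
    """Cluster texts into k groups on the basis of surface features alone."""
    X = np.array([formal_features(t) for t in texts])
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
```

In practice the interesting step is the one the sketch omits: combining such surface features with the processor’s conceptual-field outputs before any grouping is attempted.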

The conference was very well organised by Richard J. Whitt, with a lovely lunch and dinner over which attendees shared ideas and dug further into linguistic issues. Susan and Seth were delighted to participate.

LDNA’s first year: Reflections from RA Seth Mehl

In wrapping up the first year of LDNA, I’ve taken a moment to consider some of the over-arching questions that have occupied much of my creative and critical faculties so far. What follows is a personal reflection on some issues that I’ve found especially exciting and engaging.

Semantics and concepts

The Linguistic DNA project sets out to identify ‘semantic and conceptual change’ in Early Modern English texts, with attention to variation too, particularly in the form of semantic and conceptual variation across text types. The first questions, for me, were therefore what exactly constitutes semantics, and what we mean when we say concept. These are, in part, abstract questions, but the answers must also be framed in terms of practical operations for computational linguistics. Put differently, if semantics and concepts are not defined in terms of features that can be identified automatically by computer, then the definitions are not terribly useful for us.

My first attempt at approaching semantics and concepts for the project began with synonymy, then built up to onomasiological relationships, and then defined concepts as networks of onomasiological relationships. Following Kris Heylen’s visit, I realised just how similar this approach was to the most recent QLVL work. My next stab at approaching these terms moved towards an idea of encyclopaedic meaning inspired in part by the ‘encyclopaedic semantics’ of Cognitive Linguistics, and related to sets of words in contexts of use. This approach seemed coherent and effective. We have since come to define concepts, for our purposes, as discursive, operating at a level larger than syntactic relations, phrases, clauses, or sentences, but smaller than an entire text (and therefore dissimilar from topic modelling).
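
To illustrate the scale we have in mind when we say ‘discursive’, here is a toy sketch that counts co-occurrence within fixed windows larger than a sentence but smaller than a whole text. The 100-token window is an arbitrary assumption for the example, not a project parameter.

```python
from collections import Counter
from itertools import combinations

def discursive_cooccurrence(tokens, window=100):
    """Count word pairs co-occurring within fixed-size 'discursive' windows:
    larger than a clause or sentence, smaller than the whole text.
    The 100-token window is an illustrative choice only."""
    pairs = Counter()
    for start in range(0, len(tokens), window):
        chunk = sorted(set(tokens[start:start + window]))
        pairs.update(combinations(chunk, 2))
    return pairs
```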

Operations

Given that the project started without definitions of semantics and concept, it follows that the operations for identifying them had not been laid out either. As a corpus semanticist, the natural starting point for me was to sort through corpus methods for automatic semantic analysis, including collocation analysis, second-order collocations, and vector space models. We continue to explore those methods by sorting through various parameters and variables for each. Most importantly, we are working to analyse our data in terms of linguistically meaningful probabilities. That is, we are thinking about the co-occurrence of words not simply as data points that might arise randomly, but as linguistic choices that are rarely, if ever, random. This requires us to consider how often linguistic events such as lexical co-occurrences actually arise, given the opportunity for them to arise. If we hope to use computational tools to learn about language, then we must certainly ensure that our computational approaches incorporate what we know about language, randomness, and probability.
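
As one familiar example of building probability into a co-occurrence measure, here is a minimal pointwise mutual information sketch comparing observed co-occurrence with the frequency expected by chance. It stands in for the range of measures and parameters we are still exploring, and is not the project’s settled method.

```python
import math
from collections import Counter

def pmi(tokens, word_a, word_b, span=5):
    """Pointwise mutual information: observed co-occurrence of word_a and word_b
    within `span` tokens, compared with the frequency expected if the two words
    occurred independently of one another."""
    n = len(tokens)
    freq = Counter(tokens)
    observed = 0
    for i, tok in enumerate(tokens):
        if tok == word_a:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            observed += window.count(word_b)
    if observed == 0:
        return float("-inf")
    # Expected co-occurrences if word_b were scattered at random through the text:
    expected = freq[word_a] * (2 * span) * (freq[word_b] / n)
    return math.log2(observed / expected)
```

A positive score means the pair co-occurs more often than chance would predict; the point of the sketch is simply that ‘given the opportunity to arise’ is made explicit in the expected-frequency term.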

Equally important was the recognition that although we are using corpus methods, we are not working with corpora, or at least not with corpora as per standard definitions. I define a corpus as a linguistic data-set sampled to represent a particular population of language users or of language in use. Corpus linguists examine language samples in order to draw conclusions about the populations they represent. EEBO and ECCO are, crucially, not sampled to represent populations—they are essentially arbitrary data sets, collected on the basis of convenience, of texts’ survival through history, and of scholarly interest and bias, among other variables. It is not at all clear that EEBO and ECCO can be used to draw rigorous conclusions about broader populations. Within the project, we often refer to EEBO and ECCO as ‘universes of printed discourse’, which renders them a sort of population in themselves. From that perspective, we can conclude a great deal about EEBO and ECCO, and the texts they contain, but it is tenuous at best to relate those conclusions to a broader population of language use. This is something that we must continually bear in mind.

Rather than seeing the LDNA processor as a tool for representing linguistic trends across populations, I have recently found it more useful to think of our processor primarily as a tool to aid in information retrieval: it is useful for identifying texts where particular discursive concepts appear. Our tools are therefore expected to be useful for conducting case studies of particular texts and sets of texts that exemplify particular concepts. In a related way, we use the metaphor of a topological map where texts and groups of texts exemplifying concepts rise up like hills from the landscape of the data. The processor allows us to map that topography and then ‘zoom in’ on particular hills for closer examination. This has been a useful metaphor for me in maintaining a sense of the project’s ultimate aims.
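
In that information-retrieval spirit, a toy illustration: score texts by the density of a hand-picked set of concept-related words and surface the highest-scoring ones as the ‘hills’ worth zooming in on. The word set and the simple density score are assumptions made for the sake of the sketch, not the processor’s actual output.

```python
def concept_density(tokens, concept_terms):
    """Share of tokens in a text that belong to a set of concept-related words."""
    if not tokens:
        return 0.0
    hits = sum(tok.lower() in concept_terms for tok in tokens)
    return hits / len(tokens)

def rank_texts(named_texts, concept_terms, top=10):
    """Return the texts where the concept is most prominent -- the 'hills'
    we might zoom in on for closer reading. `named_texts` is a list of
    (title, token_list) pairs, an assumed input format."""
    scored = [(concept_density(toks, concept_terms), name) for name, toks in named_texts]
    return sorted(scored, reverse=True)[:top]
```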

All of these topics represent ongoing developments for LDNA, and one of the great pleasures of the project has been the engaging discussions with colleagues about these issues over the last year.