On Friday 8 April 2016, Susan Fitzmaurice and Seth Mehl attended ‘Diachronic corpora, genre, and language change’ at the University of Nottingham, where Seth gave a paper entitled ‘Automatic genre identification in EEBO-TCP: A multidisciplinary perspective on problems and prospects’. The event featured researchers from around the globe, exploring issues in historical data sets; the nature of genre and text types; and modelling diachronic change.
The day’s plenary speeches were engaging and insightful. Bethany Gray spoke about academic writing as a locus of linguistic change, in contrast to the common expectation that change originates in spoken language. This is particularly relevant for those of us working with older historical data, for which written language is our only evidence of change. Thomas Gloning described the Deutsche Textarchiv, and in particular the recent addition of the Dingler Corpus, an essential record of written scientific German representing 1820 to 1932. Gloning presented a useful definition of text types or genres as ‘traditions of communicative action’. In analysing such text types, or traditions, it is possible to map syntax and lexis to text functions and topics, though Gloning cautioned that some of the most important elements of such mapping are not currently achievable by machines. This is a careful, valuable perspective and approach, which relates to our own (as discussed below).
Other research papers included a presentation by Fabrizio Esposito who, like the Linguistic DNA project, is using distributional semantic methods. His work looks at recent change in White House Press Briefings. Bryan Jurish presented DiaCollo, a powerful tool for analysing and visualising collocation patterns as they change over time in very large data sets. Vaclav Brezina analysed lexical meaning in EEBO-TCP by measuring differences in collocation patterns across overlapping, sliding diachronic windows.
What did LDNA contribute?
LDNA is asking whether specific concepts emerge uniquely in particular genres, and whether and how those concepts are then adopted and adapted in other genres. Genre is a fuzzy concept, representing categories of texts. Such categories are characterised by formal features such as print layout, phonetics, morphosyntax, lexis, and semantics; and functional features such as purpose of composition, reader expectations, and social and cultural contexts. It is productive to distinguish approaches to genre in different contexts. In Early Modern Studies, categories may be inherited from the canon, and questioned and explored in relation to literature, history, or philosophical or cultural studies; corpus linguistics often seeks a scientifically reproducible approach to genre and aims to learn about language and variation; while Natural Language Processing (NLP) often aims to engineer tools for solving specific tasks. At the Nottingham conference, Seth illustrated his remarks by reflecting on Ted Underwood’s work automatically identifying genres in HathiTrust texts via supervised machine learning. He then laid out the project’s plan of investigating genre (or text types) by categorising Early Modern texts using the outputs of the LDNA processor, alongside other formal text features. This relates to Gloning’s aforementioned assertion that text topic and function might be mapped onto syntax and lexis. In our case, it is a combined mapping of discursive topics or conceptual fields, lexis, morphosyntax, and additional formal features, such as the presence of foreign words or the density of punctuation or parts of speech, that will allow us to group texts into categories in a relatively data-driven way.
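To make the idea of ‘formal text features’ a little more concrete, here is a minimal Python sketch of how one or two such features might be measured for a single text. It is an illustration only and not the LDNA processor or the project’s actual method: the function name `formal_features`, the choice of features, and the definition of punctuation density (punctuation characters as a share of all non-whitespace characters) are all assumptions made for this example. It also relies on Python’s ASCII-only `string.punctuation`, which would miss many marks found in Early Modern print.

```python
import string


def formal_features(text: str) -> dict[str, float]:
    """Return a few simple formal features for one text.

    Hypothetical illustration: these feature definitions are assumptions,
    not the features used by the LDNA project.
    """
    # All characters except whitespace.
    chars = [c for c in text if not c.isspace()]

    # Whitespace-separated tokens, lower-cased, with surrounding
    # ASCII punctuation stripped; empty leftovers are dropped.
    tokens = [t.strip(string.punctuation).lower() for t in text.split()]
    tokens = [t for t in tokens if t]

    punct_count = sum(1 for c in chars if c in string.punctuation)

    return {
        # Share of non-whitespace characters that are punctuation marks.
        "punctuation_density": punct_count / len(chars) if chars else 0.0,
        # Distinct word forms divided by total word tokens.
        "type_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
        # Average token length in characters.
        "mean_word_length": (
            sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0
        ),
    }


if __name__ == "__main__":
    sample = "Come, let us reason together; for the day is short."
    for name, value in formal_features(sample).items():
        print(f"{name}: {value:.3f}")
```

Feature values of this kind, computed for every text, could then sit alongside lexical and conceptual information as input to whatever grouping or classification method is chosen. Features such as type/token ratio are sensitive to text length, so in practice texts would need to be sampled or normalised before being compared.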
The conference was very well organised by Richard J. Whitt, with a lovely lunch and dinner in which attendees shared ideas and dug further into linguistic issues. Susan and Seth were delighted to participate.