
Looking back, looking forward: Linguistic DNA in 2016 and 2017

As we move into 2017, we’ve been looking back at achievements in 2016, and ahead to what we aim to achieve in the coming year.

2016 was an outwardly busy year as we travelled to Bruges, Essen, Krakow, Lausanne, Leeds, Brighton, Murcia, Nottingham, Paris, Saarbrücken, and Utrecht, sharing more of our thinking and early data with different audiences. Closer to “home”, we benefitted from the exchange of ideas at LDNA-hosted panels at the Sheffield DH Congress and at our second methodological workshop in Sussex. In 2017, we will refocus on interface development and more in-depth research, though we intend to be present at DH, SHEL, ICAME and SHARP in order to continue some fruitful conversations.

On the blog, we have been reflecting on representativeness and the nature of EEBO-TCP. We’ve also documented our decision not to use ECCO’s OCR data to analyse eighteenth-century print. You can expect to hear later in 2017 about the alternative eighteenth-century datasets we’re choosing to work with.

During the Autumn, the LDNA researchers collaborated on two articles about the project, its theory and praxis, both (hopefully) to be published this year following peer review. Generating examples from each research theme based on our early data and tying these together effectively was an enjoyable challenge, and we have already used the draft of one piece as part of our briefing materials for upcoming MA placements at The Digital Humanities Institute | Sheffield (formerly known as HRI Digital).

In the past six months, the Sheffield team have secured funding for two additional applications of the Linguistic DNA “concept modelling” tools:

  • The ESRC project Ways of Being in a Digital Age combines our quantitative insights with a qualitative literature survey of academic publications. Scheduled to inform the ESRC’s next programme of digital society funding, this impactful study has compelled us toward rapid prototype development. The interface being put together to serve ‘WoBDA’ colleagues will also form the kernel of the subsequent LDNA workbench.
  • From next month, we are involved in another funded impact-related project, collaborating with the University of Leeds to explore the conceptual structure of millions of YouTube video comments on the theme of militarisation, as part of a larger project funded by the Swedish Research Council. This is a six-month commitment, bringing in a further research associate to theorise what’s involved in applying our measures to some very different data.

We also have three significant applications in place for other pots of funding, including Horizon 2020 collaborations, attesting to our confidence in our nascent processes and in the multifarious opportunities for their application and impact.

Meanwhile, Glasgow has been using the present word co-occurrence data to develop its methodology for investigating processor data from the perspective of key Historical Thesaurus categories. We have continued to develop our analysis of Thesaurus categories, looking for those which show abnormal growth or decline; a provisional methodology for establishing statistical ‘baselines’ has been plotted out and is now being implemented and refined. Further possibilities are being tested, such as amalgamating data across whole layers of the HT hierarchy rather than by individual category, and separating out parts of speech within categories or layers.
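
As a rough illustration of the ‘baseline’ idea described above, one might standardise each category’s change in relative frequency against the spread of change across all categories, and flag the outliers. (This is a minimal sketch in Python; the category names, counts, and one-standard-deviation threshold are invented for demonstration and are not the project’s actual method.)

```python
# Illustrative sketch: flag thesaurus-style categories whose change in
# relative frequency between two periods deviates markedly from the norm.
# All category names and counts here are hypothetical.
import math
from statistics import mean, stdev

# (category, tokens in period 1, tokens in period 2) -- invented counts
counts = [
    ("Faith",      1200, 1100),
    ("Navigation",  300,  900),
    ("Warfare",     800,  820),
    ("Printing",    150,  460),
    ("Farming",     600,  540),
]

total1 = sum(c1 for _, c1, _ in counts)
total2 = sum(c2 for _, _, c2 in counts)

# log-ratio of relative frequencies: positive = growth, negative = decline
ratios = {name: math.log((c2 / total2) / (c1 / total1))
          for name, c1, c2 in counts}

# baseline: mean change and its spread across all categories
mu, sigma = mean(ratios.values()), stdev(ratios.values())

# categories more than one standard deviation from the baseline
outliers = {name: round((r - mu) / sigma, 2)
            for name, r in ratios.items() if abs(r - mu) > sigma}
print(outliers)
```

On this toy data, “Navigation” and “Printing” stand out as abnormal growth relative to the baseline; the real method would of course also need to handle category size, part of speech, and position in the HT hierarchy, as discussed above.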

[Chart: frequency of the stem “transl-” in ECCO OCR as a percentage of TCP.]

Experimenting with the imperfect: ECCO & OCR

When the Linguistic DNA project was first conceived, we aimed to incorporate more than 200,000 items from Eighteenth Century Collections Online (ECCO). Comparing findings for one portion of ECCO that has been digitised in different ways, this 2016 blogpost details why that ambition proved impractical. The public database uses ECCO-TCP as its main eighteenth-century source.

LDNA’s first year: Reflections from RA Seth Mehl

In wrapping up the first year of LDNA, I’ve taken a moment to consider some of the over-arching questions that have occupied much of my creative and critical faculties so far. What follows is a personal reflection on some issues that I’ve found especially exciting and engaging.

Semantics and concepts

The Linguistic DNA project sets out to identify ‘semantic and conceptual change’ in Early Modern English texts, with attention to variation too, particularly in the form of semantic and conceptual variation across text types. The first questions, for me, then, were what exactly constitutes semantics and what we mean when we say concept. These are, in part, abstract questions, but they must also be defined in terms of practical operations for computational linguistics. Put differently, if semantics and concepts are not defined in terms of features that can be identified automatically by computer, then the definitions are not terribly useful for us.

My first attempt at approaching semantics and concepts for the project began with synonymy, then built up to onomasiological relationships, and then defined concepts as networks of onomasiological relationships. Following Kris Heylen’s visit, I realised just how similar this approach was to the most recent QLVL work. My next stab at approaching these terms moved towards an idea of encyclopaedic meaning inspired in part by the ‘encyclopaedic semantics’ of Cognitive Linguistics, and related to sets of words in contexts of use. This approach seemed coherent and effective. We have since come to define concepts, for our purposes, as discursive, operating at a level larger than syntactic relations, phrases, clauses, or sentences, but smaller than an entire text (and therefore dissimilar from topic modelling).


Given that the project started without a definition of semantics and concept, it follows that the operationalisation of identifying those terms had not been laid out either. As a corpus semanticist, the natural start for me was to sort through corpus methods for automatic semantic analysis, including collocation analysis, second-order collocations, and vector space models. We continue to explore those methods by sorting through various parameters and variables for each. Most importantly, we are working to analyse our data in terms of linguistically meaningful probabilities. That is, we are thinking about the co-occurrence of words not simply as data points that might arise randomly, but as linguistic choices that are rarely, if ever, random. This requires us to consider how often linguistic events such as lexical co-occurrences actually arise, given the opportunity for them to arise. If we hope to use computational tools to learn about language, then we must certainly ensure that our computational approaches incorporate what we know about language, randomness, and probability.
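
The point about probability can be made concrete with a toy example. A raw co-occurrence count treats every word pair alike, whereas a measure such as pointwise mutual information (PMI) compares how often two words actually co-occur with how often chance alone would predict. (The miniature ‘corpus’ below is invented for illustration, and sentence-level co-occurrence is a deliberate simplification of a real collocation window.)

```python
# Toy illustration: raw co-occurrence counts vs. pointwise mutual
# information (PMI). The tiny 'corpus' is invented for demonstration.
import math
from collections import Counter
from itertools import combinations

sentences = [
    "the divine grace of god".split(),
    "the king spoke".split(),
    "the law of the land".split(),
    "divine grace and divine law".split(),
    "the people heard the king".split(),
]

word_freq = Counter(w for s in sentences for w in s)
total = sum(word_freq.values())

# co-occurrence within a sentence (a crude stand-in for a window)
pair_freq = Counter()
for s in sentences:
    for a, b in combinations(sorted(set(s)), 2):
        pair_freq[(a, b)] += 1
pairs_total = sum(pair_freq.values())

def pmi(a, b):
    """PMI: log of observed pair probability over chance expectation."""
    p_ab = pair_freq[tuple(sorted((a, b)))] / pairs_total
    p_a, p_b = word_freq[a] / total, word_freq[b] / total
    return math.log2(p_ab / (p_a * p_b))

# 'divine'/'grace' and 'the'/'of' each co-occur twice, but PMI ranks the
# contentful pair higher, because 'the' is frequent throughout the corpus
# and so its co-occurrences are closer to what chance predicts.
print(pmi("divine", "grace"), pmi("the", "of"))
```

Here the two pairs have identical raw counts, yet PMI separates them: the function words co-occur roughly as often as chance predicts, while ‘divine grace’ looks like a linguistic choice. This is the sense in which co-occurrences are treated as choices rather than random data points.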

Equally important was the recognition that although we are using corpus methods, we are not working with corpora, or at least not with corpora as per standard definitions. I define a corpus as a linguistic data-set sampled to represent a particular population of language users or of language in use. Corpus linguists examine language samples in order to draw conclusions about the populations they represent. EEBO and ECCO are, crucially, not sampled to represent populations—they are essentially arbitrary data sets, collected on the basis of convenience, of texts’ survival through history, and of scholarly interest and bias, among other variables. It is not at all clear that EEBO and ECCO can be used to draw rigorous conclusions about broader populations. Within the project, we often refer to EEBO and ECCO as ‘universes of printed discourse’, which renders them a sort of population in themselves. From that perspective, we can conclude a great deal about EEBO and ECCO, and the texts they contain, but it is tenuous at best to relate those conclusions to a broader population of language use. This is something that we must continually bear in mind.

Rather than seeing the LDNA processor as a tool for representing linguistic trends across populations, I have recently found it more useful to think of our processor primarily as a tool to aid in information retrieval: it is useful for identifying texts where particular discursive concepts appear. Our tools are therefore expected to be useful for conducting case studies of particular texts and sets of texts that exemplify particular concepts. In a related way, we use the metaphor of a topological map where texts and groups of texts exemplifying concepts rise up like hills from the landscape of the data. The processor allows us to map that topography and then ‘zoom in’ on particular hills for closer examination. This has been a useful metaphor for me in maintaining a sense of the project’s ultimate aims.

All of these topics represent ongoing developments for LDNA, and one of the great pleasures of the project has been the engaging discussions with colleagues about these issues over the last year.