
From Data to Evidence (d2e): conference reflections

Fraser and Iona report from Helsinki (November 2015):

Six members of the Linguistic DNA team were present at the recent d2e conference held by the VARIENG research unit at the University of Helsinki, Finland. The focus of the conference was on tools and methodologies employed in corpus linguistics, whilst the event took for its theme ‘big data, rich data, uncharted data’. The conference offered much food for thought, raising our awareness of the tools and methods employed by other researchers in similar fields. It was frequently clear that, despite the differing goals of (for example) sociolinguistics and historical semantics, the knowledge and approaches to data developed in one field could be effectively and productively applied in another.

The conference’s plenary speeches were of particular interest. Tony McEnery delineated potential limitations of corpus data and its analysis. His call for researchers to remain aware of the limitations of their data struck a chord with our findings from close examination of EEBO data in its raw and processed forms. One of his main conclusions was the importance of conducting cyclical research: analysing the data with software tools and then returning to the data itself to verify the validity of the findings. LDNA is set up to follow this approach, and Professor McEnery’s presentation reaffirmed its importance. Plenaries by Jane Winters and Päivi Pahta looked further into working with historical data and, in the latter case particularly, historical linguistic data, whilst a fascinating presentation by Mark Davies emphasised the importance of corpus size in the type of research which we are undertaking.

LDNA is also taking an active interest in innovative approaches to data analysis and visualisation. Demonstrating software, Gerold Schneider, Eetu Mäkelä, and Jonathan Hope each showcased new tools for representing historical language data and wrangling with metadata. As we progress in our thinking about the kinds of processing which will allow us to identify concepts in our data, we are always on the lookout for ideas and methodological developments which might help us to improve our own findings.

Several research papers connected with the interests of LDNA, especially when they adhered closely to the conference’s theme of exploring large and complex datasets in ways which reveal new patterns in the data. James McCracken’s presentation on adding frequency information to the Oxford English Dictionary was very exciting for the possibilities it could open up for future historical linguistics. (We’ve blogged before about the drawback of not having relevant frequency data when using tools like VARD.) Meanwhile, the techniques used to track change in words’ behaviour, with different dimensions of semantic evolution scrutinised by Hendrik De Smet (for Hansard), Gerold Schneider (in COHA), and Hannah Kermes and Stefania Degaetano-Ortlieb of Saarland University (working with the Royal Society Corpus), were not only intrinsically fascinating but also provided useful pointers towards the depth and complexity of linguistic features LDNA will need to consider. We will also aim to keep in view Joseph Flanagan’s insistence that linguistic studies should aim for reproducibility, an insistence aided (for those who code with R) by the suite of tools he recommended.

The d2e conference packed a lot into a few days, creating an intense and productive atmosphere in which participants could meet, exchange ideas, and become more aware of the scope of others’ work in related fields. We enjoyed the conversations around our own poster, and much appreciated the hospitality throughout. It was a great opportunity for the LDNA team, providing invaluable input to our thinking and our approach to the work ahead.


Abstracts from the conference are available from the d2e pages on the Varieng website.

Anni Aarinen provides a write-up of McEnery’s keynote.

Glasgow-based LDNA member Brian Aitken has written up his d2e experience on the Digital Humanities blog.


Workshop Reflections

University of Sussex – view of campus from above

A fortnight ago, our first methodology workshop was held at the University of Sussex. The programme was full and productive for the project team, with plenty of opportunities to test our thinking about how we move forward, and it has given us much to think about. We can perhaps best summarise some of the overarching themes by starting with the questions we began with, together with some more that were raised during the event.

Top of our minds going in were questions such as: What is a concept? How will we recognise one when we find it? How exactly do we (or should we) go about finding concepts in the first place? Our thinking on these matters has taken a step forward thanks to the discussions, and the next couple of blog posts are already in preparation to explore what we’ve learned and the directions it suggests for the coming months. Suggestions raised included investigating synonymous terms and the relationships between onomasiological conceptual fields. Our ideas are to some extent still forming as we consider these suggestions afresh and develop our thinking in the process.

Another major question concerned the importance of marking up and pre-processing the data before we begin to run our own processes. The issue of spelling regularisation has formed a large part of our initial work on the data of EEBO, with our comparison of the VARD and MorphAdorner tools documented in several earlier posts. It is not only spelling that is at issue: pre-processing texts with MorphAdorner and the Historical Thesaurus Semantic Tagger also offers further layers of annotation. Because our new processes can be designed to take in multiple types of input (e.g. lemma, part of speech) or combinations of these, we were curious to learn what workshop participants thought we should prioritise.
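To make the idea of ‘multiple types of input’ concrete, here is a minimal sketch of a token carrying several annotation layers, from which a downstream process could select one layer (or a combination) as its input. This is illustrative only: the field names and example values are our own invention, not the actual output schema of MorphAdorner or the Historical Thesaurus Semantic Tagger.

```python
from dataclasses import dataclass

# Hypothetical token record combining annotation layers of the kind
# that pre-processing tools can supply.
@dataclass
class AnnotatedToken:
    spelling: str      # original (early modern) spelling
    regularised: str   # regularised spelling
    lemma: str         # lemma
    pos: str           # part-of-speech tag

def layer(tokens, key):
    """Project a chosen annotation layer out of the token stream."""
    return [getattr(t, key) for t in tokens]

tokens = [
    AnnotatedToken("loue", "love", "love", "n1"),
    AnnotatedToken("beareth", "beareth", "bear", "vvz"),
]

print(layer(tokens, "regularised"))  # ['love', 'beareth']
print(layer(tokens, "lemma"))        # ['love', 'bear']
```

The design question the workshop raised maps directly onto the `key` argument here: which layer, or which combination of layers, should the parser consume?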

There was extensive discussion about the extent to which the text should be pre-processed before being loaded into the parser, and there was some disagreement over whether spelling regularisation is itself a necessary step or whether it ought not to be used because it skews word frequency counts. Whether or not an individual method of pre-processing proves fruitful – or, indeed, if it is better to process the raw text itself – it is ultimately to our benefit to have explored these avenues and to be able to say with authority what has been successful and what has not.

A final core point was the question of the technology which we plan to use and how we can build on the most effective tools already produced for linguistic research. As the Digital team at Sheffield (@HRIDigital) are beginning work on building the parser, we wanted to consider what parts of that process should be created from scratch and what parts can be effectively accomplished using software which already exists.

In the course of presentations and discussions, participants drew our attention to a variety of tools, which we have prioritised for investigation: among them tools for identifying synonymy and polysemy, word sense disambiguation, novel sense detection, and topic identification. The result is fresh ideas for technologies to explore, and the research associates have got to work learning about tools such as Gensim, HiDEx (High Dimensional Explorer), and BlackLab.
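The tools mentioned for identifying synonymy, such as Gensim and HiDEx, rest on the distributional hypothesis: words used in similar contexts tend to have related meanings. As a toy illustration of the underlying idea (not the project’s actual method, and far simpler than what those libraries implement), one can compare the co-occurrence profiles of two words using cosine similarity. The miniature corpus below is invented for demonstration.

```python
import math
from collections import Counter

def cooccurrence_vector(target, corpus, window=2):
    """Count words appearing within `window` tokens of `target`."""
    vec = Counter()
    for sent in corpus:
        for i, w in enumerate(sent):
            if w == target:
                lo, hi = max(0, i - window), i + window + 1
                for j in range(lo, min(hi, len(sent))):
                    if j != i:
                        vec[sent[j]] += 1
    return vec

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

corpus = [
    "the king ruled the realm".split(),
    "the monarch ruled the realm".split(),
    "the dog chased the cat".split(),
]

king = cooccurrence_vector("king", corpus)
monarch = cooccurrence_vector("monarch", corpus)
dog = cooccurrence_vector("dog", corpus)

# 'king' and 'monarch' share contexts, so they score higher than 'king'/'dog'.
print(cosine(king, monarch) > cosine(king, dog))  # True
```

Real systems replace raw counts with weighted or dense vector representations over millions of tokens, but the comparison step is conceptually the same.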

From the very start, we have been clear that we want to be able to understand and explain as much as possible how our processes work, rather than create something which acts as a ‘black box’, devouring input and producing results in a manner that cannot be evaluated or understood. Conducting these discussions while we’re still in the design phase has helped reinforce the value of that ideal for the team.

We firmly believe that drawing on the expertise and experience of the academic community in our field will make Linguistic DNA a stronger project. The workshop helped to progress our thinking, and we’d like to thank again everyone who attended the event—your input is hugely appreciated, and we look forward to sharing with you where it takes us!

Welcome to the Linguistic DNA blog!

Linguistic DNA cloud (created with Tagul)

The Linguistic DNA blog is a space for those working on the project to reflect on methodology, findings, and other aspects of the project in an informal way.

Fraser, Iona, and Seth (the research associates) will be taking it in turns to share what we have been working on.  At present, the website is gradually taking shape thanks to the Sheffield team, while Fraser is hard at work drafting conference papers.

Our chain of Linguistic DNA (image above) has been generated from the initial text of our website, using Tagul.