Monthly Archives: October 2015

Distributional Semantics I: What might distribution tell us about word meaning?



In a previous post, I asked ‘What is the link between corpus data showing lexical usage, on the one hand, and lexical semantics or concepts, on the other?’ In this post, I’d like to forward that discussion by addressing one component of it: how we observe lexical semantics (or word meaning) via distributional data in texts. That is, how do we know what we know about semantics from distributional data?

Linguists use proximity data from corpora to analyse everything from social implications of discourse, to politeness in pragmatics, to synonymy and hyponymy. Such data is also used by researchers in statistical natural language processing (NLP) for information retrieval, topic identification, and machine learning, among other things. Different researchers tend to use such data towards different ends: for some NLP researchers, it is enough to engineer a tool that produces satisfactory outputs, regardless of its implications for linguistic theory. For sociolinguists and discourse analysts, the process is often one of identifying social or behavioural trends as represented in language use (cf. Baker et al. 2013, Baker 2006). Despite the popularity of studies into meaning and corpora, the question of precisely what sorts of meaning can or can’t be indicated by such data remains remarkably under-discussed.

So, what aspects of meaning, and of word meaning in particular, might be indicated by proximity data?

Many introductory books on corpus semantics would seem to suggest that if you want to know what kinds of word meaning can be indicated by proximity data and distributional patterns, examining a list of co-occurring words, or words that occur in similar contexts, is a good start. Often, the next step (according to the same books) is to look closely at the words in context, and then to perform a statistical analysis on the set of co-occurrences. The problem arises in the last step. All too often, the results are interpreted impressionistically: which significant co-occurrences are readily interpretable in relation to your research questions? You may see some fascinating and impressive things, or you may not, and it’s too easy to disregard outputs that don’t seem relevant on the surface.

An operation like that described above lacks rigour in multiple ways. To disregard outputs that aren’t obviously relevant is to ignore what is likely to be some of the most valuable information in any corpus study (or in any scientific experiment). In addition, the method skips the important step of accounting for the precise elements of meaning in question, and how (or indeed whether) those elements might be observed in the outputs.
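To make the first steps concrete, here is a minimal sketch of listing co-occurrences within a window and scoring them statistically. It is an illustration under stated assumptions, not a description of any particular tool: the toy text is invented, the window size of 2 is arbitrary, and pointwise mutual information (PMI) with a deliberately simple probability estimate stands in for whatever association measure a given study might use.

```python
import math
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count unordered pairs of distinct words within `window` tokens of each other."""
    pairs = Counter()
    for i, word in enumerate(tokens):
        # Look only forwards so that each pair of positions is counted once.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != word:
                pairs[frozenset((word, other))] += 1
    return pairs

def pmi_scores(tokens, window=2):
    """Score every co-occurring pair by PMI (log base 2), using simple relative frequencies."""
    word_freq = Counter(tokens)
    pairs = cooccurrence_counts(tokens, window)
    total_words = len(tokens)
    total_pairs = sum(pairs.values())
    scores = {}
    for pair, count in pairs.items():
        a, b = sorted(pair)
        p_pair = count / total_pairs
        p_a = word_freq[a] / total_words
        p_b = word_freq[b] / total_words
        scores[(a, b)] = math.log2(p_pair / (p_a * p_b))
    return scores

# Invented toy text, not drawn from any real corpus.
toy_tokens = "they build a residence and they build an abode near the residence".split()

scores = pmi_scores(toy_tokens)
# Print every pair, ranked, rather than only the readily interpretable ones.
for pair, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(pair, round(score, 2))
```

The point of printing the full ranked list is that pairs with no obvious semantic relation sit alongside the interpretable ones, and both are part of the output that needs explaining.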

An analysis of proximity data from Early Modern English texts might (hypothetically) show a significant similarity between the terms abode and residence. Such pairs are straightforward and exciting: we can readily see that we have automatically identified near-synonyms.
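One common way of quantifying this kind of similarity is to compare the contexts in which two words occur. The sketch below assumes a simple approach: build a count vector of neighbouring words for each term and compare the vectors by cosine similarity. The toy sentences are invented (they are not Early Modern English data), and the window size and the function names are my own choices for illustration.

```python
import math
from collections import Counter

def context_vector(tokens, target, window=2):
    """Count the words appearing within `window` tokens either side of `target`."""
    vector = Counter()
    for i, word in enumerate(tokens):
        if word == target:
            left = tokens[max(0, i - window) : i]
            right = tokens[i + 1 : i + 1 + window]
            vector.update(left + right)
    return vector

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Invented toy text, used only to show the mechanics.
toy_tokens = (
    "he left his abode in london . she left her residence in london . "
    "they build a residence in york ."
).split()

abode = context_vector(toy_tokens, "abode")
residence = context_vector(toy_tokens, "residence")
build = context_vector(toy_tokens, "build")

print("abode ~ residence:", round(cosine_similarity(abode, residence), 2))
print("build ~ residence:", round(cosine_similarity(build, residence), 2))
```

In this toy text, abode and residence share more of their neighbours than build and residence do, so the first score comes out higher, even though build and residence occur right next to each other. Occurring together and occurring in similar contexts are different measurements, and they need not point to the same meaning relation.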

Often, researchers are looking to identify synonymy. But that’s not all: researchers might also be after hyponymy, co-hyponymy, antonymy, meronymy, auto-hyponymy, polysemy, or conceptual or discursive relations. In addition, as Geeraerts (2010: 178) points out, we might want to find out specific details about what a noun referent looks like, for example. Can we retrieve any of that information (reliably or consistently) from distributional data, i.e. from co-occurrences in texts?

Examples like abode and residence aren’t the norm. We also see examples like build and residence. What is the meaning relation here? Action and undergoer? A conceptual field related to building residences? Something else entirely?

And what about other pairs of terms with no clear semantic relation whatsoever? Do we disregard them? Impressionistically, it’s easy to pick out the instances of synonymy, or even relationships like Action/Undergoer or Agent/Patient, and to ignore the huge number of semantically unrelated collocates (or collocates with less obvious relations). But that’s not a terribly rigorous method.

By definition, we know that in proximity data, we are observing words that co-occur. That leaves us to test what kinds of semantic relations are actually indicated, quantitatively, by co-occurrence. This moves us from the vague statement that words are known by the company they keep, towards a scientific account of the relationship between co-occurrence and semantic relations. In the next post (coming soon), I report on exactly that.


Baker, P. (2006) Using Corpora in Discourse Analysis. London: Continuum.

Baker, P., Gabrielatos, C. and McEnery, T. (2013) Discourse Analysis and Media Attitudes: The Representation of Islam in the British Press. Cambridge: Cambridge University Press.

Geeraerts, D. (2010) Theories of Lexical Semantics. Oxford: Oxford University Press.


Workshop Reflections

University of Sussex--view of campus from above

A fortnight ago, our first methodology workshop was held at the University of Sussex. It was a full programme and productive for the project team with lots of opportunities for us to test out our thinking about how we move forward, and it has given us plenty to think about. We can perhaps best summarise some of the overarching themes by starting with the questions we began with and some more that were raised during the event.

Top in our minds going in were questions such as ‘What is a concept?’, ‘How will we recognise one when we find it?’ and ‘How exactly do we (should we) go about finding concepts in the first place?’ Our thinking on these matters has taken a step forward thanks to these discussions, and the next couple of blog posts are already in preparation to explore what we’ve learned and the directions that this suggests for us in the coming months. Suggestions that were raised included investigating synonymous terms, and the relationships between onomasiological conceptual fields. Our ideas are to some extent still forming as we consider these suggestions afresh and develop them further.

Another major question concerned the importance of marking up and pre-processing the data before we begin to run our own processes. The issue of spelling regularisation has formed a large part of our initial work on the EEBO data, with our comparison of the VARD and MorphAdorner tools documented in several earlier posts. It is not only spelling that is at issue: pre-processing texts with MorphAdorner and the Historical Thesaurus Semantic Tagger also offers layers of annotation. Because our new processes can be designed to take in multiple types of input (e.g. lemma, part of speech) or combinations of these, we were curious to learn what workshop participants thought we should prioritise.

There was extensive discussion about the extent to which the text should be pre-processed before being loaded into the parser, and there was some disagreement over whether spelling regularisation is itself a necessary step or whether it ought not to be used because it skews word frequency counts. Whether or not an individual method of pre-processing proves fruitful – or, indeed, if it is better to process the raw text itself – it is ultimately to our benefit to have explored these avenues and to be able to say with authority what has been successful and what has not.
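The concern about frequency counts is easy to see in a small example. The sketch below is only an illustration: the variant spellings and the mapping are invented, and the mapping is a hand-written stand-in, not the output or interface of VARD or MorphAdorner.

```python
from collections import Counter

# Hypothetical variant-to-standard mapping, invented for illustration only.
REGULARISED = {"aboad": "abode", "residens": "residence"}

def regularise(tokens, mapping):
    """Replace each token with its regularised form, leaving unknown tokens unchanged."""
    return [mapping.get(token, token) for token in tokens]

# Invented toy text containing two spellings of one word.
raw_tokens = "his aboad and her abode and their residens".split()

raw_counts = Counter(raw_tokens)
regularised_counts = Counter(regularise(raw_tokens, REGULARISED))

# The count for "abode" changes once the variant spelling is folded in.
print("raw:", raw_counts["abode"], "regularised:", regularised_counts["abode"])
```

Whether the merged count or the separate counts are the more faithful picture of the text is exactly what was in dispute: regularisation pools evidence for one word, but any wrong mapping feeds straight into every count built on top of it.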

A final core point was the question of the technology which we plan to use and how we can build on the most effective tools already produced for linguistic research. As the Digital team at Sheffield (@HRIDigital) are beginning work on building the parser, we wanted to consider what parts of that process should be created from scratch and what parts can be effectively accomplished using software which already exists.

In the course of presentations and discussions, participants drew our attention to a variety of tools. We have prioritised these for our attention, including those for identifying synonymy and polysemy, word sense disambiguation, novel sense detection, and topic identification. The result is fresh ideas for some technologies to investigate, and so the research associates have got to work learning about tools such as Gensim, HiDEx (High Dimensional Explorer), and BlackLab.

From the very start, we have been clear that we want to be able to understand and explain as much as possible how our processes work, rather than create something which acts as a ‘black box’, devouring input and producing results in a manner that cannot be evaluated or understood. Conducting these discussions while we’re still in the design phase has helped reinforce the value of that ideal for the team.

We firmly believe that drawing on the expertise and experience of the academic community in our field will make Linguistic DNA a stronger project. The workshop helped to progress our thinking, and we’d like to thank again everyone who attended the event—your input is hugely appreciated, and we look forward to sharing with you where it takes us!