What might distribution tell us about word meaning?
Linguists use proximity data from corpora to analyse everything from social implications of discourse, to politeness in pragmatics, to synonymy and hyponymy. Such data is also used by researchers in statistical natural language processing (NLP) for information retrieval, topic identification, and machine learning, among other things. Different researchers tend to use such data towards different ends: for some NLP researchers, it is enough to engineer a tool that produces satisfactory outputs, regardless of its implications for linguistic theory. For sociolinguists and discourse analysts, the process is often one of identifying social or behavioural trends as represented in language use (cf. Baker et al. 2013, Baker 2006). Despite the popularity of studies into meaning and corpora, the question of precisely what sorts of meaning can or can’t be indicated by such data remains remarkably under-discussed.
So, what aspects of meaning, and of word meaning in particular, might be indicated by proximity data?
Many introductory books on corpus semantics would seem to suggest that if you want to know what kinds of word meaning can be indicated by proximity data and distributional patterns, examining a list of co-occurring words, or words that occur in similar contexts, is a good start. Often, the next step (according to the same books) is to look closely at the words in context, and then to perform a statistical analysis on the set of co-occurrences. The problem arises in the last step. All too often, the results are interpreted impressionistically: which significant co-occurrences are readily interpretable in relation to your research questions? You may see some fascinating and impressive things, or you may not, and it’s too easy to disregard outputs that don’t seem relevant on the surface.
An operation like that described above lacks rigour in multiple ways. To disregard outputs that aren’t obviously relevant is to ignore what is likely to be some of the most valuable information in any corpus study (or in any scientific experiment). In addition, the method skips the important step of accounting for the precise elements of meaning in question, and how (or indeed whether) those elements might be observed in the outputs.
In Early Modern English, an analysis of proximity data might (hypothetically) show a significant similarity between the terms abode and residence. Such pairs are straightforward and exciting: we can readily see that we have automatically identified near-synonyms.
Often, researchers are looking to identify synonymy. But that’s not all: researchers might also be after hyponymy, co-hyponymy, antonymy, meronymy, auto-hyponymy, polysemy, or conceptual or discursive relations). In addition, as Geeraerts (2010: 178) points out, we might want to find out specific details about what a noun referent looks like, for example. Can we retrieve any of that information (reliably or consistently) from distributional data, i.e. from co-occurrences in texts?
Examples like abode and residence aren’t the norm. We also see examples like build and residence. What is the meaning relation here? Action and undergoer? A conceptual field related to building residences? Something else entirely?
And what about other pairs of terms with no clear semantic relation whatsoever? Do we disregard them? Impressionistically, it’s easy to pick out the instances of synonymy, or even relationships like Action/Undergoer or Agent/Patient, and to ignore the huge number of semantically unrelated collocates (or collocates with less obvious relations). But that’s not a terribly rigorous method.
By definition, we know that in proximity data, we are observing words that co-occur, which left us to test what kinds of semantic relations are actually indicated, quantitatively, by co-occurrence. This moves us from the vague statement that words are known by the company they keep, towards a scientific account of the relationship between co-occurrence and semantic relations.
What does distribution tell us about semantic relations?
The Linguistic DNA team assessed what exactly it can determine about semantics based on distributional analysis: from encyclopaedic meaning to specific semantic relations. The following details what distribution can tell us about semantic relations in particular, including synonymy, hyponymy, and co-hyponymy.
In the Natural Language Processing (NLP) sphere, numerous studies have tested the effectiveness of distributional data in identifying semantic relations. Turney and Pantel (2010) provide a useful survey of such studies, many of which involve machine learning, and computer performance on synonymy tests including those found on English language exams. Examples of success on synonymy tests have employed windows of anything from +/-4 words up to +/-150 words, but such studies tend not to test various approaches against each other, and they rarely dissect the notion of synonymy, much less co-hyponymy or other semantic relations.
Only a few studies have tested distributional methods as indicators of specific semantic relations. The Quantitative Lexicology and Variational Linguistics (QLVL) team at KU Leuven has addressed this problem in several papers. For example, Peirsman et al. (2007) looked at evidence for synonymy, hyponymy, and co-hyponymy in proximity data for Dutch. (A hyponym is a word whose meaning is a member of a larger category – for example, a crow and a robin are both types of bird, so crow and robin are both hyponyms of bird, and crow and robin are co-hyponyms of each other, but they are not synonyms of each other). Peirsman et al. looked at raw proximity measures as well as proximity measures that incorporate syntactic dependency information. Their findings demonstrate that in Dutch, synonymy and hyponymy are more readily indicated by proximity analyses that include syntactic dependency. On the other hand, they show that co-hyponymy is most effectively evidenced by raw proximity measures that do not include syntactic information.1 This finding is a startling result, with fascinating implications for linguistic theory. Why should ignoring syntactic information provide better measures of co-hyponymy? Might English be similar? How about Early Modern English?
In Peirsman et al. (ibid.), 6.3% of words that share similar distributional characteristics with a given word, or node, are synonyms with that node, and 4.0% are hyponyms of that node. Put differently, about 94% of words identified by distributional analysis aren’t synonyms, and round 70% of the words elicited in these measures are not semantically related to the node at all. Experienced corpus semanticists will not be surprised by this. But what happens to the majority of words, which aren’t related in any clear way? A computer algorithm will output all significant co-occurrences. Often, the co-occurrences that are not intuitively meaningful are quietly ignored by the researcher. If we are going to ignore such outputs, we must do so explicitly and with complete transparency. But this raises bigger questions: If we trust our methods, why should we ignore counterintuitive outputs? Or are these methods valuable simply as reproducible heuristics? I would argue that we should be transparent about our perspective on our own methods.
Also from QLVL, Heylen et al. (2008a) tests which types of syntactic dependency relations are most effective at indicating synonymy in Dutch nouns, and finds that Subject and Object relations most consistently indicate synonymy, but that adjective modification can give the best (though less consistent) indication of synonymy. In fact, adjective modification can be even better than a combined method using adjective modification and Subject/Object relations. Again, the findings are startling, and fascinating—why would the consideration of Subject/Object relations actually hinder the effective use of adjective modification as evidence of synonymy? The answer is not entirely clear. In a comparable study, Van der Plas and Bouma (2005) found Direct Object relations and adjective modification to be the most effective relations in identifying synonymy in Dutch. Unlike Heylen et al.’s (2008a) findings, Van der Plas and Bouma (2005) found that combining dependency relations improved synonym identification.
Is proximity data more effective in determining the semantics of highly frequent words? Heylen et al. (2008b) showed that in Dutch, high frequency nouns are more likely to collocate within +/-3 words with nouns that have a close semantic similarity, in particular synonyms and hyponyms. Low frequency nouns are less likely to do so. In addition, in Dutch, syntactic information is the best route to identifying synonymy and hyponymy overall, but raw proximity information is in fact slightly better at retrieving synonyms for medium-frequency nouns. This finding, then, elaborates on the finding in Peirsman et al. (2007; above).
How about word class? Peirsman et al. (2008) suggest, among other things, that in Dutch, a window of +/-2 words best identifies semantic similarity for nouns, while +/-4 to 7 words is most effective for verbs.
For Linguistic DNA, it was important to know exactly what we could and couldn’t expect to determine based on distributional analysis. We planned to employ distributional analysis using a range of proximity windows as well as syntactic information.
*Castle Arenberg, in the photo above, is part of KU Leuven, home of QLVL and many of the studies cited in this post. (Credit: Juhanson. Licence: CC BY-SA 3.0.)
Baker, P. (2006) Using Corpora in Discourse Analysis. London: Continuum.
Baker, P. Gabrielatos, C. and McEnery. T. (2013) Discourse Analysis and Media Attitudes: The Representation of Islam in the British Press. Cambridge: Cambridge University Press
Geeraerts, Dirk. 2010. Theories of Lexical Semantics. Oxford: Oxford University Press
Heylen, Kris; Peirsman, Yves; Geeraerts, Dirk. 2008a. Automatic synonymy extraction: A Comparison of Syntactic Context Models. In Verberne, Suzan; van Halteren, Hans; Coppen, Peter-Arno (eds), Computational linguistics in the Netherlands 2007. Amsterdam: Rodopi, 101-16.
Heylen, Kris; Peirsman, Yves; Geeraerts, Dirk; Speelman, Dirk. 2008b. Modelling word similarity: An evaluation of automatic synonymy extraction algorithms. In: Calzolari, Nicoletta; Choukri, Khalid; Maegaard, Bente; Mariani, Joseph; Odjik, Jan; Piperidis, Stelios; Tapias, Daniel (eds), Proceedings of the Sixth International Language Resources and Evaluation. Marrakech: European Language Resources Association, 3243-49.
Peirsman, Yves; Heylen, Kris; Speelman, Dirk. 2007. Finding semantically related words in Dutch. Co-occurrences versus syntactic contexts. In Proceedings of the 2007 Workshop on Contextual Information in Semantic Space Models: Beyond Words and Documents, 9-16.
Peirsman, Yves; Heylen, Kris; Geeraerts, Dirk. 2008. Size matters: tight and loose context definitions in English word space models. In Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics, 34-41.
Turney, Peter D. and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research 37, 141-188.
van der Plas, Lonneke and Gosse Bouma. 2005. Syntactic Contexts for finding Semantically Similar Words. In Proceedings of CLIN 04