Proximity Data II: Co-occurrence and distance measurements

In a previous post, we addressed proximity data by defining proximity, term, and co-occurrence. In this post, we weigh specific options for measuring co-occurrence. In particular, we look at an array of distance measurements or windows for co-occurrence. In the Linguistic DNA project, we will be experimenting with many of the options and approaches below. Options for measuring co-occurrence and distance include the following.

Text Co-occurrence

Measures of text co-occurrence are generally used to measure similarity between texts. Put simply, a word frequency list is generated for each text, and frequency lists can then be compared across texts. Texts with similar frequency lists are considered to contain similar content.

The technical developers for Linguistic DNA have already begun compiling some frequency data using the programming language R. The text mining package that accompanies R includes a measure of text co-occurrence, which can be analysed statistically within R (a matter for a separate blog post). In general, when investigating text co-occurrence, word order is ignored, as are grammatical relationships, including clause and sentence boundaries.
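To make the idea of comparing frequency lists concrete, here is a minimal Python sketch. It is not the project's R code: the function names, the toy texts, and the choice of cosine similarity as the comparison measure are all our own assumptions for illustration.

```python
import math
import re
from collections import Counter

def word_frequencies(text):
    """Return a bag-of-words frequency list; word order is discarded."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two frequency vectors (0.0 = no shared words)."""
    shared = set(freq_a) & set(freq_b)
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented toy texts, for illustration only.
text_a = "the doctor went to the hospital"
text_b = "the hospital sent for the doctor"
text_c = "a ship sailed from port"

sim_ab = cosine_similarity(word_frequencies(text_a), word_frequencies(text_b))
sim_ac = cosine_similarity(word_frequencies(text_a), word_frequencies(text_c))
```

Here text_a and text_b share most of their vocabulary and so score higher than text_a and text_c, which share none. Note that the two similar texts order their words quite differently, and the measure does not notice.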

But boundaries can be significant.

Paragraph Co-occurrence

Rather than measuring co-occurrence in a complete text, it is also possible to measure co-occurrence paragraph by paragraph, using a similar approach. In addition to indicating paragraph topic, this approach can also be used for lexical semantics. In Landauer and Dumais’s (1997) approach, two terms that tend to co-occur within a single paragraph are likely to represent similar conceptual fields and even to be near-synonyms. They found this approach to be successful in machine learning of synonyms.

However, such an approach would likely be problematic for Linguistic DNA, as the encoding applied to EEBO paid little attention to paragraphing. Moreover, the use and purpose of paragraphs have evolved, so their value as a semantic indicator may not be the same for texts from the 1500s as for modern texts.
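For readers who want to see what paragraph-by-paragraph counting involves, the following Python sketch covers the raw counting step only. It assumes paragraphs are separated by blank lines (an assumption that, as noted, may not hold for EEBO), and it makes no attempt at the dimensionality reduction that Latent Semantic Analysis applies afterwards. The function name and sample text are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

def paragraph_cooccurrence(text):
    """Count, for each unordered pair of word types, the paragraphs containing both."""
    pair_counts = Counter()
    # Assumption: paragraphs are separated by one or more blank lines.
    for paragraph in re.split(r"\n\s*\n", text):
        types = sorted(set(re.findall(r"[a-z']+", paragraph.lower())))
        # Sorting makes each pair appear in one canonical (alphabetical) order.
        pair_counts.update(combinations(types, 2))
    return pair_counts

# Invented sample with three short paragraphs.
sample = (
    "the physician healed the sick\n\n"
    "the surgeon healed the wounded\n\n"
    "the ship sailed"
)
counts = paragraph_cooccurrence(sample)
```

In this sample, 'physician' and 'surgeon' never share a paragraph, but both co-occur with 'healed'. It is this kind of second-order pattern that later statistical processing would need to pick up in order to treat the two as related.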

Another mechanism is to impose boundaries by counting words to left and right. A variety of ‘windows’, or distances to the left and right, are possible. We discuss them below from the widest window to the narrowest window.

+/- 10 words

Burgess and Lund (1997) describe using a window of +/-10 words in their Hyperspace Analogue to Language. That is, they count all words within the range of 10 words to the left and right of each node. A frequency list is made for the window around each node, and frequency lists can then be compared for different nodes. Burgess and Lund (ibid.) claim psycholinguistic validity for this 10-word window, arguing that this represents the number of words held in working memory by a human listener/reader. They ignore word order and grammatical relations, including clause and sentence boundaries.

A word list with this window could be created with weightings that reflect distance from the node: the term that is 10 words to the left or right of the node can be weighted less than the term immediately to the left or right of the node. Such weighting would presume that words occurring in closer proximity are more likely to be semantically related than words that occur further apart.
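As a rough illustration of such weighting, the Python sketch below scores each neighbour by its distance from the node. The linear scheme used here (a neighbour at distance d scores window - d + 1) is our assumption for the example, as are the function name and the sample sentence; other weightings are equally possible.

```python
from collections import defaultdict

def weighted_window_counts(tokens, node, window=10):
    """Sum distance-weighted co-occurrence scores around each occurrence of `node`.

    Assumed weighting (illustrative): a neighbour at distance d scores
    window - d + 1, so adjacent words score `window` and the farthest score 1.
    Clause and sentence boundaries are ignored.
    """
    scores = defaultdict(float)
    for i, token in enumerate(tokens):
        if token != node:
            continue
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j == i:
                continue  # skip the node itself
            distance = abs(j - i)
            scores[tokens[j]] += window - distance + 1
    return dict(scores)

# Invented sentence; a window of 3 keeps the example easy to check by hand.
tokens = "the king sent his physician to heal the sick prince".split()
scores = weighted_window_counts(tokens, "physician", window=3)
```

With a window of 3 around 'physician', the adjacent words 'his' and 'to' receive the highest score, while 'sick' and 'prince' fall outside the window and are not counted at all.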

Up to +/- 5 words

A window of up to +/- 5 words is quite common, and many researchers have used windows of +/-5 words, +/-3 words, or +/-1 word. That is, of course, not to say that this is the best choice in any or all circumstances. Indeed, the choice is perhaps a computational heuristic: a count and a unit (the word) that most computing technology can cope with.

2 words to the left

Many part-of-speech taggers (including MorphAdorner’s) rely on trigrams consisting of a node word and two words to the left (cf. Manning and Schuetze 2001, Chapter 10). This is considered a reliable standard in the field for identifying part of speech. Additional words to the left or right seem to produce diminishing returns in reliably tagging parts of speech. Word order is crucial. Experimentation with this window for semantic studies is limited and there is no reason to presume that this window would be useful for semantic information.

+/- n Content Words

A common step in analysing proximity data is to catalogue all co-occurrences within a window and then remove stop words, typically grammatical/function words such as determiners (e.g. a, an, the) and prepositions (e.g. of, in, from). Depending on the research question, researchers may be less interested in these function words and more interested in the content/lexical words that co-occur with a node.

A variation on this approach is to strip such stop words from the corpus data first, and then to catalogue the n (e.g. 5) content/lexical words to the right and left of the node. As far as we’re aware, this approach has not been employed in published studies.
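The strip-first variation can be sketched in a few lines of Python. The stop list here is a tiny invented one for illustration, not a standard list, and the function name and sample sentence are likewise hypothetical.

```python
# Illustrative stop list only; a real study would need a fuller, justified list.
STOP_WORDS = {"a", "an", "the", "of", "in", "from", "to", "and"}

def content_word_window(tokens, node, n=5, stop_words=STOP_WORDS):
    """Return (left, right) lists of up to n content words per occurrence of node.

    Stop words are stripped BEFORE the window is measured, so the window
    can reach further across the original text than +/- n running words.
    """
    content = [t for t in tokens if t not in stop_words or t == node]
    contexts = []
    for i, token in enumerate(content):
        if token == node:
            left = content[max(0, i - n):i]
            right = content[i + 1:i + 1 + n]
            contexts.append((left, right))
    return contexts

tokens = "the physician of the king rode from the city in the morning".split()
contexts = content_word_window(tokens, "king", n=2)
```

In this example, 'city' is four running words to the right of 'king' but only two content words away, so it falls inside a window of n = 2 that a conventional +/-2 window would miss.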

Assessing the Proximity Windows

Which window is best? That depends in part on your research questions. Larger windows suit some questions and smaller windows suit others, and incorporating grammatical relationships into the window may be best of all for some questions. According to Sahlgren (2006; cf. Heylen et al. 2015) and Turney and Pantel (2010), extending proximity measurement to larger windows such as document co-occurrence (disregarding word order) is most useful for modelling syntagmatic and associative relations (or relational similarity) such as that between ‘doctor’ and ‘hospital’ or ‘car’ and ‘drive’. Proximity measurements using a narrow window can best indicate paradigmatic relations (or attributional similarity) like that between near-synonyms ‘hospital’ and ‘clinic’. Measuring grammatical relationships improves semantic findings for proximity measurements using narrow windows. Indeed, incorporating syntactic data can be extremely valuable for lexical semantic investigations, but, alas, it is not always possible.

Moving Forward

Linguistic DNA is beginning to design an automated method to draw proximity data from EEBO and ECCO. We call this automated method a semantic parser, or, with varying degrees of irony, a magic parsing box. This parser is being built one step at a time, tested on a dataset, and then assessed and developed further. At this stage, therefore, we are working with a proto-parser.

The first step for the proto-parser is to index the simplest lexical co-occurrences at a window of +1, -1, and the aggregate +/-1, for every word (type) in a small, random sub-sample of EEBO. Such a small window may or may not be terribly informative for lexical semantics – we’ll see – but it is a practical step in the prototyping process. From there, we will build up to measure more extensive co-occurrence data like that described above, and to incorporate the types of lexical relationships described in the previous post. We plan to build up in independent steps, assessing and evaluating the results of each step, what we can learn from it and what we can’t. We’ll document that process here on the blog. Ultimately, we aim to aggregate and weight these proximity measures as part of the broader goal of profiling each word in EEBO and ECCO.
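To give a sense of what indexing at +1, -1, and +/-1 might look like, here is a minimal Python sketch. It is not the proto-parser itself: the function name, the data structures, and the sample tokens are our assumptions for illustration, and tokenisation is taken as already done.

```python
from collections import Counter, defaultdict

def adjacency_index(tokens):
    """Index immediate neighbours for every word type.

    Returns three mappings from each type to a Counter of neighbours:
    left (-1), right (+1), and both (the aggregate +/-1).
    """
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        right[prev][nxt] += 1  # nxt sits at position +1 relative to prev
        left[nxt][prev] += 1   # prev sits at position -1 relative to nxt
    # The aggregate window simply adds the two directional counts.
    both = {w: left[w] + right[w] for w in set(tokens)}
    return left, right, both

tokens = "the king and the queen".split()
left, right, both = adjacency_index(tokens)
```

Keeping the left and right counts separate preserves word order, while the aggregate discards it. Having all three to hand allows the two views to be compared when assessing the results.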

Works Cited

Burgess, Curt and Kevin Lund. 1997. Modelling Parsing Constraints with High-dimensional Context Space. Language and Cognitive Processes 12 (2/3), 177–210.

Heylen, K., T. Wielfaert, D. Speelman, D. Geeraerts. 2015. Monitoring Polysemy: Word Space Models as a Tool for Large-Scale Lexical Semantic Analysis. Lingua 157, 153-72.

Landauer, Thomas K. and Susan T. Dumais. 1997. A Solution to Plato’s Problem: The Latent Semantic Analysis Theory of Acquisition, Induction, and Representation of Knowledge. Psychological Review 104 (2), 211-40.

Manning, Christopher and Hinrich Schuetze. 2001. Foundations of Statistical Natural Language Processing. Cambridge, MA: MIT Press.

Sahlgren, M. 2006. The Word-Space Model: Using Distributional Analysis to Represent Syntagmatic and Paradigmatic Relations Between Words in High-dimensional Vector Spaces. (Ph.D. dissertation), Department of Linguistics, Stockholm University.

Turney, Peter D. and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research 37, 141-188.
