A Theoretical Background to Distributional Methods (pt. 2 of 2)

Introduction

In the previous post, I presented the theoretical and philosophical underpinnings of distributional methods in corpus semantics. In this post, I touch on the practical background that has shaped these methods.

Means of analysis

Contemporary distributional methods emerged alongside Statistical Natural Language Processing (NLP) in the 1990s. Statistical NLP relies on probabilistic methods to represent language, annotate terms in texts, and perform a range of further tasks such as topic identification and information retrieval. By analysing what actually happens in huge numbers of texts, statistical NLP researchers not only describe naturally occurring language but also model it and make predictions about it. Corpus semantics is crucially linked to that intellectual development in applied science: contemporary work with proximity measures and distributional methods in corpus semantics often employs the same computational tools and techniques as statistical NLP. Along with the tools, the two fields share the underlying stance that a statistical and probabilistic account of language is meaningful. Arguably, other fields in the social sciences (such as psychology) and in the life sciences (such as evolutionary biology) have also been shaped by the rise of statistical and probabilistic methods of representation. Such methods represent an epistemology (and perhaps a discourse) that affects the types of knowledge that are sought and the types of observations that are made in a field.
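
To make that statistical stance a little more concrete, here is a minimal sketch in Python, using a tiny invented corpus purely for illustration, of the kind of counting that underlies both statistical NLP and distributional methods: tallying how often two words co-occur and comparing that against chance with pointwise mutual information, one common association measure (the corpus, word choices, and sentence-level window here are all my own assumptions, not a description of any particular study).

```python
import math
from collections import Counter
from itertools import combinations

# A tiny invented corpus (lists of tokens), for illustration only.
corpus = [
    "the strong tea was served with milk".split(),
    "she drank strong tea every morning".split(),
    "powerful computers process large corpora".split(),
    "the corpus contains many texts".split(),
]

word_counts = Counter()
pair_counts = Counter()

for sentence in corpus:
    word_counts.update(sentence)
    # Count each unordered pair of distinct words co-occurring in a sentence.
    for pair in combinations(sorted(set(sentence)), 2):
        pair_counts[pair] += 1

total_words = sum(word_counts.values())
total_pairs = sum(pair_counts.values())

def pmi(w1, w2):
    """Pointwise mutual information: how much more often w1 and w2 co-occur
    than we would expect if the two words were independent."""
    pair = tuple(sorted((w1, w2)))
    if pair_counts[pair] == 0:
        return float("-inf")
    p_joint = pair_counts[pair] / total_pairs
    p_independent = (word_counts[w1] / total_words) * (word_counts[w2] / total_words)
    return math.log2(p_joint / p_independent)

print(round(pmi("strong", "tea"), 2))  # positive: these words keep each other's company
```

The point is not the particular measure but the move it embodies: meaning-related claims are grounded in counts over observed usage, and 'keeping company' becomes something we can quantify and test against a chance baseline.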

Other links: Psycholinguistics and Discourse Analysis

The theoretical perspectives outlined in the previous post also link corpus semantics, proximity measures, and distributional methods to a larger theoretical framework that includes psycholinguistics and discourse analysis. The frequency of words in use, and the frequency of their co-occurrence in use, are hypothesised to be crucial in how humans learn and process lexical semantics. In very general terms, if we hear or read a word frequently, we're likely to learn that word more readily, and once we've learned it, we're likely to mentally process it more quickly. As noted in the previous post, corpora contain valuable frequency data for words in use in specific contexts. Today, corpora are often used as a counterpoint or complement to psycholinguistic research, and many researchers have attempted to model psycholinguistic processes computationally, including with distributional semantics.
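
As a small illustration of the sort of frequency data involved, the sketch below (again Python, with an invented ten-token 'corpus') computes the relative frequency of a word normalised per million tokens, which is the kind of figure psycholinguistic studies typically relate to ease of lexical access; the example text and function name are my own, purely illustrative choices.

```python
from collections import Counter

def freq_per_million(tokens, word):
    """Relative frequency of `word`, normalised per million tokens."""
    counts = Counter(tokens)
    return counts[word] / len(tokens) * 1_000_000

# Invented toy 'corpus'; real studies use corpora of millions or billions of tokens.
tokens = "she read the book and then she read another book".split()
print(freq_per_million(tokens, "book"))  # 2 of 10 tokens -> 200000.0
```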

There has recently been a tremendous rise in discourse analysis using corpora, and its roots go back at least as far as Sinclair and Stubbs. Discourse analysis itself emerges largely from continental philosophical traditions, particularly Foucault’s definition of discourses as ‘practices which systematically form the objects of which they speak’. These practices are often linguistic, and they are studied via linguistic acts: language in use in particular contexts. Such research connects the ontology of language as use with the ontology of meaning as encompassing all of the real-world contexts, topics, and so on that surround a term or a set of terms in use. Corpora allow researchers to ask: ‘Given that speakers or writers are discussing a particular term, what other terms do they also discuss, and how do such discussions (as practices or acts) define the objects of which they speak?’
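
That question can be operationalised, in a crude first pass, as collocation analysis: for every occurrence of a target term, count what else appears within a small window around it. The Python sketch below shows the basic idea; the example text, the window size, and the function name are hypothetical choices of mine, not a fixed method from the literature.

```python
from collections import Counter

def collocates(tokens, target, window=5):
    """Count the words appearing within `window` tokens of each occurrence
    of `target` -- a crude stand-in for asking what else is being talked
    about when the target term is in play."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == target:
            lo, hi = max(0, i - window), i + window + 1
            counts.update(tokens[lo:i] + tokens[i + 1:hi])
    return counts

# An invented example text, tokenised naively on whitespace.
text = ("the river flooded the valley again this spring and the river "
        "authority issued new flood warnings along the river")
tokens = text.lower().split()
print(collocates(tokens, "river", window=3).most_common(5))
```

In practice such raw counts are usually filtered (for function words) and weighted (with association measures like the PMI shown earlier), but even this simple tally captures the discourse-analytic intuition: the terms that cluster around a target term are part of how the practices of speaking about it form its object.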

In conclusion

In order to make sense of proximity measures and distributional methods, it is important to grasp both the practical background outlined above and the broader theoretical framework to which these methods relate (discussed in the previous post). The idea that a word is known by the company it keeps is by no means an a priori fact; it is premised on a framework of linguistics that developed during the 20th century in relation to concurrent developments in philosophy, technology, and the sciences in general.