Monthly Archives: February 2016

Dr Kris Heylen: Tracking Conceptual Change

In February 2016, Linguistic DNA hosted Dr Kris Heylen as an HRI Visiting Fellow, strengthening our links with KU Leuven’s Quantitative Lexicology and Variational Linguistics research group. This post outlines the scheduled public events.


Next week, the Linguistic DNA project welcomes visiting scholar (and HRI Visiting European Fellow) Dr Kris Heylen of KU Leuven.

About Kris:

Kris is a researcher based in KU Leuven’s Quantitative Lexicology and Variational Linguistics research group. His research focuses on the statistical modelling of lexical semantics and lexical variation, and more specifically on the introduction of distributional semantic models into lexicological research. Alongside this fundamental research on lexical semantics, he also has a strong interest in exploring the use of quantitative, corpus-based methods in applied linguistic research, with projects in legal translation, vocabulary learning and medical terminology.

During his stay in Sheffield, Kris will be working alongside the Linguistic DNA team, playing with some of our data, and sharing his experience of visualizing semantic change across time, as well as talking about future research collaborations with others on campus. There will be several opportunities for others to meet with Kris and hear about his work, including a lecture and workshop (details below). Both events are free to attend.

Lecture: 3 March

On Thursday 3rd March at 5pm, Kris will give an open lecture entitled:

Tracking Conceptual Change:
A Visualization of Diachronic Distributional Semantics


ABSTRACT (Kris writes):

In this talk, I will present an overview of statistical and corpus-based studies of lexical variation and semantic change, carried out at the research group Quantitative Lexicology and Variational Linguistics (QLVL) in recent years. As a starting point, I’ll take the framework developed in Geeraerts et al. (1994) to describe the interaction between concepts’ variable lexical expression (onomasiology) and lexemes’ variable meaning (semasiology). Next, I will discuss how we adapted distributional semantic models, as originally developed in computational linguistics (see Turney and Pantel 2010 for an overview), to the linguistic analysis of lexical variation and change.

With two case studies, one on the concept of immigrant in Dutch and one on positive evaluative adjectives in English (great, superb, terrific, etc.), I’ll illustrate how we have used visualisation techniques to investigate diachronic change in both the construal and the lexical expression of concepts.

All are welcome to attend this guest lecture, which takes place at the Humanities Research Institute (34 Gell Street). It is also possible to come for dinner after the lecture, though places may be limited and those interested are asked to get in touch with Linguistic DNA beforehand (by Tuesday 1st March).


Workshop: 7 March

On Monday 7th March, Kris will run an open workshop on visualizing language, sharing his own experiments with Linguistic DNA data. Participation is open to students and staff, but numbers are limited and advance registration is required. To find out more, please email Linguistic DNA (deadline: 4pm, Friday 4th March). Those at the University of Sheffield can reserve a place at the workshop via a Doodle poll.


Anyone who would like the opportunity to meet with Kris to discuss research collaborations should get in touch with him via Linguistic DNA as soon as possible so that arrangements can be made.

A Theoretical Background to Distributional Methods (pt. 2 of 2)

Introduction

In the previous post, I presented the theoretical and philosophical underpinnings of distributional methods in corpus semantics. In this post, I touch on the practical background that has shaped these methods.

Means of analysis

Contemporary distributional methods emerged alongside Statistical Natural Language Processing (NLP) in the 1990s. Statistical NLP relies on probabilistic methods to represent language, to annotate terms in texts, and to perform a range of further tasks such as topic identification or information retrieval. By analysing what actually happens in huge numbers of texts, statistical NLP researchers not only describe naturally occurring language, but also model it and make predictions about it. Corpus semantics is crucially linked to that intellectual development in applied science: contemporary work with proximity measures and distributional methods in corpus semantics often employs the same computational tools and techniques as statistical NLP. The tools are shared, and so is the underlying stance that a statistical and probabilistic account of language is meaningful. Arguably, other fields in the social sciences (such as psychology) and in the life sciences (such as evolutionary biology) have also been shaped by the rise of statistical and probabilistic methods of representation. Such methods embody an epistemology (and perhaps a discourse) that shapes the kinds of knowledge that are sought and the kinds of observations that are made in a field.
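To make the shared toolkit concrete, here is a minimal sketch of the core distributional move: build co-occurrence vectors for words from their contexts of use, then compare those vectors. The miniature corpus, the window size, and the use of Python are assumptions made purely for illustration; this is not the Linguistic DNA pipeline or any particular package’s implementation.

```python
# A toy illustration of the distributional idea: words that occur in similar
# contexts receive similar vectors. Corpus, window size, and similarity
# measure are illustrative assumptions, not any project's actual settings.
from collections import Counter, defaultdict
from math import sqrt

corpus = [
    "the queen addressed the parliament".split(),
    "the king addressed the parliament".split(),
    "the queen visited the city".split(),
    "the king visited the city".split(),
    "the dog chased the cat".split(),
]

WINDOW = 2  # count co-occurrences within two words on either side

cooc = defaultdict(Counter)
for sentence in corpus:
    for i, word in enumerate(sentence):
        for j in range(max(0, i - WINDOW), min(len(sentence), i + WINDOW + 1)):
            if j != i:
                cooc[word][sentence[j]] += 1

def cosine(u, v):
    """Cosine similarity between two sparse co-occurrence vectors."""
    dot = sum(u[w] * v[w] for w in set(u) & set(v))
    norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

# 'queen' and 'king' end up with identical context vectors in this toy corpus,
# so their similarity is 1.00; 'queen' and 'dog' overlap only through 'the'.
print(f"queen ~ king: {cosine(cooc['queen'], cooc['king']):.2f}")
print(f"queen ~ dog:  {cosine(cooc['queen'], cooc['dog']):.2f}")
```

In real corpus work the vectors are of course built from millions of running words, and raw counts are usually re-weighted with an association measure so that ubiquitous words such as ‘the’ do not dominate the comparison; the sketch only shows the shape of the technique.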

Other links: Psycholinguistics and Discourse Analysis

The theoretical perspectives outlined above also link corpus semantics, proximity measures, and distributional methods to a larger theoretical framework that includes psycholinguistics and discourse analysis. Frequency of words in use, and frequency of co-occurrence in use, are hypothesised to be crucial in human learning and processing of lexical semantics. In very general terms, if we hear or read a word frequently, we’re likely to learn that word more readily, and once we’ve learned it, we’re likely to mentally process it more quickly. As noted above, corpora contain valuable frequency data for words in use in specific contexts. Today, corpora are often used as a counterpoint or complement to psycholinguistic research, and many researchers have attempted to model psycholinguistic processes computationally, including with distributional semantics.

There has been a tremendous rise recently in discourse analysis using corpora, and its roots go back at least as far as Sinclair and Stubbs. Discourse analysis itself emerges largely from continental philosophical traditions, particularly Foucault’s definition of discourses as ‘practices which systematically form the objects of which they speak’. These practices are often linguistic, and are studied via linguistic acts, language in use in particular contexts. Such research connects the ontology of language as use with the ontology of meaning as encompassing all of the real-world contexts, topics, etc., that surround a term or a set of terms in use. Corpora allow researchers to ask: ‘Given that speakers or writers are discussing a given term, what other terms do the speakers or writers also discuss, and how do such discussions (as practices or acts) define the objects of which they speak?’
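As a rough sketch of how that question can be operationalised, the example below takes a node word and ranks the words that occur near it in a miniature corpus using pointwise mutual information (PMI), one standard association measure in corpus linguistics. The corpus, the node word, the window size, and the use of Python are invented for illustration only.

```python
# Rank the collocates of a node word by pointwise mutual information (PMI).
# The corpus and node word are invented; real studies use corpora of millions
# of words and typically threshold or significance-test the scores.
from collections import Counter
from math import log2

corpus = ("the migrant workers arrived yesterday , officials debated "
          "the migrant quota loudly , the weather stayed mild all week").split()

NODE, WINDOW = "migrant", 2

word_freq = Counter(corpus)
total = len(corpus)

# Count words appearing within WINDOW tokens of the node word.
near_node = Counter()
for i, w in enumerate(corpus):
    if w == NODE:
        for j in range(max(0, i - WINDOW), min(len(corpus), i + WINDOW + 1)):
            if j != i:
                near_node[corpus[j]] += 1

def pmi(collocate):
    """PMI of (NODE, collocate): observed co-occurrence relative to chance."""
    p_joint = near_node[collocate] / total
    p_node = word_freq[NODE] / total
    p_coll = word_freq[collocate] / total
    return log2(p_joint / (p_node * p_coll))

# Content words that cluster around the node rise to the top; the ubiquitous
# 'the' scores lower because it occurs everywhere, not just near the node.
for w in sorted(near_node, key=pmi, reverse=True):
    print(f"{w:10} PMI = {pmi(w):.2f}")
```

Read qualitatively over a real corpus, a ranking like this is one concrete way of seeing which other terms a discourse assembles around a given word, and hence how its ‘objects’ are formed in practice.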

In conclusion

In order to make sense of proximity measures and distributional methods, it is important to grasp the underlying practicalities outlined above, and the broader theoretical framework to which these methods relate (discussed in a previous post). The idea that a word is known by the company it keeps is by no means an a priori fact, but is premised on a framework of linguistics that developed during the 20th century in relation to concurrent developments in philosophy, technology, and the sciences in general.

A Theoretical Background to Distributional Methods (pt. 1 of 2)

Introduction

When discussing proximity data and distributional methods in corpus semantics, it is common for linguists to refer to Firth’s famous dictum: ‘you shall know a word by the company it keeps!’ In this post, I look a bit more closely at the theoretical traditions from which this approach to semantics in contexts of use has arisen, and at the theoretical links between this approach and other current work in linguistics. (For a synopsis of proximity data and distributional methods, see our previous posts.)

Language as Use

Proximity data and distributional evidence can only be observed in records of language use, such as corpora. The idea of investigating language in use reflects an ontology of language: the idea that language is language in use. If that basic definition is accepted, then the linguist’s job is to investigate language in use, and corpora constitute an excellent source of concrete evidence for language in use in specific contexts. This stance is central to perhaps the greatest rift in 20th-century linguistics: between, on the one hand, generative linguists who argued against evidence of use (as a distraction from the mental system of language), and, on the other hand, most other linguists, including those in pragmatics, sociolinguistics, Cognitive Linguistics, and corpus linguistics, who see language in use as the central object of study.

Dirk Geeraerts, in Theories of Lexical Semantics, provides a useful, concise summary of the theoretical background to distributional semantics using corpora. Explicitly, a valuation of language in use can be traced through the work of the linguistic anthropologist Bronisław Malinowski, who argued in the 1930s that language should only be investigated, and could only be understood, in contexts of use. Malinowski was an influence on Firth, who in turn influenced the next generation of British corpus linguists, including Michael Halliday and John Sinclair. Firth himself was already arguing in the 1930s that ‘the complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously’. A little later, Wittgenstein famously asserted in Philosophical Investigations that linguistic meaning is inseparable from use, an assertion quoted by Firth and echoed by the philosopher of language John Austin, who was seminal in the development of linguistic pragmatics. Austin approached language as speech acts: instances of use in complex, real-world contexts that could only be understood as such. The focus on language in use can subsequently be seen throughout later 20th-century developments in the fields of pragmatics and corpus research, as well as in sociolinguistics. Thus, some of the early theoretical work that facilitated the rise of corpus linguistics and distributional methods can first be seen in the spheres of philosophy and even anthropology.

Meaning as Contingent, Meaning as Encyclopedic

In order to argue that lexical co-occurrence in use is a source of information about meaning, we must also accept a particular definition of meaning. Traditionally, it was argued that there is a neat distinction between constant meaning and contingent meaning. Constant meaning was viewed as the meaning related to the word itself, while contingent meaning was viewed as not related to the word itself, but instead related to broader contexts of use, including the surrounding words, the medium of communication, real-world knowledge, connotations, implications, and so on. Contingent meaning was by definition contributed by context; context is exactly what is examined by proximity measures and distributional methods. So although distributional methods are today generally employed to investigate semantics, they in fact probe an element of meaning that was traditionally considered not the central element of semantics but a peripheral one.

In relation to this emphasis on contingent meaning, corpus linguistics has developed alongside the theory of encyclopedic semantics. In encyclopedic semantics, it is argued that any dividing line between constant and contingent meaning is arbitrary. Thus, corpus semanticists who use proximity measures and distributional approaches do not often argue that they are investigating contingent meaning. Instead, they may argue that they are investigating semantics, and that semantics in its contemporary (encyclopedic) sense is much broader than in its more traditional sense.

Distributional methods therefore represent not only an ontology of language as use, but also an ontology of semantics as including what was traditionally known as contingent meaning.

To be continued…

Having discussed the theoretical and philosophical underpinnings of distributional methods here, I will go on to discuss the practical background of these methods in the next blog post.