
A theoretical background to distributional methods (pt. 1 of 2)

Introduction

When discussing proximity data and distributional methods in corpus semantics, it is common for linguists to refer to Firth’s famous “dictum”, ‘you shall know a word by the company it keeps!’ In this post, I look a bit more closely at the theoretical traditions from which this approach to semantics in contexts of use has arisen, and the theoretical links between this approach and other current work in linguistics. (For a synopsis of proximity data and distributional methods, see previous posts here, here, and here.)

Language as Use

Proximity data and distributional evidence can only be observed in records of language use, like corpora. The idea of investigating language in use reflects an ontology of language—the idea that language is language in use. If that basic definition is accepted, then the linguist’s job is to investigate language in use, and corpora constitute an excellent source of concrete evidence for language in use in specific contexts. This prospect is central to perhaps the greatest rift in 20th century linguistics: between, on the one hand, generative linguists who argued against evidence of use (as a distraction from the mental system of language), and, on the other hand, most other linguists, including those in pragmatics, sociolinguistics, Cognitive Linguistics, and corpus linguistics, who see language in use as the central object of study.

Dirk Geeraerts, in Theories of Lexical Semantics, provides a useful, concise summary of the theoretical background to distributional semantics using corpora. Explicitly, a valuation of language in use can be traced through the work of linguistic anthropologist Bronislaw Malinowski, who argued in the 1930s that language should only be investigated, and could only be understood, in contexts of use. Malinowski was an influence on Firth, who in turn influenced the next generation of British corpus linguists, including Michael Halliday and John Sinclair. Firth himself was already arguing in the 1930s that ‘the complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously’. Just a bit later, Wittgenstein famously asserted in Philosophical Investigations that linguistic meaning is inseparable from use, an assertion quoted by Firth, and echoed by the philosopher of language John Austin, who was seminal in the development of linguistic pragmatics. Austin approached language as speech acts, instances of use in complex, real-world contexts, that could only be understood as such. The focus on language in use can subsequently be seen throughout later 20th-century developments in the fields of pragmatics and corpus research, as well as in sociolinguistics. Thus, some of the early theoretical work that facilitated the rise of corpus linguistics, and distributional methods, can first be seen in the spheres of philosophy and even anthropology.

Meaning as Contingent, Meaning as Encyclopedic

In order to argue that lexical co-occurrence in use is a source of information about meaning, we must also accept a particular definition of meaning. Traditionally, it was argued that there is a neat distinction between constant meaning and contingent meaning. Constant meaning was viewed as the meaning related to the word itself, while contingent meaning was viewed as not related to the word itself, but instead related to broader contexts of use, including the surrounding words, the medium of communication, real-world knowledge, connotations, implications, and so on. Contingent meaning was by definition contributed by context; context is exactly what is examined in proximity measures and distributional methods. So distributional methods are today generally employed to investigate semantics, but they are in fact used to investigate an element of meaning that was often not traditionally considered the central element of semantics, but instead a peripheral element.

In relation to this emphasis on contingent meaning, corpus linguistics has developed alongside the theory of encyclopedic semantics. In encyclopedic semantics, it is argued that any dividing line between constant and contingent meaning is arbitrary. Thus, corpus semanticists who use proximity measures and distributional approaches do not often argue that they are investigating contingent meaning. Instead, they may argue that they are investigating semantics, and that semantics in its contemporary (encyclopedic) sense is a much broader thing than in its more traditional sense.

Distributional methods therefore represent not only an ontology of language as use, but also an ontology of semantics as including what was traditionally known as contingent meaning.

To be continued…

Having discussed the theoretical and philosophical underpinnings of distributional methods here, I will go on to discuss the practical background of these methods in the next blog post.


Operationalising concepts (Manifesto pt. 3 of 3)


Properties of concepts, from Susan Fitzmaurice’s presentation

This blog post completes our series of three extracts from Susan Fitzmaurice’s paper on “Concepts and Conceptual Change in Linguistic DNA”. (See parts 1 and 2.)

The supra-lexical approach to the process of concept recognition that I’ve described depends upon an encyclopaedic perspective on semantics (e.g. cf. Geeraerts, 2010: 222-3). This is fitting as ‘encyclopaedic semantics is an implicit precursor to or foundation of most distributional semantics or collocation studies’ (Mehl, p.c.). However, such studies do not typically pause to model or theorise before conducting analysis of concepts and semantics as expressed lexically. In other words, semasiological (and onomasiological) studies work on the premise of ready-made or at least ready lexicalised concepts, and proceed from there. This means that although they depend upon the prior application of encyclopaedic semantics, they themselves do not need to model or theorise this semantics because it belongs to the cultural messiness that yields the lexical expressions that they then proceed to analyse.

For LDNA, concepts are not discrete or componential lexical semantic meanings; neither are they abstract or ideal. Instead, they consist of associations of lexical/phrasal/constructional semantic and pragmatic meanings in use.
This encyclopaedic perspective suggests the following operationalisation of a concept for LDNA:

  1. Concepts resemble encyclopaedic meanings (which are temporally and culturally situated chunks of knowledge about the world expressed in a distributed way) rather than discrete or componential meanings. [This coincides with non-modular theories of mind, which adopt a psychological approach to concepts.]
  2. Concepts can be expressed in texts by (typically a combination of) words, phrases, constructions, or even by implicatures or invited inferences (and possibly by textual absences).
  3. Concepts are traceable in texts primarily via significant syntagmatic (associative) relations (of words/phrases/constructions/meanings) and secondarily via significant paradigmatic (alternate) relations (of words/phrases/constructions/meanings).
  4. A concept in a given historical moment might not be encapsulated in any observed word, phrase, or construction, but might instead only be observable via a complete set of words, phrases, or constructions in syntagmatic or paradigmatic relation to each other.

It is worth noting, however, that concept recognition is particularly difficult (for the automatic processes built into LDNA) because it ordinarily depends upon the level of cultural literacy possessed by a reader. This is a quality which we cannot incorporate as a process, but which we can take into account by testing distant reading through close reading.

As well as being encyclopaedic, our approach is also experiential, in that the conceptual structure of early modern discourse is a reflection of the way early modern people experienced the world around them. That discourse presents a particular subjective view of the world with the hierarchical network of preferences which emerges as a network of concepts in discourse. In this way we also assume a perspectival nature of concept organisation.

Concluding remarks: Testing and tracking conceptual change across time and style

All being well, if we succeed in visualising the results of an iterative and developing set of procedures to inspect the data from these large corpora, we hope to be able to discern and locate the emergence of concepts in the universe of early modern English print. A number of questions arise about where and how these will show up.

For instance, following our hypothesis, will we see the cementation of a concept in the persistent co-occurrence in particular contexts of candidate conjuncts (both binomials and alternates), bigrams, and ultimately, ‘keywords’? (e.g. ‘man of business’ → ‘businessman’ in late Modern English newspapers)
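Purely as an illustration of the kind of counting this hypothesis implies (this is a toy sketch, not part of the project's actual pipeline; the corpus format and the phrase/keyword pair are taken from the example above but otherwise assumed), one could track a candidate multi-word expression and its hypothesised lexicalised successor decade by decade:

```python
from collections import defaultdict

def track_lexicalisation(docs, phrase=("man", "of", "business"), keyword="businessman"):
    """Count a candidate multi-word expression and its hypothesised
    lexicalised successor per decade. `docs` is an iterable of
    (year, tokens) pairs, where tokens is a lower-cased word list."""
    phrase_counts = defaultdict(int)
    keyword_counts = defaultdict(int)
    n = len(phrase)
    for year, tokens in docs:
        decade = (year // 10) * 10
        keyword_counts[decade] += tokens.count(keyword)
        phrase_counts[decade] += sum(
            1 for i in range(len(tokens) - n + 1)
            if tuple(tokens[i:i + n]) == phrase
        )
    return phrase_counts, keyword_counts
```

A rising keyword count alongside a falling phrase count, in this sketch, would be one crude signal of the ‘cementation’ described above.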

And, as part of the notion of context, it is worth considering the role of discourse genre in the emergence of a concept and in conceptual change. For instance, if it is the case that a concept emerges, not as a keyword, but in the form of an association of expressions that functions as a loose paraphrase, is this kind of process more likely to occur in a specific discourse genre than in general discourse? In other words, is it possible that technical or specialist discourses will be the locus of new concepts, concepts which might diffuse gradually into public and more general ones? (e.g. dogma, law, science → newspapers, narrative, etc.)

What we hope to do is to make our approach manifest and our results visual. For instance, the emergence of a concept might be envisaged as clusters of texts rising up on the terrain representing a certain feature. We should also remember that concepts might not just change gradually over time, rising and falling across the terrain; there might instead be islands of certain features that appear in distant time periods, disparate genres, and sub-genres. All of that can be identified by the computer, but we have to make sense of it as close readers afterwards.

References

Geeraerts, Dirk. 2010. Theories of Lexical Semantics. Oxford: OUP.

Defining the content of a concept from below (Manifesto pt. 2 of 3)

This blog post features the second of three extracts from Susan Fitzmaurice’s paper on “Concepts and Conceptual Change in Linguistic DNA”. (See previous post.)

Before tackling the problem of actually defining the content of a concept ‘from below’, we need to imagine ourselves into the position of being able to recognize the emergence of material that is a candidate for being considered a concept. Let’s briefly consider the question of ‘when is a concept’; in other words, how will we recognize something that is relevant, resonant and important in historical, cultural and political terms for our periods of interest?

In a manner that is not trivial, we want our research process to perform the discovery work of an innocent reader, a reader who approaches a universe of discourse without an agenda, but with a will to discover what the text yields up as worthy of notice. This innocent reader is an ideal reader of course; as humans are pattern finders, pattern matchers and meaning makers, it is virtually impossible to imagine a process that is truly ab initio. Indeed, a situation in which the reader is not primed to notice specific features, characteristics or meanings by the cotext or broader context is rare.

The aim is for our processes to imitate the intuitive, intelligent scanning that human readers perform as they survey the universe of discourse in which they are interested (literary and historical documents). We assume that readers gradually begin to notice patterns, perhaps prominent combinations or associations, patterns that appear in juxtaposition in some places and in connection in others (Divjak & Gries, 2012). The key process is noticing, in the text, the formation of ideas that gather cohesion and content in linguistic expression. We hypothesize that in the process of noticing, the reader begins to attribute increasing weight to the meanings they locate in the text. One model for this hypothesis is the experience of the foreign language learner who reads a text with her attention drawn to the expressions she recognises and can construe.

The principal problem posed by our project is therefore to extract from the discourse material that we might be able to discern as potential concepts. In other words, we aim to identify a concept from the discourse inwards by inspecting the language instead of defining a concept from its content outward (i.e. starting with a term and discerning its meaning). If we move from the discourse inwards, the meanings that we attribute weight to may be implicit and distributed across a stretch of text, in a text window.

Extract from ‘Present State of France….’ (Richard Wolley, 1687). (EEBO-TCP A27526)

That is, the meanings we notice as relevant might not be encapsulated in individual lexical items or character strings within a simple syntactic frame. This recognition requires that we resist the temptation to treat a word or a character string as coterminous with a concept. Indeed, the more we associate relevance with, say, the frequency of a particular word or character string in a sub-corpus, the less likely we are to be able to look beyond the word as an index of a concept. To remain open and receptive in the process of candidate concept recognition, we need to expand the range of the things we inspect on the one hand and the scope of the context we read on the other.

The linguistic material that will be relevant to the identification of a concept will consist of a combination or set of expressions in association that occur in a concentrated fashion in a stretch of text. Importantly, this material may consist of lexical items, phrases, sentences, and may be conveyed metaphorically as well as literally, and likely pragmatically (by implicature and invited inference) as well as semantically. If the linguistic elaboration (definition, paraphrase, implication) of a concept precedes the lexicalization of a concept, it is reasonable to assume that the appearance of regularly and frequently occurring expressions in degrees of proximity within a window will aid the identification of a concept.
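By way of illustration only (the window size and threshold below are arbitrary, and the eventual LDNA processes will be far more sophisticated), a minimal sketch of gathering expressions that recur together within a fixed window of running text might look like this:

```python
from collections import Counter
from itertools import combinations

def windowed_associations(tokens, window=50, min_pairs=3):
    """Count pairs of word types that appear together within a fixed
    window of running text, returning pairs that recur at least
    `min_pairs` times across the windows."""
    pair_counts = Counter()
    for start in range(0, len(tokens), window):
        span = sorted(set(tokens[start:start + window]))
        for a, b in combinations(span, 2):
            pair_counts[(a, b)] += 1
    return [(pair, c) for pair, c in pair_counts.most_common() if c >= min_pairs]
```

The recurrent pairs returned by such a procedure are the raw material from which candidate concepts might then be read, closely and in context.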

The scope of the context in which a concept appears is likely to be greater than the phrase or sentence that is the context for the keyword that we customarily consider in collocation studies. This context is akin to the modern notion of the paragraph, or, the unit of discourse which conventionally treats a topic or subject with the commentary that makes up the content of the paragraph. The stretch of text relevant for the identification of conceptual material may thus amount to a paragraph, a page, or a short text.

The linguistic structure of a concept has been shown to be built both paradigmatically (via synonymy) and syntagmatically (via lexical associations, syntax, paraphrase). For our purposes, given that the task entails picking up clues to the construction of concepts from the linguistic material in the context, where ‘context’ is defined pretty broadly, paradigmatic relations are less likely to be salient than syntagmatic relations like paraphrase, vagueness and association, and these in turn are perhaps more telling than predictable relations like antonymy and polysemy.

See the final post in this Manifesto series.

References

Divjak, Dagmar & Gries, Stefan Th. (eds). 2012. Frequency effects in language learning and processing (Vol. 1). Berlin: De Gruyter.

Susan Fitzmaurice at DH & Conceptual Change event (photo: Mikko Tolonen)

A manifesto for studying conceptual change (Manifesto pt. 1 of 3)

As those who follow our Twitter account will know, Linguistic DNA’s principal investigator, Susan Fitzmaurice, was among the invited speakers at the recent symposium on Digital Humanities & Conceptual Change (organised by Mikko Tolonen, at the University of Helsinki). It was an opportunity to set out the distinctive approach being taken by our project and the theoretical understanding of concepts that underpins it. What follows is the first of three blog posts based on extracts from the paper, aka the Linguistic DNA ‘manifesto’. Susan writes:

Linguistic DNA’s goal is to understand the ways in which the concepts (or paradigmatic terms) that define modernity emerge in the universe of Early Modern discourse. The methodology we are committed to developing, testing and using, i.e. the bottom-up querying of a universe of printed discourse in English, demands that we take a fresh look at the notion of a concept and its content. So how will we operationalise a concept, and how will we recognise a concept in the data?

Defining the content of a concept from above

Historians and semanticists alike tend to start by identifying a set of key concepts and pursue their investigation by using a paradigmatic approach. For semanticists, this entails identifying a ‘concept’ in onomasiological terms as a bundle of (near-)synonyms that refer to aspects of the semantic space occupied by a concept in order to chart conceptual change in different periods and variation in different lects.

Historians, too, have identified key concepts through keywords or paradigmatic terms, which they then explore through historiography and the inspection of historical documents, seeking the evidence that underpins the emergence of particular terms and the forces and circumstances in which these change (Reinhart Koselleck’s Begriffsgeschichte or Quentin Skinner’s competing discourses). Semanticists and historians alike tend to approach concepts in a primarily semasiological way, for example, Anna Wierzbicka (2010) focuses on the history of evidence, and Naomi Tadmor (1996) uses ‘kin’ as a starting point for exploring concepts based on the meanings of particular words.

Philosophers of science, who are interested in the nature of conceptual change as driven or motivated by scientific inquiry and technological advances, may see concepts and conceptual change differently. For example, Ingo Brigandt (2010) argues that a scientific concept consists of a definition, its ‘inferential role’ or ‘reference potential’ and the epistemic goal pursued by the term’s use in order to account for the rationality of semantic change in a concept. So the change in the meaning of ‘gene’, from the classical gene of the 1910s and 1920s, which is about inheritance, to the molecular gene of the 1960s and 1970s, which is about characteristics, can be shown to be motivated by the changing nature of the explanatory task required of the term ‘gene’. In such a case, the goal is to explain the way in which the scientific task changes the meaning associated with the terms, rather than exploring the change itself. Thus Brigandt tries to make it explicit that

‘apart from a changing meaning (inferential role) [the concept also has] an epistemic goal which is tied to a concept’s use and which is the property setting the standards for which changes in meaning are rational’ (2010: 24).

His understanding of the pragmatics-driven structure of a concept is a useful basis for the construction of conceptual change as involving polysemy through the processes of invited inference and conversational implicature (cf. Traugott & Dasher, 2002; Fitzmaurice, 2015).

In text-mining and information retrieval work in biomedical language processing, as reported in Genome Biology, concept recognition is used to extract information about gene names from the literature. William Baumgartner et al. (2008) argue that

‘Concepts differ from character strings in that they are grounded in well-defined knowledge resources. Concept recognition provides the key piece of information missing from a string of text—an unambiguous semantic representation of what the characters denote’ (2008: S4).

Admittedly, this is a very narrow definition, but given the range of different forms and expressions that a gene or protein might have in the text, the notion of concept recognition needs to go well beyond the character string and ‘identification of mentions in text’. So they developed ‘mention regularization’ procedures and disambiguation techniques as a basis for concept recognition involving ‘the more complex task of identifying and extracting protein interaction relations’ (Baumgartner et al. 2008: S7-15).

In LDNA, we are interested in investigating what people (in particular periods) would have considered to be emerging and important cultural and political concepts in their own time by exploring their texts. This task involves, not identifying a set of concepts in advance and mining the literature of the period to ascertain the impact made by those concepts, but querying the literature to see what emerges as important. Therefore, our approach is neither semasiological, whereby we track the progress and historical fortunes of a particular term, such as marriage, democracy or evidence, nor is it onomasiological, whereby we inspect the paradigmatic content of a more abstract, yet given, notion such as TRUTH or POLITY, etc. We have to take a further step back, to consider the kind of analysis that precedes the implementation of either a semasiological or an onomasiological study of the lexical material we might construct as a concept (e.g. as indicated by a keyword).

See the next post in this Manifesto series.

Distributional Semantics II: What does distribution tell us about semantic relations?

In a previous post, I outlined a range of meanings that have been discussed in conjunction with distributional analysis. The Linguistic DNA team is assessing what exactly it can determine about semantics based on distributional analysis: from encyclopaedic meaning to specific semantic relations. In my opinion, the idea that distributional data indicates ‘semantics’ has generally been a relatively vague one: what exactly about ‘semantics’ is indicated? In this post, I’d like to clarify what distribution can tell us about semantic relations in particular, including synonymy, hyponymy, and co-hyponymy.

In the Natural Language Processing (NLP) sphere, numerous studies have tested the effectiveness of distributional data in identifying semantic relations. Turney and Pantel (2010) provide a useful survey of such studies, many of which involve machine learning, and computer performance on synonymy tests including those found on English language exams. Examples of success on synonymy tests have employed windows of anything from +/-4 words up to +/-150 words, but such studies tend not to test various approaches against each other, and they rarely dissect the notion of synonymy, much less co-hyponymy or other semantic relations.

Only a few studies have tested distributional methods as indicators of specific semantic relations. The Quantitative Lexicology and Variational Linguistics (QLVL) team at KU Leuven has addressed this problem in several papers. For example, Peirsman et al. (2007) looked at evidence for synonymy, hyponymy, and co-hyponymy in proximity data for Dutch. (A hyponym is a word whose meaning is a member of a larger category – for example, a crow and a robin are both types of bird, so crow and robin are both hyponyms of bird, and crow and robin are co-hyponyms of each other, but they are not synonyms of each other.) Peirsman et al. looked at raw proximity measures as well as proximity measures that incorporate syntactic dependency information. Their findings demonstrate that in Dutch, synonymy and hyponymy are more readily indicated by proximity analyses that include syntactic dependency. On the other hand, they show that co-hyponymy is most effectively evidenced by raw proximity measures that do not include syntactic information. This is a startling result, with fascinating implications for linguistic theory. Why should ignoring syntactic information provide better measures of co-hyponymy? Might English be similar? How about Early Modern English?
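To make the contrast between ‘raw proximity’ and syntactically informed measures concrete, here is a minimal sketch (not Peirsman et al.’s implementation) of the bag-of-words variant: each word is represented by the words found within a few positions of it, and two words are compared by the cosine of their context vectors. A dependency-based model would replace the window loop with (relation, head) features taken from a parser.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=4):
    """Bag-of-words context vectors: each target word is represented
    by counts of the words occurring within +/-`window` positions."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, target in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vectors[target][tokens[j]] += 1
    return vectors

def cosine(v1, v2):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v1[w] * v2[w] for w in set(v1) & set(v2))
    norm = math.sqrt(sum(c * c for c in v1.values())) * math.sqrt(sum(c * c for c in v2.values()))
    return dot / norm if norm else 0.0
```

Words whose context vectors have a high cosine similarity are the ‘distributionally similar’ words whose semantic relations (synonymy, hyponymy, co-hyponymy, or none) the studies above set out to classify.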

I think it is important to note that in Peirsman et al. (ibid.), 6.3% of words that share similar distributional characteristics with a given word, or node, are synonyms of that node, and 4.0% are hyponyms of that node. Put differently, about 94% of words identified by distributional analysis aren’t synonyms, and around 70% of the words elicited in these measures are not semantically related to the node at all. Experienced corpus semanticists will not be surprised by this. But what happens to the majority of words, which aren’t related in any clear way? A computer algorithm will output all significant co-occurrences. Often, the co-occurrences that are not intuitively meaningful are quietly ignored by the researcher. It seems to me that if we are going to ignore such outputs, we must do so explicitly and with complete transparency. But this raises bigger questions: If we trust our methods, why should we ignore counterintuitive outputs? Or are these methods valuable simply as reproducible heuristics? I would argue that we should be transparent about our perspective on our own methods.
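One way of being explicit about those ‘leftover’ words, sketched here under the assumption that a reference lexicon of synonyms is available, is simply to report both the matched and the unmatched neighbours rather than silently dropping the latter:

```python
def audit_neighbours(neighbours, gold_synonyms):
    """Split a node's distributional neighbours into those a reference
    lexicon records as synonyms and those it does not, so that the
    leftover words are reported rather than quietly ignored."""
    neighbours, gold_synonyms = set(neighbours), set(gold_synonyms)
    matched = neighbours & gold_synonyms
    unmatched = neighbours - gold_synonyms
    precision = len(matched) / len(neighbours) if neighbours else 0.0
    return precision, sorted(matched), sorted(unmatched)
```

Reporting the unmatched list alongside the precision figure makes the researcher’s decision to set those words aside a visible, criticisable step rather than an invisible one.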

Also from QLVL, Heylen et al. (2008a) tests which types of syntactic dependency relations are most effective at indicating synonymy in Dutch nouns, and finds that Subject and Object relations most consistently indicate synonymy, but that adjective modification can give the best (though less consistent) indication of synonymy. In fact, adjective modification can be even better than a combined method using adjective modification and Subject/Object relations. Again, the findings are startling, and fascinating—why would the consideration of Subject/Object relations actually hinder the effective use of adjective modification as evidence of synonymy? The answer is not entirely clear. In a comparable study, Van der Plas and Bouma (2005) found Direct Object relations and adjective modification to be the most effective relations in identifying synonymy in Dutch. Unlike Heylen et al.’s (2008a) findings, Van der Plas and Bouma (2005) found that combining dependency relations improved synonym identification.

Is proximity data more effective in determining the semantics of highly frequent words? Heylen et al. (2008b) showed that in Dutch, high frequency nouns are more likely to collocate within +/-3 words with nouns that have a close semantic similarity, in particular synonyms and hyponyms. Low frequency nouns are less likely to do so. In addition, in Dutch, syntactic information is the best route to identifying synonymy and hyponymy overall, but raw proximity information is in fact slightly better at retrieving synonyms for medium-frequency nouns. This finding, then, elaborates on the finding in Peirsman et al. (2007; above).

How about word class? Peirsman et al. (2008) suggest, among other things, that in Dutch, a window of +/-2 words best identifies semantic similarity for nouns, while +/-4 to 7 words is most effective for verbs.

For Linguistic DNA, it is important to know exactly what we can and can’t expect to determine based on distributional analysis. We plan to employ distributional analysis using a range of proximity windows as well as syntactic information. The team will continue to report on this question as we move forward.

*Castle Arenberg, in the photo above, is part of KU Leuven, home of QLVL and many of the studies cited in this post. (Credit: Juhanson. Licence: CC BY-SA 3.0.)

References

Heylen, Kris; Peirsman, Yves; Geeraerts, Dirk. 2008a. Automatic synonymy extraction: A Comparison of Syntactic Context Models. In Verberne, Suzan; van Halteren, Hans; Coppen, Peter-Arno (eds), Computational linguistics in the Netherlands 2007. Amsterdam: Rodopi, 101-16.

Heylen, Kris; Peirsman, Yves; Geeraerts, Dirk; Speelman, Dirk. 2008b. Modelling word similarity: An evaluation of automatic synonymy extraction algorithms. In: Calzolari, Nicoletta; Choukri, Khalid; Maegaard, Bente; Mariani, Joseph; Odjik, Jan; Piperidis, Stelios; Tapias, Daniel (eds), Proceedings of the Sixth International Language Resources and Evaluation. Marrakech: European Language Resources Association, 3243-49.

Peirsman, Yves; Heylen, Kris; Speelman, Dirk. 2007. Finding semantically related words in Dutch. Co-occurrences versus syntactic contexts. In Proceedings of the 2007 Workshop on Contextual Information in Semantic Space Models: Beyond Words and Documents, 9-16.

Peirsman, Yves; Heylen, Kris; Geeraerts, Dirk. 2008. Size matters: tight and loose context definitions in English word space models. In Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics, 34-41.

Turney, Peter D. and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research 37, 141-188.

van der Plas, Lonneke and Gosse Bouma. 2005. Syntactic Contexts for finding Semantically Similar Words. In Proceedings of CLIN 04.

Family and Friends in 18th century England (book cover)

Naomi Tadmor: Semantic analysis of keywords in context

On 30 October, Prof. Naomi Tadmor led a workshop at the University of Sheffield, hosted by the Sheffield Centre for Early Modern Studies. In what follows, I briefly summarise Tadmor’s presentation, and then provide some reflections related to my own work, and to Linguistic DNA.

The key concluding points that Tadmor forwarded are, I think, important for any work with historical texts, and thus also crucial to historical research:

  • Understanding historical language (including word meaning) is necessary for understanding historical texts
  • To understand historical language we must analyse it in context.
  • Analysing historical language in context requires close reading.

Whether we identify as historians, linguists, corpus linguists, literary scholars, or otherwise, we would do well to keep these points in mind.

Tadmor’s take on historical keywords

Tadmor’s specific arguments in the master class focused on kinship terms. In Early Modern English (EModE), there was a broad array of referents for kinship terms such as brother, mother, father, sister, and associated terms like family and friend, which are not likely to be intuitive to a speaker of Present Day English (PDE). Evidence shows, for example, that family often referred to all of the individuals living in a household, including servants, to the possible exclusion of biological relations living outside of the household. The paper Tadmor asked us to read in advance (first published in 1996), supplemented with other examples at the masterclass, provides extensive illustrations of the nuance of family and other kinship terms.

In EModE, there was also a narrow range of semantic or pragmatic implications related to kinship terms: these meanings generally involved social expectations, social networks, or social capital. So, father could refer to ‘biological father’ or ‘father-in-law’ (or even ‘King’), and implied a relationship of social expectation (rather than, for example, a relationship of affection or intimacy, as might be implied in PDE).

By identifying both the array of referents and the implications or senses conveyed by these kinship terms, Tadmor provides a thorough illustration of the terms’ lexical semantics. We can see this method as being motivated by historical questions (about the nature of Early Modern relationships); driven in its first stage by lexicology (insofar as it begins by asking about words, their referents, and senses); and then, in a final stage, employing lexicological knowledge to analyse texts and further address the initial historical questions. Tadmor avoids circularity by using one data set (in her 1996 paper) to identify a hypothesis regarding lexical semantics, and another data set to test her hypothesis. What do these observations about lexical semantics tell us about history? As Tadmor notes, it is by identifying these meanings that we can begin to understand categories of social actions and relationships, as well as motivations for those actions and relationships. Perhaps more fundamentally, it is only by understanding semantics in historical texts that we can begin to understand the texts meaningfully.

A Corpus Linguist’s take on Tadmor’s methods

Reflecting on Tadmor’s talk, I’m reminded of the utility of the terms semasiology and onomasiology. In semantic research, semasiology is an approach which examines a term as an object of inquiry, and proceeds to identify the meanings of that word. Onomasiology is an approach which begins with a meaning, and then identifies the various terms for expressing it. Tadmor’s method is largely semasiological, insofar as it looks at the meanings of the term family and other kinship terms. This approach begins in a relatively straightforward way—find all of the instances of the word (or lemma), and you can then identify its various senses. The next step is more difficult: how do you distinguish its senses? In linguistics, a range of methods is available, with varying degrees of rigour and reproducibility, and it is important that these methods be outlined clearly. Tadmor’s study is also onomasiological, as she compares the different ways (often within a single text) of referring to a given member of the household family. This approach is less straightforward: how do you identify each time a member of the family is referred to? Again, a range of methods is available, each with its own advantages and disadvantages. A clear statement and justification of the choice of method renders any study more rigorous. In my experience, the systematicity of thinking in terms of onomasiology and semasiology is useful in developing a systematic and rigorous study.

Semasiology and onomasiology allow us to distinguish types of study and approaches to meaning, which can in turn help render our methods more explicit and clear. Similarly, distinguishing editorially between a word (e.g. family) and a meaning (e.g. ‘family’) is useful for clarity. Indeed, thinking methodologically in terms of semasiology and onomasiology encourages clarity of expression editorially regarding terms and meanings. In Tadmor’s 1996 paper, double quotes (e.g. “family”) are used to refer to either the word family or the meaning ‘family’ at various points. At times, such a paper could be rendered more clear, it seems to me, by adopting consistent editorial conventions like those used in linguistics (e.g. quotes or all caps for meanings, italics for terms). The distinction between a term and a meaning is by nature not always clear or certain: that difficulty is all the more reason for journals to adhere to rigorously defined editorial conventions.

From the distinction between terms and concepts, we can move to the distinction between senses and referents. It is important to be explicit both about changes in referent and changes in sense, when discussing semasiological change. For example, as historians and linguists, we must be sure that when we identify changes in a word’s referents (e.g. father referring to ‘father-in-law’), we also identify whether there are changes in its sense (e.g. ‘a relationship of social expectation’ or ‘a relationship of affection and intimacy’). When Thomas Turner refers to his father-in-law as father, he seems to be using the term, as identified by Tadmor, in its Early Modern sense implying ‘a relationship of social expectation’ rather than in the possible PDE sense implying ‘a relationship of affection and intimacy’. The terms referent and sense allow for this distinction, and are useful in practice when conducting this kind of semantic analysis.

Of course, if a term becomes polysemous, it can be applied to a new range of referents, with a new sense, or even with new implicatures or connotations. For example, we can imagine (perhaps counterfactually) a historical development in which family might have come to refer to cohabitants who were not blood relations. At the same time, in referring to those cohabitants who were not blood relations, family might have ceased to imply any kind of social expectation, social network, or social capital. That is, it’s possible for both the referent and the sense to change. In this case, as Tadmor has shown, that doesn’t seem to be what’s happened, but it’s important to investigate such possible polysemies.

Future possibilities: Corpus linguistics

As a corpus linguist, I’d be interested in investigating Tadmor’s semantic findings via a quantitative onomasiological study, looking more closely at selection probabilities. Such a study could ask research questions like:

  • Given that an Early Modern writer is expressing ‘nuclear family’, what is the probability of using term a, b, etc., in various contexts?
  • Given that a writer is expressing ‘household-family’, what is the probability of using term a, b, etc., in various contexts?
  • Given that a writer is expressing ‘spouse’s father’ or ‘brother’s sister’, etc., what is the probability of using term a, b, etc., in various contexts?

These onomasiological research questions (unlike semasiological ones) allow us to investigate logical probabilities of selection processes. This renders statistical analyses more robust. Changes in probabilities of selection over time are a useful illustration of onomasiological change, which is an essential part of semantic change.
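As a sketch of what such a quantitative study might compute (the data format is an assumption on my part: manually annotated instances pairing an expressed meaning with the term selected for it), the selection probabilities reduce to relative frequencies per meaning:

```python
from collections import Counter, defaultdict

def selection_probabilities(instances):
    """`instances` is an iterable of (meaning, term) pairs, e.g.
    ('household-family', 'family'). Returns, for each meaning, the
    probability of each term being selected to express it."""
    counts = defaultdict(Counter)
    for meaning, term in instances:
        counts[meaning][term] += 1
    return {
        meaning: {term: c / sum(terms.values()) for term, c in terms.items()}
        for meaning, terms in counts.items()
    }
```

Computed separately for successive periods (or for different contexts), these probabilities give the kind of onomasiological change curves described above.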

And for Linguistic DNA?

For Linguistic DNA, I see (at least) two major questions related to Tadmor’s work:

  1. Can automated distributional analysis uncover the types of phenomena that Tadmor has uncovered for family?
  2. What is a concept for Tadmor, and how can her work inform our notion of a concept?

In response to the first question, it is certainly possible that distributional analysis can reflect changing referents (such as ‘father-in-law’ referred to as father). Hypothetically, the distribution of father with a broad array of referents might entail a broad array of lexical co-occurrences. In practice, however, this might be very, very difficult to discern. Hence Tadmor’s call for close reading. It is perhaps more likely that the sense (as opposed to referent) of father as ‘a relationship involving social expectations’ might be reflected in co-occurrence data: hypothetically, father might co-occur with words related to social expectation and obligation. We have evidence that semantically related words tend to constitute only about 30% of significant co-occurrences. Optimistically, it might be that the remaining 70% of words do suggest semantic relationships, if we know how to interpret them—in this case, maybe some co-occurrences with family would suggest the referents or implications discussed here. Pessimistically, it might be that if only 30% of co-occurring words are semantically related, then there would be an even lower probability of finding co-occurring words that reveal such fine semantic or pragmatic nuances as these. Thanks to Tadmor’s work, Linguistic DNA might be able to use family as a test case for what can be revealed by distributional analysis.

What is a concept? Tadmor (1996) doesn’t define concept, and sometimes switches quickly, for example, between discussing the concept ‘family’ and the word family, which can be tricky to follow. At times, concept for Tadmor seems to be similar to definition—a gloss for a term. At other times, concept seems to be broader, suggesting something perhaps with psycholinguistic reality, a sort of notion or idea held in the mind that might relate to an array of terms. Or, concept seems to relate to discourses, to shared social understandings that are shaped by language use. Linguistic DNA is paying close attention to operationalising and/or defining concept in its approach to conceptual and semantic change in EModE. Tadmor’s work points in the same direction that interests us, and the vagueness of concept which Tadmor engages with is vagueness that we are engaging with as well.

From Data to Evidence (d2e): conference reflections

Fraser and Iona report (November 2015):

Six members of the Linguistic DNA team were present at the recent d2e conference held by the VARIENG research unit at the University of Helsinki, Finland. The focus of the conference was on tools and methodologies employed in corpus linguistics, whilst the event took for its theme ‘big data, rich data, uncharted data’. The conference offered much food for thought, raising our awareness of the tools and methods employed by other researchers in similar fields. Frequently it was clear that despite the differences between the goals of, for example, sociolinguistics and historical semantics, the knowledge and approach towards data taken by one could be effectively and productively applied to another.

The conference’s plenary speeches were of particular interest. Tony McEnery delineated potential limitations of corpus data and its analysis. His call for researchers to remain aware of the limitations of their data struck a chord with our findings from close examination of EEBO data in its raw and processed forms. One of his main conclusions was the importance of conducting cyclical research: analysing the data with software tools and then returning to the data itself to verify the validity of the findings. LDNA is set up to follow this approach, and Professor McEnery’s presentation reaffirmed its importance. Plenaries by Jane Winters and Päivi Pahta looked further into working with historical data and, in the latter particularly, historical linguistic data, whilst a fascinating presentation by Mark Davies emphasised the importance of corpus size in the type of research which we are undertaking.

LDNA is also taking an active interest in innovative approaches to data analysis and visualisation. Demonstrating software, Gerold Schneider, Eetu Mäkelä, and Jonathan Hope each showcased new tools for representing historical language data and wrangling with metadata. As we progress in our thinking about the kinds of processing which will allow us to identify concepts in our data, we are always on the lookout for ideas and methodological developments which might help us to improve our own findings.

Several research papers connected with the interests of LDNA, especially when they adhered closely to the conference’s theme of exploring large and complex datasets in ways which reveal new patterns in the data. James McCracken’s presentation on adding frequency information to the Oxford English Dictionary was very exciting for the possibilities it could open up for future historical linguistics. (We’ve blogged before about the drawback of not having relevant frequency data when using tools like VARD.) Meanwhile, the techniques used to track change in words’ behaviour, with different dimensions of semantic evolution scrutinised by Hendrik De Smet (for Hansard), Gerold Schneider (in COHA), and Hannah Kermes and Stefania Degaetano-Ortlieb of Saarland University (working with the Royal Society Corpus), were not only intrinsically fascinating but also provided useful pointers towards the depth and complexity of linguistic features LDNA will need to consider. We will also aim to keep in view Joseph Flanagan’s insistence that linguistic studies should aim for reproducibility, an insistence aided (for those who code with R) by the suite of tools he recommended.

The d2e conference packed a lot into a few days, creating an intense and productive atmosphere in which participants could meet, exchange ideas, and become more aware of the scope of others’ work in related fields. We enjoyed the conversations around our own poster, and much appreciated the hospitality throughout. It was a great opportunity for the LDNA team, providing more invaluable input to our thought and approach to our work.

——-

Abstracts from the conference are available from the d2e pages on the Varieng website.

Anni Aarinen provides a write-up of McEnery’s keynote.

Glasgow-based LDNA member Brian Aitken has written up his d2e experience on the Digital Humanities blog.

Distributional Semantics I: What might distribution tell us about word meaning?


In a previous post, I asked ‘What is the link between corpus data showing lexical usage, on the one hand, and lexical semantics or concepts, on the other?’ In this post, I’d like to forward that discussion by addressing one component of it: how we observe lexical semantics (or word meaning) via distributional data in texts. That is, how do we know what we know about semantics from distributional data?

Linguists use proximity data from corpora to analyse everything from social implications of discourse, to politeness in pragmatics, to synonymy and hyponymy. Such data is also used by researchers in statistical natural language processing (NLP) for information retrieval, topic identification, and machine learning, among other things. Different researchers tend to use such data towards different ends: for some NLP researchers, it is enough to engineer a tool that produces satisfactory outputs, regardless of its implications for linguistic theory. For sociolinguists and discourse analysts, the process is often one of identifying social or behavioural trends as represented in language use (cf. Baker et al. 2013, Baker 2006). Despite the popularity of studies into meaning and corpora, the question of precisely what sorts of meaning can or can’t be indicated by such data remains remarkably under-discussed.

So, what aspects of meaning, and of word meaning in particular, might be indicated by proximity data?

Many introductory books on corpus semantics would seem to suggest that if you want to know what kinds of word meaning can be indicated by proximity data and distributional patterns, examining a list of co-occurring words, or words that occur in similar contexts, is a good start. Often, the next step (according to the same books) is to look closely at the words in context, and then to perform a statistical analysis on the set of co-occurrences. The problem arises in the last step. All too often, the results are interpreted impressionistically: which significant co-occurrences are readily interpretable in relation to your research questions? You may see some fascinating and impressive things, or you may not, and it’s too easy to disregard outputs that don’t seem relevant on the surface.
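For concreteness, here is one minimal version of that statistical step: a PMI-style association score for a node word’s window collocates. The window size and frequency threshold are arbitrary illustrative choices, and other measures (log-likelihood, t-score) are equally common.

```python
import math
from collections import Counter

def pmi_collocates(tokens, node, window=4, min_count=5):
    """Rank words co-occurring within +/-`window` tokens of `node` by
    how much more often they appear near the node than their overall
    corpus frequency would predict (a PMI-style score)."""
    total = len(tokens)
    word_freq = Counter(tokens)
    co_freq = Counter()
    node_slots = 0
    for i, w in enumerate(tokens):
        if w != node:
            continue
        lo, hi = max(0, i - window), min(total, i + window + 1)
        node_slots += hi - lo - 1
        for j in range(lo, hi):
            if j != i:
                co_freq[tokens[j]] += 1
    scores = {}
    for w, c in co_freq.items():
        if c >= min_count and node_slots:
            p_near = c / node_slots           # probability of w inside the node's windows
            p_overall = word_freq[w] / total  # probability of w anywhere in the corpus
            scores[w] = math.log2(p_near / p_overall)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The ranked list such a measure returns is exactly the kind of output whose interpretation, as argued below, too often proceeds impressionistically.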

An operation like that described above lacks rigour in multiple ways. To disregard outputs that aren’t obviously relevant is to ignore what is likely to be some of the most valuable information in any corpus study (or in any scientific experiment). In addition, the method skips the important step of accounting for the precise elements of meaning in question, and how (or indeed whether) those elements might be observed in the outputs.

In Early Modern English, an analysis of proximity data might (hypothetically) show a significant similarity between the terms abode and residence. Such pairs are straightforward and exciting: we can readily see that we have automatically identified near-synonyms.

Often, researchers are looking to identify synonymy. But that’s not all: researchers might also be after hyponymy, co-hyponymy, antonymy, meronymy, auto-hyponymy, polysemy, or conceptual or discursive relations. In addition, as Geeraerts (2010: 178) points out, we might want to find out specific details about what a noun referent looks like, for example. Can we retrieve any of that information (reliably or consistently) from distributional data, i.e. from co-occurrences in texts?

Examples like abode and residence aren’t the norm. We also see examples like build and residence. What is the meaning relation here? Action and undergoer? A conceptual field related to building residences? Something else entirely?

And what about other pairs of terms with no clear semantic relation whatsoever? Do we disregard them? Impressionistically, it’s easy to pick out the instances of synonymy, or even relationships like Action/Undergoer or Agent/Patient, and to ignore the huge number of semantically unrelated collocates (or collocates with less obvious relations). But that’s not a terribly rigorous method.

By definition, we know that in proximity data, we are observing words that co-occur. That leaves us to test what kinds of semantic relations are actually indicated, quantitatively, by co-occurrence. This moves us from the vague statement that words are known by the company they keep, towards a scientific account of the relationship between co-occurrence and semantic relations. In the next post (coming soon), I report on exactly that.

References

Baker, P. (2006) Using Corpora in Discourse Analysis. London: Continuum.

Baker, P., Gabrielatos, C. and McEnery, T. (2013) Discourse Analysis and Media Attitudes: The Representation of Islam in the British Press. Cambridge: Cambridge University Press.

Geeraerts, Dirk. 2010. Theories of Lexical Semantics. Oxford: Oxford University Press.


Workshop Reflections

University of Sussex--view of campus from above

A fortnight ago, our first methodology workshop was held at the University of Sussex. It was a full and productive programme for the project team, with lots of opportunities for us to test out our thinking about how we move forward, and it has given us plenty to think about. We can perhaps best summarise some of the overarching themes by starting with the questions we began with and some more that were raised during the event.

Top in our minds going in were questions such as ‘What is a concept?’ How will we recognise one when we find it? How exactly do we (should we) go about finding concepts in the first place? Our thinking on these matters has taken a step forward thanks to these discussions, and the next couple of blog posts are already in preparation to explore what we’ve learned and the directions that this suggests for us in the coming months. Suggestions that were raised included investigating synonymous terms, and the relationships between onomasiological conceptual fields. Our ideas are to some extent still forming as we consider these suggestions afresh and work on developing our ideas in the process.

Another major question was the importance of marking up and pre-processing the data before we begin to run our own processes. The issue of spelling regularisation has formed a large part of our initial work on the data of EEBO, with our comparison of the VARD and MorphAdorner tools being documented in several earlier posts. It is not only spelling that is at issue; pre-processing texts with MorphAdorner and the Historical Thesaurus Semantic Tagger also offers further layers of annotation. Because our new processes can be designed to take in multiple types of input (e.g. lemma, part of speech), or combinations of these, we were curious to learn what workshop participants thought we should prioritise.

There was extensive discussion about the extent to which the text should be pre-processed before being loaded into the parser, and there was some disagreement over whether spelling regularisation is itself a necessary step or whether it ought not to be used because it skews word frequency counts. Whether or not an individual method of pre-processing proves fruitful – or, indeed, if it is better to process the raw text itself – it is ultimately to our benefit to have explored these avenues and to be able to say with authority what has been successful and what has not.
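A toy example of why regularisation matters for frequency counts (the variant spellings and the mapping below are purely illustrative, and do not reflect VARD’s or MorphAdorner’s actual behaviour):

```python
from collections import Counter

def frequencies(tokens, variant_map=None):
    """Word frequencies with optional spelling regularisation applied
    via a mapping from variant spellings to regularised forms."""
    if variant_map:
        tokens = [variant_map.get(t, t) for t in tokens]
    return Counter(tokens)

variant_map = {"loue": "love", "vniuersity": "university"}
raw = ["loue", "love", "loue", "vniuersity"]

print(frequencies(raw))               # variants counted separately
print(frequencies(raw, variant_map))  # variants merged, changing the counts
```

Whichever choice is made, the point raised at the workshop stands: the decision changes the counts on which all downstream statistics rest, so it needs to be made and reported explicitly.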

A final core point was the question of the technology which we plan to use and how we can build on the most effective tools already produced for linguistic research.  As the Digital team at Sheffield (@HRIDigital) are beginning work on building the parser, we wanted to consider what parts of that process should be created from scratch and what parts can be effectively accomplished using software which already exists.

In the course of presentations and discussions, participants drew our attention to a variety of tools. We have prioritised these for our attention, including those for identifying synonymy and polysemy, word sense disambiguation, novel sense detection, and topic identification. The result is fresh ideas for some technologies to investigate, and so the research associates have got to work learning about tools such as Gensim, HiDEx (High Dimensional Explorer), and BlackLab.
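By way of illustration, a minimal Gensim experiment of the kind the research associates might run looks like the sketch below (assuming gensim 4.x; the toy sentences stand in for tokenised, regularised EEBO passages, and the parameter values are placeholders rather than project settings):

```python
from gensim.models import Word2Vec

# Toy corpus standing in for tokenised EEBO passages
sentences = [
    ["the", "abode", "of", "the", "king"],
    ["the", "residence", "of", "the", "king"],
    ["a", "new", "residence", "was", "built"],
]

model = Word2Vec(sentences, vector_size=50, window=5, min_count=1, workers=2)

# Nearest distributional neighbours of a word under this toy model
print(model.wv.most_similar("residence", topn=5))
```

Experiments like this help us judge whether an off-the-shelf tool can be inspected and explained well enough to meet the ‘no black box’ principle described below.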

From the very start, we have been clear that we want to be able to understand and explain as much as possible how our processes work, rather than create something which acts as a ‘black box’, devouring input and producing results in a manner that cannot be evaluated or understood. Conducting these discussions while we’re still in the design phase has helped reinforce the value of that ideal for the team.

We firmly believe that drawing on the expertise and experience of the academic community in our field will make Linguistic DNA a stronger project. The workshop helped to progress our thinking, and we’d like to thank again everyone who attended the event—your input is hugely appreciated, and we look forward to sharing with you where it takes us!

The Historical Thesaurus of English and its Related Projects

One of the resources which the Linguistic DNA project is drawing on is the Historical Thesaurus of English. Organising every word in the language, present and past, into a hierarchical structure based on word-meaning, the Historical Thesaurus is an invaluable tool for historical semantic research. The data from the Thesaurus will be involved in the internal workings of the parser programme being developed at the Sheffield Humanities Research Institute (HRI), and be present in the annotated EEBO and ECCO corpora with which the parser is working.

Structure of the Thesaurus

At its top level, the Historical Thesaurus breaks the vocabulary of English into three main categories – ‘The external world’, ‘The mental world’, and ‘The social world’. These are further subdivided so that, for example, ‘The external world’ contains within it the categories ‘The earth’, ‘Life’, ‘Health and disease’, ‘People’, and ‘Animals’, amongst others. This subdivision continues to a maximum depth of seven levels, with a category number being assigned at each level. As a result, the category ‘Daily record/journal’, for instance, has the category number 03.09.06.01.02 (noun), comprised of the following steps:

03                                Society

03.09                           Communication

03.09.06                      Record

03.09.06.01                 Written record

03.09.06.01.02            Daily record/journal

This category, then, contains all the recorded words for a journal (journal, day-book, diary, memorial, ephemeris, diurnal, journal-book, diet-book), accompanied by the date ranges in which those words are known to have been used.
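Because each category number encodes its own ancestry, the chain of ancestor categories can be reconstructed from the code alone. A small sketch using the example above (the mapping from codes to labels is assumed to come from the Thesaurus data):

```python
def category_path(code, labels):
    """Expand a Historical Thesaurus-style category number into the
    chain of ancestor categories it implies."""
    parts = code.split(".")
    return [(".".join(parts[:i + 1]), labels.get(".".join(parts[:i + 1])))
            for i in range(len(parts))]

labels = {
    "03": "Society",
    "03.09": "Communication",
    "03.09.06": "Record",
    "03.09.06.01": "Written record",
    "03.09.06.01.02": "Daily record/journal",
}

for code, name in category_path("03.09.06.01.02", labels):
    print(code, name)
```

This nested numbering is what makes it straightforward to group words at any level of granularity, from broad domains down to fine-grained categories.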

Each of the seven main category levels also contains subcategories where required, allowing an exceptionally fine-grained organisation of the semantic content of the language. The Thesaurus provides, therefore, a rich seam to be mined for information on lexical and conceptual development in the English-speaking world for the past two millennia.

Related Projects

Since the completion of the first edition of the Thesaurus in 2009, projects have begun to drill down into its data. A major project, Mapping Metaphor with the Historical Thesaurus, investigated every word in the Thesaurus in order to identify systematic metaphorical links between categories. Its primary output is a complete ‘metaphor map’ of the language, which provides fascinating insights into the ways in which certain concepts are discussed in terms of others. It also demonstrates strikingly just how prevalent metaphor is in the language at the level of individual words.

The SAMUELS project utilised the Thesaurus in an entirely different way, aiming to create semantic tagging software capable of labelling every word in a text with the code of the category in which that word sense can be found in the Thesaurus. This is no easy feat, given that some words have several hundred potential meanings – ‘set’, for example, has 345 entries (not including those where it is part of a multi-word phrase), whilst ‘run’ has 302. The semantic tagging tool was created and tested by a consortium of researchers based at the Universities of Lancaster, Glasgow, Huddersfield, and the University of Central Lancashire. It is currently the only software capable of assigning word meanings based on dating information, and this diachronic tagging ability allows it to be used on texts such as those contained in EEBO and ECCO with a high degree of accuracy. The tagged Hansard corpus, comprising all the speeches made in the British Houses of Parliament between 1803 and 2005, is publicly available via Mark Davies’ corpus website at Brigham Young University, with the tagged EEBO corpus to follow.
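The dating constraint alone can be illustrated with a small sketch (the sense entries and date ranges below are hypothetical, and the real SAMUELS tagger of course also uses context, not just dates, to disambiguate):

```python
def candidate_senses(word, year, sense_index):
    """Return the Thesaurus category codes of a word's senses that were
    in recorded use in the given year. `sense_index` maps words to
    (category_code, first_year, last_year) tuples; a last_year of None
    means the sense is still current."""
    return [code for code, first, last in sense_index.get(word, [])
            if first <= year and (last is None or year <= last)]

# Hypothetical entries for 'journal' (the first code follows the example above;
# all dates and the second code are invented for illustration)
sense_index = {
    "journal": [
        ("03.09.06.01.02", 1565, None),  # daily record/journal
        ("01.05.05", 1400, 1650),        # hypothetical obsolete sense
    ],
}

print(candidate_senses("journal", 1620, sense_index))  # both senses still possible
print(candidate_senses("journal", 1800, sense_index))  # only the surviving sense
```

Filtering by date in this way narrows the hundreds of potential meanings of a word like ‘set’ before any contextual disambiguation is attempted.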

The Historical Thesaurus and Linguistic DNA

The output of the SAMUELS project forms a large part of the input to Linguistic DNA, in the form of the semantically-tagged EEBO corpus. It is hoped that the annotation of every word in the EEBO corpus with a Historical Thesaurus meaning code will allow more accurate automatic evaluation of word groupings which might constitute the kinds of concept that LDNA is looking to identify. The exact method in which this might be employed in the development of the HRI’s parser software is something the project will explore in the coming months.

One of the most interesting ways in which the Historical Thesaurus can be used in the Linguistic DNA project is to contrast the concepts constructed by the parser with the categories of the Thesaurus. This will allow the team to identify where the automatically identified concepts and their associated lexis accord with the decisions made manually by a team of lexicographers. Differences will, therefore, potentially provide areas for further research, perhaps indicating facets of the parser which can be improved or instigating evaluation of lexicographical data in light of the findings from the EEBO and ECCO textual corpora.
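One simple way such a comparison might be operationalised (an assumption on our part, not a description of the parser) is to score each automatically identified word grouping against each Thesaurus category’s word list, for example by Jaccard overlap:

```python
def best_matching_category(cluster, thesaurus_categories):
    """Return the Thesaurus category whose word list overlaps most with
    an automatically identified word grouping, scored by Jaccard overlap."""
    cluster = set(cluster)
    best, best_score = None, 0.0
    for name, words in thesaurus_categories.items():
        words = set(words)
        union = cluster | words
        score = len(cluster & words) / len(union) if union else 0.0
        if score > best_score:
            best, best_score = name, score
    return best, best_score
```

Low-scoring clusters would then be the interesting cases: groupings the parser proposes that the lexicographers’ categories do not anticipate, or vice versa.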

The Glasgow-based team working on Research Theme 3 is also interested in approaching the Thesaurus categories from a more statistical viewpoint, looking for significant shifts in the size of the vocabulary associated with Thesaurus semantic categories, investigating the words involved in the context of their use in EEBO.

These are strong starting points for the use of the Historical Thesaurus as part of the investigation procedures of the Linguistic DNA project, though they are not exhaustive and the project team is always open to new angles of employing EEBO data and Thesaurus data in combination.