Author Archives: Seth Mehl

Looking back, looking forward: Linguistic DNA in 2016 and 2017

As we move into 2017, we’ve been looking back at achievements in 2016, and ahead to what we aim to achieve in the coming year.

2016 was an outwardly busy year as we travelled to Bruges, Essen, Krakow, Lausanne, Leeds, Brighton, Murcia, Nottingham, Paris, Saarbrucken, and Utrecht, sharing more of our thinking and early data with different audiences. Closer to “home”, we benefitted from the exchange of ideas with LDNA-hosted panels at Sheffield DH Congress and our second methodological workshop in Sussex. In 2017, we will be focusing back on our interface development and some more in-depth research, though we intend to be present at DH, SHEL, ICAME and SHARP, in order to continue some fruitful conversations.

On the blog, we have been reflecting on representativeness and the nature of EEBO-TCP. We’ve also documented our decision not to use ECCO’s OCR data to analyse eighteenth century print. You can expect to hear about the alternative 18th century datasets we’re choosing to work with later in 2017.

During the Autumn, the LDNA researchers collaborated on two articles about the project, its theory and praxis, both (hopefully) to be published this year following peer review. Generating examples from each research theme based on our early data and tying these together effectively was an enjoyable challenge, and we have already used the draft of one piece as part of our briefing materials for upcoming MA placements at The Digital Humanities Institute | Sheffield (formerly known as HRI Digital).

In the past six months, the Sheffield team have captured funding for two additional applications of the Linguistic DNA “concept modelling” tools:

  • The ESRC project Ways of Being in a Digital Age combines our quantitative insights with a qualitative literature survey of academic publications. Scheduled to inform the ESRC’s next programme of digital society funding, this impact-full study has compelled us toward rapid prototype development. The interface being put together to serve ‘WoBDA’ colleagues will also form the kernel of the subsequent LDNA workbench.
  • From next month, we are involved in another funded impact-related project, collaborating with the University of Leeds to explore the conceptual structure of millions of YouTube video comments on the theme of militarisation, as part of a larger project funded by the Swedish Research Council. This is a six-month commitment, bringing in a further research associate to theorise what’s involved in applying our measures to some very different data.

We also have three significant applications in place for other pots of funding, including Horizon 2020 collaborations, attesting confidence about our nascent processes and the multifarious opportunities for their application and impact.

Meanwhile, Glasgow has been using the present word co-occurrence data to develop its methodology for investigating processor data from the perspective of key Historical Thesaurus categories. We have continued to develop analysis of Thesaurus categories, looking for those which show abnormal instances of growth or decline; a provisional methodology for establishing statistical ‘baselines’ has been plotted out which is now being implemented and refined. Further possibilities are being tested, such as amalgamating data across whole layers of the HT hierarchy rather than by individual category, and the effects of separating out part of speech within categories or layers.

What does EEBO represent? Part II: Corpus linguistics and representativeness

What exactly does EEBO represent? Is it representative?

Often, the question of whether a corpus or data set is representative is answered first by describing what the corpus does and does not contain. What does EEBO contain? As Iona Hine has explained here, EEBO contains Early Modern English, but it is much larger than that in some ways, and also much more limited than that. EEBO contains many languages other than English, which were printed in the British Isles (and beyond) between 1476 and 1700. But EEBO is also limited: it contains only print, whereas Early Modern English was also hand-written and spoken, across a large number of varieties.

Given that EEBO contains Early Modern print, does EEBO represent Early Modern print? In order to address this question meaningfully, it’s crucial first to define representativeness.

In corpus linguistics, as in other data sciences and in statistics, representativeness is a relationship that holds between a sample and a population. A sample represents a larger population if the sample was obtained rigorously and systematically in relation to a well-defined population. If the sample is not representative in this way, it is an arbitrary sample or a convenience sample – i.e. it was not obtained rigorously and systematically in relation to a well-defined population. Representativeness allows us to examine the sample and then draw conclusions about the population. This is a fundamental element of inferential statistics, which is used in data science from epidemiology to corpus linguistics.

Was EEBO sampled systematically and rigorously in relation to a well-defined population? Not at all. EEBO was sampled arbitrarily, by convenience – first, including only texts that have (arbitrarily) survived; then including texts that were (arbitrarily) available for scanning and transcription; and, finally, including those texts that were (arbitrarily) of interest to scholars involved with EEBO at the time. Could we, perhaps, argue that EEBO represents Early Modern print that survived until the 21st century, was available for scanning and transcription, and (in many cases) was of interest to scholars involved with the project at the time? I think we would have to concede that EEBO wasn’t sampled systematically and rigorously in relation to that definition, and that the arbitrary elements of that population render it ill-defined.

So, what does EEBO represent? Nothing at all.

It’s difficult, therefore, to test research questions using inferential statistics. For example, we might be interested in asking: Do preferences for the near-synonyms civil, public, and civic change over time in Early Modern print? We can pursue such a question in a straightforward way, looking at frequencies of each word over time, in context, to see if there are changes in use, with each word rising or falling in frequency. In fact, we can quite reliably discern what happens to these preferences within EEBO. But our question, as stated, was about Early Modern print. It is the quantitative step from the sample (EEBO) to the population (Early Modern print) that is problematic. Suppose that we do find a shifting preference for each of these words over time. Because EEBO doesn’t represent the population of Early Modern print in any clear way, we can’t rely on statistics to conclude that that this is in fact a correlation between preferences and time – or if it is, instead, an artefact of the arbitrariness of the sampling. The observation might be due to any number of textual or sociolinguistic variables that were left undefined in our arbitrary sample – including variation in topics, or genres, or authorial style, or even authors’ gender, age, education, or geographic profile.

It as though we were testing children’s medication on an arbitrary group of people who happened to be walking past the hospital on a given day. That’s clearly a problem. We want to be sure that children’s medication was tested on children – but not simply children, because we also want to be sure that it isn’t tested on children arbitrarily sampled, for example, from an elite after-school athletics programme for 9-year-olds that happens to be adjacent to the hospital. We want the medication to be tested on a systematic cross-section of children, or on a group of children that we know is composed of more and less healthy kids across a defined age range, so that we can draw conclusions about all children, based on our sample. If we use a statistical analysis of EEBO (an arbitrary sample) to draw conclusions about Early Modern print (a population), it’s as though we’re using an arbitrary sample of available kids to prove that a medication is safe for the population of all kids. (Linguistics is a lot safer than epidemiology.)

If one were interested in reliably representing extant Early Modern print, one might design a representative sample in various ways. It would be possible to systematically identify genres or topics or even text lengths, and ensure that all were sampled. If we took on such a project, we might want to ensure sampling all genders, education levels, and so on (indeed, historical English corpora such as the Corpus of English Dialogues, or ARCHER, are systematically sampled in clear ways). We would need to take decisions about proportionality – if we’re interested in comparing the writing of men and women, for example, we might want large, equal samples of each group. But if we wanted proportional representation across the entire population of writers, we might include a majority of men, with a small proportion of women – reflecting the bias in Early Modern publishing. Or, we might go further and attempt to represent not the bias in Early Modern publication, but instead the bias in Early Modern reception, attempting to represent how many readers actually read women’s works compared to men’s works (though such metadata isn’t readily available, and obtaining it would be a project in itself). Each of these decisions might be appropriate for different purposes.

So, what are we to do? LDNA hasn’t thrown stats out the window, nor have we thrown EEBO out the window. But we are careful to remember that our statistics are describing EEBO rather than indicating conclusions about a broader population. And we haven’t stopped there – we will draw conclusions about Early Modern print, but not via statistics, and not simply via the sample that is EEBO. Instead, we will draw such conclusions as close readers, linguists, philologists, and historians. We will use qualitative tools and historical, social, cultural, political, and economic insights about Early Modern history, in systematic and rigorous ways. Our intention is to read texts and contexts, and to evaluate those contexts in relation to our own knowledge about history, society, and culture. In other words, we are taking a principled interpretive leap from EEBO to Early Modern print. That leap is necessary, because there’s no inherent representative connection between the two.

Sociolinguistics Symposium 21: conference reflections

Linguistic DNA Co-Investigator Justyna Robinson attended the Sociolinguistics Symposium at the University of Murcia, Spain, 15-18 June. This year’s conference theme was ‘attitudes and prestige’, and the event included over 1,000 presentations. Justyna represented LDNA with a poster in the general poster session entitled ‘Linguistic DNA: Modelling concepts and semantic change in English, 1500-1800’. Below, she reflects on her experience:

For the LDNA project, one of the really important panel sessions was the one organised by Terttu Nevalainen and Marijke van der Wal, entitled Historical sociolinguistics: Dispelling myths about the past. The session included papers which aimed at revisiting a range of assumptions about the past and the study of the past that are not supported by historical sociolinguistic research. In doing so, particularly important for LDNA, were papers of a methodological nature in which methodologies of historical linguistic research were interrogated. For example, in ‘People, work, values: Tracing societal change through linguistic shifts’, Minna Palander-Collin, Anni Sairio, Minna Nevala, and Brendan Humphries (University of Helsinki) explored social changes in Britain between 1750 and 1900 by analysing keywords within the conceptual domains of PEOPLE, WORD, and VALUES. Questions that emerged from the discussion of this research included whether social shifts can be identified in keywords. This quickly led to asking what concepts are and what kind of relationship exists between keywords and concepts. Although the answer to this question wasn’t decided, there was a unanimous desire to explore the question further. Another observation from Palander-Collin et al.’s talk was that certain concepts can linger on in language when in practice the real–life referents designated by the concepts may be long gone. Miriam Meyerhoff added that this issue was also observed in New Zealand data, i.e. researchers looking at Maori keywords found out that references to certain plants lingered on in narratives of a community, well after the time these plants were used. In this discussion the audience continued to reference the LDNA project as well. It was great to hear that more and more people know about LDNA and are following our progress.

The LDNA poster presentation was set up in a beautiful setting in one of the cloisters at Murcia University. The poster attracted a lot of attention. In it, we presented first findings from using positive and negative PMI values to model discursive concepts around the word soldier in the window of +/-100 words. Having set such a large proximity window. we did not initially know whether what we would find would be interesting and useful in our quest to determine what concepts are. One conclusion from this analysis was that large proximity windows still yield meaningful information and clear semantic domains emerge that are important in grasping the discursive concepts around soldier. Another methodological finding of this research is the value of using negative PMI values in improving our understanding of what concepts are. Thus, soldier shows a notably rare association with a group or items that are a semantically cohesive group. These include religious terms, such as sin and church. One may ask whether this systematic weak correlation may indicate the end of a disappearing concept or the beginning of the development of a new concept. These questions will be soon answered by looking at our data diachronically.

_______________________

Abstracts from the conference are available from the conference website.

Conference report: Diachronic corpora and genre in Nottingham

On Friday 8 April 2016, Susan Fitzmaurice and Seth Mehl attended Diachronic corpora, genre, and language change at the University of Nottingham, where Seth gave a paper entitled Automatic genre identification in EEBO-TCP: A multidisciplinary perspective on problems and prospects. The event featured researchers from around the globe, exploring issues in historical data sets; the nature of genre and text types; and modelling diachronic change.

The day’s plenary speeches were engaging and insightful: Bethany Gray spoke about academic writing as a locus of linguistic change, in contrast to the common expectation that change originates in spoken language. This is particularly relevant for those of us working with older historical data, such that written language is our only evidence for change. Thomas Gloning described the Deutsche Textarchiv, and in particular the recent addition to that corpus of the Dingler Corpus, an essential record of written scientific German representing 1820 to 1932. Gloning presented the useful definition of text types or genres as ‘traditions of communicative action’. In analysing such text types, or traditions, it is possible to map syntax and lexis to text functions and topics, though Gloning cautions that some of the most important elements of such mapping are not currently achievable by machines. This is a careful, valuable perspective and approach, which relates to our own (as discussed below).

Other research papers included a presentation by Fabrizio Esposito who, like the Linguistic DNA project, is using distributional semantic methods. His work looks at recent change in White House Press Briefings. Bryan Jurish presented DiaCollo, a powerful tool for analysing and visualising collocation patterns as they change over time in very large data sets. Vaclav Brezina analysed lexical meaning in EEBO-TCP by measuring differences in collocation patterns across overlapping, sliding diachronic windows.

What did LDNA contribute?

LDNA is asking whether specific concepts emerge uniquely in particular genres, and whether and how those concepts are then adopted and adapted in other genres. Genre is a fuzzy concept, representing categories of texts. Such categories are characterised by formal features such as print layout, phonetics, morphosyntax, lexis, and semantics; and functional features such as purpose of composition, reader expectations, and social and cultural contexts. It is productive to distinguish approaches to genre in different contexts. For Early Modern Studies, categories may be inherited in the canon, and questioned and explored in relation to literature, history, or philosophical or cultural studies; corpus linguistics, often seeks a scientifically reproducible approach to genre and aims to learn about language and variation; while Natural Language Processing (NLP)often aims to engineer tools for solving specific tasks. At the Nottingham conference, Seth illustrated his remarks by reflecting on Ted Underwood’s work automatically identifying genres in HathiTrust texts via supervised machine learning. He then laid out the project’s plan of investigating genre (or text types) by categorising Early Modern texts using the outputs of the LDNA processor, alongside other formal text features. This relates to Gloning’s aforementioned assertion that text topic and function might be mapped onto syntax and lexis; in our case, it is a combined mapping of discursive topics or conceptual fields, lexis, morphosyntax, and additional formal features such as the presence of foreign words or the density of punctuation or parts of speech that will allow us to group texts into categories in a relatively data-driven way.

The conference was very well organised by Richard J. Whitt, with a lovely lunch and dinner in which attendees shared ideas and dug further into linguistic issues. Susan and Seth were delighted to participate.

LDNA’s first year: Reflections from RA Seth Mehl

In wrapping up the first year of LDNA, I’ve taken a moment to consider some of the over-arching questions that have occupied much of my creative and critical faculties so far. What follows is a personal reflection on some issues that I’ve found especially exciting and engaging.

Semantics and concepts

The Linguistic DNA project sets out to identify ‘semantic and conceptual change’ in Early Modern English texts, with attention to variation too, particularly in the form of semantic and conceptual variation across text types. The first questions, for me, then, were what exactly constitutes semantics and what we mean when we say concept. These are, in part, abstract questions, but they must also be defined in terms of practical operations for computational linguistics. Put differently, if semantics and concepts are not defined in terms of features that can be identified automatically by computer, then the definitions are not terribly useful for us.

My first attempt at approaching semantics and concepts for the project began with synonymy, then built up to onomasiological relationships, and then defined concepts as networks of onomasiological relationships. Following Kris Heylen’s visit, I realised just how similar this approach was to the most recent QLVL work. My next stab at approaching these terms moved towards an idea of encyclopaedic meaning inspired in part by the ‘encyclopaedic semantics’ of Cognitive Linguistics, and related to sets of words in contexts of use. This approach seemed coherent and effective. We have since come to define concepts, for our purposes, as discursive, operating at a level larger than syntactic relations, phrases, clauses, or sentences, but smaller than an entire text (and therefore dissimilar from topic modelling).

Operations

Given that the project started without a definition of semantics and concept, it follows that the operationalisation of identifying those terms had not been laid out either. As a corpus semanticist, the natural start for me was to sort through corpus methods for automatic semantic analysis, including collocation analysis, second-order collocations, and vector space models. We continue to explore those methods by sorting through various parameters and variables for each. Most importantly, we are working to analyse our data in terms of linguistically meaningful probabilities. That is, we are thinking about the co-occurrence of words not simply as data points that might arise randomly, but as linguistic choices that are rarely, if ever, random. This requires us to consider how often linguistic events such as lexical co-occurrences actually arise, given the opportunity for them to arise. If we hope to use computational tools to learn about language, then we must certainly ensure that our computational approaches incorporate what we know about language, randomness, and probability.

Equally important was the recognition that although we are using corpus methods, we are not working with corpora, or at least not with corpora as per standard definitions. I define a corpus as a linguistic data-set sampled to represent a particular population of language users or of language in use. Corpus linguists examine language samples in order to draw conclusions about the populations they represent. EEBO and ECCO are, crucially, not sampled to represent populations—they are essentially arbitrary data sets, collected on the basis of convenience, of texts’ survival through history, and of scholarly interest and bias, among other variables. It is not at all clear that EEBO and ECCO can be used to draw rigorous conclusions about broader populations. Within the project, we often refer to EEBO and ECCO as ‘universes of printed discourse’, which renders them a sort of population in themselves. From that perspective, we can conclude a great deal about EEBO and ECCO, and the texts they contain, but it is tenuous at best to relate those conclusions to a broader population of language use. This is something that we must continually bear in mind.

Rather than seeing the LDNA processor as a tool for representing linguistic trends across populations, I have recently found it more useful to think of our processor primarily as a tool to aid in information retrieval: it is useful for identifying texts where particular discursive concepts appear. Our tools are therefore expected to be useful for conducting case studies of particular texts and sets of texts that exemplify particular concepts. In a related way, we use the metaphor of a topological map where texts and groups of texts exemplifying concepts rise up like hills from the landscape of the data. The processor allows us to map that topography and then ‘zoom in’ on particular hills for closer examination. This has been a useful metaphor for me in maintaining a sense of the project’s ultimate aims.

All of these topics represent ongoing developments for LDNA, and one of the great pleasures of the project has been the engaging discussions with colleagues about these issues over the last year.

A Theoretical Background to Distributional Methods (pt. 2 of 2)

Introduction

In the previous post, I presented the theoretical and philosophical underpinnings of distributional methods in corpus semantics. In this post, I touch on the practical background that has shaped these methods.

Means of analysis

The emergence of contemporary distributional methods occurs alongside the emergence of Statistical Natural Language Processing (NLP) in the 1990s. Statistical NLP relies on probabilistic methods to represent language, annotate terms in texts, or perform a number of additional tasks such as topic identification or information retrieval. By analysing what actually happens in huge numbers of texts, statistical NLP researchers not only describe naturally occurring language, but also model it and make predictions about it. Corpus semantics is crucially linked to that intellectual development in applied science; specifically, contemporary work with proximity measures and distributional methods in corpus semantics often employs the same computational tools and techniques employed in statistical NLP. The tools are shared, and the underlying stance is shared that a statistical and probabilistic account of language is meaningful. Arguably, other fields in the social sciences (such as psychology), and in the life sciences (such as evolutionary biology), have also been shaped by the rise in statistical and probabilistic methods of representation. Such methods represent an epistemology (and perhaps a discourse) that affects the types of knowledge that are sought and the types of observations that are made in a field.

Other links: Psycholinguistics and Discourse Analysis

The theoretical perspectives outlined above also link corpus semantics, proximity measures, and distributional methods to a larger theoretical framework that includes psycholinguistics and discourse analysis. Frequency of words in use, and frequency of co-occurrence in use, are hypothesised as crucial in human learning and processing of lexical semantics. In very general terms, if we hear or read a word frequently, we’re likely to learn that word more readily and once we’ve learned it, we’re likely to mentally process it more quickly. As noted above, corpora contain valuable frequency data for words in use in specific contexts. Today, corpora are often used as a counterpoint or complement to psycholinguistic research, and many researchers have attempted to model psycholinguistic processes using computational processes including distributional semantics.

There has been a tremendous rise recently in discourse analysis using corpora, and its roots go back at least as far as Sinclair and Stubbs. Discourse analysis itself emerges largely from continental philosophical traditions, particularly Foucault’s definition of discourses as ‘practices which systematically form the objects of which they speak’. These practices are often linguistic, and are studied via linguistic acts, language in use in particular contexts. Such research connects the ontology of language as use with the ontology of meaning as encompassing all of the real-world contexts, topics, etc., that surround a term or a set of terms in use. Corpora allow researchers to ask: ‘Given that speakers or writers are discussing a given term, what other terms do the speakers or writers also discuss, and how do such discussions (as practices or acts) define the objects of which they speak?’

In conclusion

In order to make sense of proximity measures and distributional methods, it is important to grasp the underlying practicalities outlined above, and the broader theoretical framework to which these methods relate (discussed in a previous post). The idea that a word is known by the company it keeps is by no means an a priori fact, but is premised on a framework of linguistics that developed during the 20th century in relation to concurrent developments in philosophy, technology, and the sciences in general.

A theoretical background to distributional methods (pt. 1 of 2)

Introduction

When discussing proximity data and distributional methods in corpus semantics, it is common for linguists to refer to Firth’s famous “dictum”, ‘you shall know a word by the company it keeps!’ In this post, I look a bit more closely at the theoretical traditions from which this approach to semantics in contexts of use has arisen, and the theoretical links between this approach and other current work in linguistics. (For a synopsis of proximity data and distributional methods, see previous posts here, here, and here.)

Language as Use

Proximity data and distributional evidence can only be observed in records of language use, like corpora. The idea of investigating language in use reflects an ontology of language—the idea that language is language in use. If that basic definition is accepted, then the linguist’s job is to investigate language in use, and corpora constitute an excellent source of concrete evidence for language in use in specific contexts. This prospect is central to perhaps the greatest rift in 20th century linguistics: between, on the one hand, generative linguists who argued against evidence of use (as a distraction from the mental system of language), and, on the other hand, most other linguists, including those in pragmatics, sociolinguistics, Cognitive Linguistics, and corpus linguistics, who see language in use as the central object of study.

Dirk Geeraerts, in Theories of Lexical Semantics, provides a useful, concise summary of the theoretical background to distributional semantics using corpora. Explicitly, a valuation of language in use can be traced through the work of linguistic anthropologist Bronislaw Malinowsky, who argued in the 1930s that language should only be investigated, and could only be understood, in contexts of use. Malinowsky was an influence on Firth, who in turn influenced the next generation of British corpus linguists, including Michael Halliday and John Sinclair. Firth himself was already arguing in the 1930s that ‘the complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously’. Just a bit later, Wittgenstein famously asserted in Philosophical Investigations that linguistic meaning is inseparable from use, an assertion quoted by Firth, and echoed by the the philosopher of language John Austin, who was seminal in the development of linguistic pragmatics. Austin approached language as speech acts, instances of use in complex, real-world contexts, that could only be understood as such. The focus on language in use can subsequently be seen throughout later 20th-century developments in the fields of pragmatics and corpus research, as well as in sociolinguistics. Thus, some of the early theoretical work that facilitated the rise of corpus linguistics, and distributional methods, can first be seen in the spheres of philosophy and even anthropology.

Meaning as Contingent, Meaning as Encyclopedic

In order to argue that lexical co-occurrence in use is a source of information about meaning, we must also accept a particular definition of meaning. Traditionally, it was argued that there is a neat distinction between constant meaning and contingent meaning. Constant meaning was viewed as the meaning related to the word itself, while contingent meaning was viewed as not related to the word itself, but instead related to broader contexts of use, including the surrounding words, the medium of communication, real-world knowledge, connotations, implications, and so on. Contingent meaning was by definition contributed by context; context is exactly what is examined in proximity measures and distributional methods. So distributional methods are today generally employed to investigate semantics, but they are in fact used to investigate an element of meaning that was often not traditionally considered the central element of semantics, but instead a peripheral element.

In relation to this emphasis on contingent meaning, corpus linguistics has developed alongside the theory of encyclopedic semantics. In encyclopedic semantics, it is argued that there is any dividing line between constant and contingent meaning is arbitrary. Thus, corpus semanticists who use proximity measures and distributional approaches do not often argue that they are investigating contingent meaning. Instead, they may argue that they are investigating semantics, and that semantics in its contemporary (encyclopedic) sense is a much broader thing than in its more traditional sense.

Distributional methods therefore represent not only an ontology of language as use, but also an ontology of semantics as including what was traditionally known as contingent meaning.

To be continued…

Having discussed the theoretical and philosophical underpinnings of distributional methods here, I will go on to discuss the practical background of these methods in the next blog post.

Distributional Semantics II: What does distribution tell us about semantic relations?

Distributional Semantics II: What does distribution tell us about semantic relations?

In a previous post, I outlined a range of meanings that have been discussed in conjunction with distributional analysis. The Linguistic DNA team is assessing what exactly it can determine about semantics based on distributional analysis: from encyclopaedic meaning to specific semantic relations. In my opinion, the idea that distributional data indicates ‘semantics’ has generally been a relatively vague one: what exactly about ‘semantics’ is indicated? In this post, I’d like to clarify what distribution can tell us about semantic relations in particular, including synonymy, hyponymy, and co-hyponymy.

In the Natural Language Processing (NLP) sphere, numerous studies have tested the effectiveness of distributional data in identifying semantic relations. Turney and Pantel (2010) provide a useful survey of such studies, many of which involve machine learning, and computer performance on synonymy tests including those found on English language exams. Examples of success on synonymy tests have employed windows of anything from +/-4 words up to +/-150 words, but such studies tend not to test various approaches against each other, and they rarely dissect the notion of synonymy, much less co-hyponymy or other semantic relations.

Only a few studies have tested distributional methods as indicators of specific semantic relations. The Quantitative Lexicology and Variational Linguistics (QLVL) team at KU Leuven has addressed this problem in several papers. For example, Peirsman et al. (2007) looked at evidence for synonymy, hyponymy, and co-hyponymy in proximity data for Dutch. (A hyponym is a word whose meaning is a member of a larger category – for example, a crow and a robin are both types of bird, so crow and robin are both hyponyms of bird, and crow and robin are co-hyponyms of each other, but they are not synonyms of each other). Peirsman et al. looked at raw proximity measures as well as proximity measures that incorporate syntactic dependency information. Their findings demonstrate that in Dutch, synonymy and hyponymy are more readily indicated by proximity analyses that include syntactic dependency. On the other hand, they show that co-hyponymy is most effectively evidenced by raw proximity measures that do not include syntactic information. This finding is a startling result, with fascinating implications for linguistic theory. Why should ignoring syntactic information provide better measures of co-hyponymy? Might English be similar? How about Early Modern English?

I think it is important to note that in Peirsman et al. (ibid.), 6.3% of words that share similar distributional characteristics with a given word, or node, are synonyms with that node, and 4.0% are hyponyms of that node. Put differently, about 94% of words identified by distributional analysis aren’t synonyms, and round 70% of the words elicited in these measures are not semantically related to the node at all. Experienced corpus semanticists will not be surprised by this. But what happens to the majority of words, which aren’t related in any clear way? A computer algorithm will output all significant co-occurrences. Often, the co-occurrences that are not intuitively meaningful are quietly ignored by the researcher. It seems to me that if we are going to ignore such outputs, we must do so explicitly and with complete transparency. But this raises bigger questions: If we trust our methods, why should we ignore counterintuitive outputs? Or are these methods valuable simply as reproducible heuristics? I would argue that we should be transparent about our perspective on our own methods.

Also from QLVL, Heylen et al. (2008a) tests which types of syntactic dependency relations are most effective at indicating synonymy in Dutch nouns, and finds that Subject and Object relations most consistently indicate synonymy, but that adjective modification can give the best (though less consistent) indication of synonymy. In fact, adjective modification can be even better than a combined method using adjective modification and Subject/Object relations. Again, the findings are startling, and fascinating—why would the consideration of Subject/Object relations actually hinder the effective use of adjective modification as evidence of synonymy? The answer is not entirely clear. In a comparable study, Van der Plas and Bouma (2005) found Direct Object relations and adjective modification to be the most effective relations in identifying synonymy in Dutch. Unlike Heylen et al.’s (2008a) findings, Van der Plas and Bouma (2005) found that combining dependency relations improved synonym identification.

Is proximity data more effective in determining the semantics of highly frequent words? Heylen et al. (2008b) showed that in Dutch, high frequency nouns are more likely to collocate within +/-3 words with nouns that have a close semantic similarity, in particular synonyms and hyponyms. Low frequency nouns are less likely to do so. In addition, in Dutch, syntactic information is the best route to identifying synonymy and hyponymy overall, but raw proximity information is in fact slightly better at retrieving synonyms for medium-frequency nouns. This finding, then, elaborates on the finding in Peirsman et al. (2007; above).

How about word class? Peirsman et al. (2008) suggest, among other things, that in Dutch, a window of +/-2 words best identifies semantic similarity for nouns, while +/-4 to 7 words is most effective for verbs.

For Linguistic DNA, it is important to know exactly what we can and can’t expect to determine based on distributional analysis. We plan to employ distributional analysis using a range of proximity windows as well as syntactic information. The team will continue to report on this question as we move forward.

*Castle Arenberg, in the photo above, is part of KU Leuven, home of QLVL and many of the studies cited in this post. (Credit: Juhanson. Licence: CC BY-SA 3.0.)

References

Heylen, Kris; Peirsman, Yves; Geeraerts, Dirk. 2008a. Automatic synonymy extraction: A Comparison of Syntactic Context Models. In Verberne, Suzan; van Halteren, Hans; Coppen, Peter-Arno (eds), Computational linguistics in the Netherlands 2007. Amsterdam: Rodopi, 101-16.

Heylen, Kris; Peirsman, Yves; Geeraerts, Dirk; Speelman, Dirk. 2008b. Modelling word similarity: An evaluation of automatic synonymy extraction algorithms. In: Calzolari, Nicoletta; Choukri, Khalid; Maegaard, Bente; Mariani, Joseph; Odjik, Jan; Piperidis, Stelios; Tapias, Daniel (eds), Proceedings of the Sixth International Language Resources and Evaluation. Marrakech: European Language Resources Association, 3243-49.

Peirsman, Yves; Heylen, Kris; Speelman, Dirk. 2007. Finding semantically related words in Dutch. Co-occurrences versus syntactic contexts. In Proceedings of the 2007 Workshop on Contextual Information in Semantic Space Models: Beyond Words and Documents, 9-16.

Peirsman, Yves; Heylen, Kris; Geeraerts, Dirk. 2008. Size matters: tight and loose context definitions in English word space models. In Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics, 34-41.

Turney, Peter D. and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research 37, 141-188.

van der Plas, Lonneke and Gosse Bouma. 2005. Syntactic Contexts for finding Semantically Similar Words. In Proceedings of CLIN 04.

Family and Friends in 18th century England (book cover)

Naomi Tadmor: Semantic analysis of keywords in context

Family and Friends in 18th century England (book cover)On 30 October, Prof. Naomi Tadmor led a workshop at the University of Sheffield, hosted by the Sheffield Centre for Early Modern Studies. In what follows, I briefly summarise Tadmor’s presentation, and then provide some reflections related to my own work, and to Linguistic DNA.

The key concluding points that Tadmor forwarded are, I think, important for any work with historical texts, and thus also crucial to historical research:

  • Understanding historical language (including word meaning) is necessary for understanding historical texts
  • To understand historical language we must analyse it in context.
  • Analysing historical language in context requires close reading.

Whether we identify as historians, linguists, corpus linguists, literary scholars, or otherwise, we would do well to keep these points in mind.

Tadmor’s take on historical keywords

Tadmor’s specific arguments in the master class focused on kinship terms. In Early Modern English (EModE), there was a broad array of referents for kinship terms such as brother, mother, father, sister, and associated terms like family and friend, which are not likely to be intuitive to a speaker of Present Day English (PDE). Evidence shows, for example, that family often referred to all of the individuals living in a household, including servants, to the possible exclusion of biological relations living outside of the household. The paper Tadmor asked us to read in advance (first published in 1996), supplemented with other examples at the masterclass, provides extensive illustrations of the nuance of family and other kinship terms.

In EModE, there was also a narrow range of semantic or pragmatic implications related to kinship terms: these meanings generally involved social expectations, social networks, or social capital. So, father could refer to ‘biological father’ or ‘father-in-law’ (or even ‘King’), and implied a relationship of social expectation (rather than, for example, a relationship of affection or intimacy, as might be implied in PDE).

By identifying both the array of referents and the implications or senses conveyed by these kinship terms, Tadmor provides a thorough illustration of the terms’ lexical semantics. We can see this method as being motivated by historical questions (about the nature of Early Modern relationships); driven in its first stage by lexicology (insofar as it begins by asking about words, their referents, and senses); and then, in a final stage, employing lexicological knowledge to analyse texts and further address the initial historical questions. Tadmor avoids circularity by using one data set (in her 1996 paper) to identify a hypothesis regarding lexical semantics, and another data set to test her hypothesis. What do these observations about lexical semantics tell us about history? As Tadmor notes, it is by identifying these meanings that we can begin to understand categories of social actions and relationships, as well as motivations for those actions and relationships. Perhaps more fundamentally, it is only by understanding semantics in historical texts, that we can begin to understand the texts meaningfully.

A Corpus Linguist’s take on Tadmor’s methods

Reflecting on Tadmor’s talk, I’m reminded of the utility of the terms semasiology and onomasiology. In semantic research, semasiology is an approach which examines a term as an object of inquiry, and proceeds to identify the meanings of that word. Onomasiology is an approach which begins with a meaning, and then identifies the various terms for expressing it. Tadmor’s method is largely semasiological, insofar as it looks at the meanings of the term family and other kinship terms. This approach begins in a relatively straightforward way—find all of the instances of the word (or lemma), and you can then identify its various senses. The next step is more difficult: how do you distinguish its senses? In linguistics, a range of methods is available, with varying degrees of rigour and reproducibility, and it is important that these methods be outlined clearly. Tadmor’s study is also onomasiological, as she compares the different ways (often within a single text) of referring to a given member of the household family. This approach is less straightforward: how do you identify each time a member of the family is referred to? Again, a range of methods is available, each with its own advantages and disadvantages. A clear statement and justification of the choice of method renders any study more rigorous. In my experience, the systematicity of thinking in terms of onomasiology and semasiology is useful in developing a systematic and rigorous study.

Semasiology and onomasiology allow us to distinguish types of study and approaches to meaning, which can in turn help render our methods more explicit and clear. Similarly, distinguishing editorially between a word (e.g. family) and a meaning (e.g. ‘family’) is useful for clarity. Indeed, thinking methodologically in terms of semasiology and onomasiology encourages clarity of expression editorially regarding terms and meanings. In Tadmor’s 1996 paper, double quotes (e.g. “family”) are used to refer to either the word family or the meaning ‘family’ at various points. At times, such a paper could be rendered more clear, it seems to me, by adopting consistent editorial conventions like those used in linguistics (e.g. quotes or all caps for meanings, italics for terms). The distinction between a term and a meaning is by nature not always clear or certain: that difficulty is all the more reason for journals to adhere to rigorously defined editorial conventions.

From the distinction between terms and concepts, we can move to the distinction between senses and referents. It is important to be explicit both about changes in referent and changes in sense, when discussing semasiological change. For example, as historians and linguists, we must be sure that when we identify changes in a word’s referents (e.g. father referring to ‘father-in-law’), we also identify whether there are changes in its sense (e.g. ‘a relationship of social expectation’ or ‘a relationship of affection and intimacy’). When Thomas Turner refers to his father-in-law as father, he seems to be using the term, as identified by Tadmor, in its Early Modern sense implying ‘a relationship of social expectation’ rather than in the possible PDE sense implying ‘a relationship of affection and intimacy’. The terms referent and sense allow for this distinction, and are useful in practice when conducting this kind of semantic analysis.

Of course, if a term becomes polysemous, it can be applied to a new range of referents, with a new sense, or even with new implicatures or connotations. For example, we can imagine (perhaps counterfactually) a historical development in which family might have come to refer to cohabitants who were not blood relations. At the same time, in referring to those cohabitants who were not blood relations, family might have ceased to imply any kind of social expectation, social network, or social capital. That is, it’s possible for both the referent and the sense to change. In this case, as Tadmor has shown, that doesn’t seem to be what’s happened, but it’s important to investigate such possible polysemies.

Future possibilities: Corpus linguistics

As a corpus linguist, I’d be interested in investigating Tadmor’s semantic findings via a quantitative onomasiological study, looking more closely at selection probabilities. Such a study could ask research questions like:

  • Given that an Early Modern writer is expressing ‘nuclear family’, what is the probability of using term a, b, etc., in various contexts?
  • Given that a writer is expressing ‘household-family’, what is the probability of using term a, b, etc., in various contexts?
  • Given that a writer is expressing ‘spouse’s father’ or ‘brother’s sister’, etc., what is the probability of using term a, b, etc., in various contexts?

These onomasiological research questions (unlike semasiological ones) allow us to investigate logical probabilities of selection processes. This renders statistical analyses more robust. Changes in probabilities of selection over time are a useful illustration of onomasiological change, which is an essential part of semantic change.

And for Linguistic DNA?

For Linguistic DNA, I see (at least) two major questions related to Tadmor’s work:

  1. Can automated distributional analysis uncover the types of phenomena that Tadmor has uncovered for family?
  2. What is a concept for Tadmor, and how can her work inform our notion of a concept?

In response to the first question, it is certainly possible that distributional analysis can reflect changing referents (such as ‘father-in-law’ referred to as father). Hypothetically, the distribution of father with a broad array of referents might entail a broad array of lexical co-occurrences. In practice, however, this might be very, very difficult to discern. Hence Tadmor’s call for close reading. It is perhaps more likely that the sense (as opposed to referent) of father as ‘a relationship involving social expectations’ might be reflected in co-occurrence data: hypothetically, father might co-occur with words related to social expectation and obligation. We have evidence that semantically related words tend to constitute only about 30% of significant co-occurrences. Optimistically, it might be that the remaining 70% of words do suggest semantic relationships, if we know how to interpret them—in this case, maybe some co-occurrences with family would suggest the referents or implications discussed here. Pessimistically, it might be that if only 30% of co-occurring words are semantically related, then there would be an even lower probability of finding co-occurring words that reveal such fine semantic or pragmatic nuances as these. Thanks to Tadmor’s work, Linguistic DNA might be able to use family as a test case for what can be revealed by distributional analysis.

What is a concept? Tadmor (1996) doesn’t define concept, and sometimes switches quickly, for example, between discussing the concept ‘family’ and the word family, which can be tricky to follow. At times, concept for Tadmor seems to be similar to definition—a gloss for a term. At other times, concept seems to be broader, suggesting something perhaps with psycholinguistic reality, a sort of notion or idea held in the mind that might relate to an array of terms. Or, concept seems to relate to discourses, to shared social understandings that are shaped by language use. Linguistic DNA is paying close attention to operationalising and/or defining concept in its approach to conceptual and semantic change in EModE. Tadmor’s work points in the same direction that interests us, and the vagueness of concept which Tadmor engages with is vagueness that we are engaging with as well.

Distributional Semantics I: What might distribution tell us about word meaning?

Residence

Distributional Semantics I: What might distribution tell us about word meaning?

In a previous post, I asked ‘What is the link between corpus data showing lexical usage, on the one hand, and lexical semantics or concepts, on the other?’ In this post, I’d like to forward that discussion by addressing one component of it: how we observe lexical semantics (or word meaning) via distributional data in texts. That is, how do we know what we know about semantics from distributional data?

Linguists use proximity data from corpora to analyse everything from social implications of discourse, to politeness in pragmatics, to synonymy and hyponymy. Such data is also used by researchers in statistical natural language processing (NLP) for information retrieval, topic identification, and machine learning, among other things. Different researchers tend to use such data towards different ends: for some NLP researchers, it is enough to engineer a tool that produces satisfactory outputs, regardless of its implications for linguistic theory. For sociolinguists and discourse analysts, the process is often one of identifying social or behavioural trends as represented in language use (cf. Baker et al. 2013, Baker 2006). Despite the popularity of studies into meaning and corpora, the question of precisely what sorts of meaning can or can’t be indicated by such data remains remarkably under-discussed.

So, what aspects of meaning, and of word meaning in particular, might be indicated by proximity data?

Many introductory books on corpus semantics would seem to suggest that if you want to know what kinds of word meaning can be indicated by proximity data and distributional patterns, examining a list of co-occurring words, or words that occur in similar contexts, is a good start. Often, the next step (according to the same books) is to look closely at the words in context, and then to perform a statistical analysis on the set of co-occurrences. The problem arises in the last step. All too often, the results are interpreted impressionistically: which significant co-occurrences are readily interpretable in relation to your research questions? You may see some fascinating and impressive things, or you may not, and it’s too easy to disregard outputs that don’t seem relevant on the surface.

An operation like that described above lacks rigour in multiple ways. To disregard outputs that aren’t obviously relevant is to ignore what is likely to be some of the most valuable information in any corpus study (or in any scientific experiment). In addition, the method skips the important step of accounting for the precise elements of meaning in question, and how (or indeed whether) those elements might be observed in the outputs.

In Early Modern English, an analysis of proximity data might (hypothetically) show a significant similarity between the terms abode and residence. Such pairs are straightforward and exciting: we can readily see that we have automatically identified near-synonyms.

Often, researchers are looking to identify synonymy. But that’s not all: researchers might also be after hyponymy, co-hyponymy, antonymy, meronymy, auto-hyponymy, polysemy, or conceptual or discursive relations). In addition, as Geeraerts (2010: 178) points out, we might want to find out specific details about what a noun referent looks like, for example. Can we retrieve any of that information (reliably or consistently) from distributional data, i.e. from co-occurrences in texts?

Examples like abode and residence aren’t the norm. We also see examples like build and residence. What is the meaning relation here? Action and undergoer? A conceptual field related to building residences? Something else entirely?

And what about other pairs of terms with no clear semantic relation whatsoever? Do we disregard them? Impressionistically, it’s easy to pick out the instances of synonymy, or even relationships like Action/Undergoer or Agent/Patient, and to ignore the huge number of semantically unrelated collocates (or collocates with less obvious relations). But that’s not a terribly rigorous method.

By definition, we know that in proximity data, we are observing words that co-occur. Which leaves us to test what kinds of semantic relations are actually indicated, quantitatively, by co-occurrence. This moves us from the vague statement that words are known by the company they keep, towards a scientific account of the relationship between co-occurrence and semantic relations. In the next post (coming soon), I report on exactly that.

References

Baker, P. (2006) Using Corpora in Discourse Analysis. London: Continuum.

Baker, P. Gabrielatos, C. and McEnery. T. (2013) Discourse Analysis and Media Attitudes: The Representation of Islam in the British Press. Cambridge: Cambridge University Press

Geeraerts, Dirk. 2010. Theories of Lexical Semantics. Oxford: Oxford University Press.