
What does EEBO represent? Part II: Corpus linguistics and representativeness

What exactly does EEBO represent? Is it representative?

Often, the question of whether a corpus or data set is representative is answered first by describing what the corpus does and does not contain. What does EEBO contain? As Iona Hine has explained here, EEBO contains Early Modern English, but in some ways it is much larger than that, and in others much more limited. EEBO contains texts in many languages other than English, printed in the British Isles (and beyond) between 1476 and 1700. But EEBO is also limited: it contains only print, whereas Early Modern English was also hand-written and spoken, across a large number of varieties.

Given that EEBO contains Early Modern print, does EEBO represent Early Modern print? In order to address this question meaningfully, it’s crucial first to define representativeness.

In corpus linguistics, as in other data sciences and in statistics, representativeness is a relationship that holds between a sample and a population. A sample represents a larger population if it was obtained rigorously and systematically in relation to a well-defined population. If it was not obtained in that way, it is an arbitrary sample or a convenience sample. Representativeness allows us to examine the sample and then draw conclusions about the population. This is a fundamental element of inferential statistics, which is used in fields from epidemiology to corpus linguistics.

Was EEBO sampled systematically and rigorously in relation to a well-defined population? Not at all. EEBO was sampled arbitrarily, by convenience – first, including only texts that have (arbitrarily) survived; then including texts that were (arbitrarily) available for scanning and transcription; and, finally, including those texts that were (arbitrarily) of interest to scholars involved with EEBO at the time. Could we, perhaps, argue that EEBO represents Early Modern print that survived until the 21st century, was available for scanning and transcription, and (in many cases) was of interest to scholars involved with the project at the time? I think we would have to concede that EEBO wasn’t sampled systematically and rigorously in relation to that definition, and that the arbitrary elements of that population render it ill-defined.

So, what does EEBO represent? Nothing at all.

It’s difficult, therefore, to test research questions using inferential statistics. For example, we might be interested in asking: Do preferences for the near-synonyms civil, public, and civic change over time in Early Modern print? We can pursue such a question in a straightforward way, looking at frequencies of each word over time, in context, to see if there are changes in use, with each word rising or falling in frequency. In fact, we can quite reliably discern what happens to these preferences within EEBO. But our question, as stated, was about Early Modern print. It is the quantitative step from the sample (EEBO) to the population (Early Modern print) that is problematic. Suppose that we do find a shifting preference for each of these words over time. Because EEBO doesn’t represent the population of Early Modern print in any clear way, we can’t rely on statistics to tell us whether this is in fact a correlation between preferences and time, or whether it is, instead, an artefact of the arbitrariness of the sampling. The observation might be due to any number of textual or sociolinguistic variables that were left undefined in our arbitrary sample – including variation in topics, or genres, or authorial style, or even authors’ gender, age, education, or geographic profile.
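To make the descriptive first step concrete, here is a minimal sketch (in Python, and not the project’s actual pipeline) of counting normalised frequencies of the three words per decade within the sample itself; the documents structure of (year, tokens) pairs is an assumption for the example. The difficulty discussed above begins only when we try to read such figures as facts about Early Modern print rather than about EEBO.

```python
from collections import Counter, defaultdict

# Hypothetical input: an iterable of (year, tokens) pairs, tokens lower-cased.
TERMS = {"civil", "public", "civic"}

def frequencies_by_decade(documents):
    """Frequency of each term per decade, normalised per 10,000 tokens."""
    counts = defaultdict(Counter)   # decade -> Counter of term occurrences
    totals = defaultdict(int)       # decade -> total token count
    for year, tokens in documents:
        decade = (year // 10) * 10
        totals[decade] += len(tokens)
        for tok in tokens:
            if tok in TERMS:
                counts[decade][tok] += 1
    return {
        decade: {t: 10_000 * counts[decade][t] / totals[decade] for t in TERMS}
        for decade in sorted(totals) if totals[decade]
    }

# e.g. frequencies_by_decade([(1581, ["the", "civil", "warres"]), ...])
# The output describes EEBO itself; it licenses no inference about Early Modern print.
```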

It is as though we were testing children’s medication on an arbitrary group of people who happened to be walking past the hospital on a given day. That’s clearly a problem. We want to be sure that children’s medication is tested on children – but not simply on children, because we also want to be sure that it isn’t tested on children arbitrarily sampled, for example, from an elite after-school athletics programme for 9-year-olds that happens to be adjacent to the hospital. We want the medication to be tested on a systematic cross-section of children, or on a group of children that we know is composed of more and less healthy kids across a defined age range, so that we can draw conclusions about all children, based on our sample. If we use a statistical analysis of EEBO (an arbitrary sample) to draw conclusions about Early Modern print (a population), it’s as though we’re using an arbitrary sample of available kids to prove that a medication is safe for the population of all kids. (Linguistics is a lot safer than epidemiology.)

If one were interested in reliably representing extant Early Modern print, one might design a representative sample in various ways. It would be possible to systematically identify genres or topics or even text lengths, and to ensure that all were sampled. If we took on such a project, we might also want to ensure that all genders, education levels, and so on were sampled (indeed, historical English corpora such as the Corpus of English Dialogues, or ARCHER, are systematically sampled in clear ways). We would need to take decisions about proportionality – if we’re interested in comparing the writing of men and women, for example, we might want large, equal samples of each group. But if we wanted proportional representation across the entire population of writers, we might include a majority of men, with a small proportion of women – reflecting the bias in Early Modern publishing. Or, we might go further and attempt to represent not the bias in Early Modern publication, but instead the bias in Early Modern reception, attempting to represent how many readers actually read women’s works compared to men’s works (though such metadata isn’t readily available, and obtaining it would be a project in itself). Each of these decisions might be appropriate for different purposes.
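The contrast between equal and proportional sampling can be illustrated with a toy sketch in Python. It is purely illustrative: the author_gender metadata field and the record structure are assumptions, and real Early Modern metadata would be far messier.

```python
import random

def equal_sample(records, per_group=500, seed=42):
    """Equal-sized samples per group: suited to comparing the groups directly."""
    rng = random.Random(seed)
    out = []
    for gender in ("female", "male"):   # hypothetical metadata values
        group = [r for r in records if r.get("author_gender") == gender]
        out.extend(rng.sample(group, min(per_group, len(group))))
    return out

def proportional_sample(records, total=1000, seed=42):
    """A simple random sample of the whole set approximates the population's
    own (biased) make-up, e.g. far more male than female authors."""
    rng = random.Random(seed)
    return rng.sample(records, min(total, len(records)))
```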

So, what are we to do? LDNA hasn’t thrown stats out the window, nor have we thrown EEBO out the window. But we are careful to remember that our statistics are describing EEBO rather than indicating conclusions about a broader population. And we haven’t stopped there – we will draw conclusions about Early Modern print, but not via statistics, and not simply via the sample that is EEBO. Instead, we will draw such conclusions as close readers, linguists, philologists, and historians. We will use qualitative tools and historical, social, cultural, political, and economic insights about Early Modern history, in systematic and rigorous ways. Our intention is to read texts and contexts, and to evaluate those contexts in relation to our own knowledge about history, society, and culture. In other words, we are taking a principled interpretive leap from EEBO to Early Modern print. That leap is necessary, because there’s no inherent representative connection between the two.


What does EEBO represent? Part I: sixteenth-century English

Ahead of the 2016 Sixteenth Century Conference, Linguistic DNA Research Associate Iona Hine reflected on the limits of what probing EEBO can teach us about sixteenth-century English. This is the first of two posts addressing the common theme “What does EEBO represent?”


The 55 000 transcriptions that form EEBO-TCP are central to LDNA’s endeavour to study concepts and semantic change in early modern English. But do they really represent the “universe of English printed discourse”?

The easy answer is “no”. For several reasons:

As is well documented elsewhere, EEBO is not restricted to English-language texts (cf. e.g. Gadd).  Significant bodies of Latin and French documents printed in Britain have been transcribed, and one can browse through a list of other languages identified using ProQuest’s advanced search functionality. To this extent, EEBO represents more than the “universe of English printed discourse”.

But it also represents a limited “universe”. EEBO can only represent what survived to be catalogued. Its full image records represent individual copies. And its transcriptions represent a further subset of the survivals. As the RA currently occupied with reviewing Lost Books (eds. Bruni & Pettegree),* I have a keen awareness of the complex patterns of survival and loss. A prestigious reference work, the must-buy for ambitious libraries, might have had a limited print run and yet be almost guaranteed survival, however much it was actively consulted. A popular textbook, priced for individual ownership, would have much higher rates of attrition: dog-eared, out-of-date, disposable. Survival favours some genres over others, and there will be gaps in the English EEBO can represent.

The best function of the “universe” tagline is its emphasis on print. We have limited access to the oral cultures of the past, though as Cathy Shrank’s current project and the Corpus of English Dialogues demonstrate, there are constructions of orality within EEBO. Equally, where correspondence was set in print, it forms part of EEBO-TCP. There is diversity within EEBO, but it is an artefact that relies on the prior act of printing (and on bibliography, microfilm, digitisation, and transcription, to be sure). It will never represent what was not printed (which means that particular underprivileged Englishes are minimally visible).

There is another dimension of representativeness that matters for LDNA. Drawing on techniques from corpus linguistics makes us aware that corpora – collections of texts produced in order to control the analysis of language-in-use – have traditionally been compiled with considerable attention to the sampling and weighting of different text types. Those using them could be confident about what was in there (journalism? speech? novels?). Do we need that kind of familiarity to work confidently with EEBO-TCP? The question is great enough to warrant a separate post!

The points raised so far have focused on the whole of EEBO. There is an additional challenge when we consider how well EEBO can represent the sixteenth century. Of the ca. 55 000 texts in EEBO-TCP, only 4826 (less than 10 per cent) represent works printed between 1500 and 1599. If we operate with a broader definition, the ‘long sixteenth century’, and impose the limits of the Short Title Catalogue, the period 1470-1640 constitutes less than 25 per cent of EEBO-TCP (12 537 works). And some of those will be in Latin and French!

Of course, some sixteenth-century items may be long texts – and the bulging document count of the 1640s is down to the transcription of several thousand short pamphlets and tracts – so the true weighting of the long sixteenth century within TCP may be greater than the document figures indicate. Yet the statistics are sufficient to suggest we proceed with caution. While one could legitimately posit that the universe of English discourse was itself smaller in the sixteenth century – given the presence of Latin as the scholarly lingua franca – it is equally the case that the evidence has had longer to go missing.
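As a rough illustration of that document-count versus token-count comparison, consider the following Python sketch; the records structure, with its year and n_tokens fields, is hypothetical, and mapping real EEBO-TCP metadata onto it would be a task in itself.

```python
def period_share(records, start=1470, end=1640):
    """Compare a period's share of documents with its share of running words."""
    in_period = [r for r in records if start <= r["year"] <= end]
    doc_share = len(in_period) / len(records)
    tok_share = (sum(r["n_tokens"] for r in in_period)
                 / sum(r["n_tokens"] for r in records))
    return doc_share, tok_share   # the two proportions can differ substantially
```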

As a first post on the theme, this only scratches the surface of the discussion about representativeness and limits. Other observations jostle for attention. (For example, diachronic analysis of EEBO material is often dependent on metadata that privileges the printing date, though that may be quite different from the date of composition. A sample investigation of translate’s associations immediately uncovered a fourteenth-century bible preface printed in the 1550s, exposed by the recurrence of the Middle English forms “shulen” and “hadden”.) Articulating and exploring what EEBO represents is a task of some complexity. Thank goodness we’ve another 20 months to achieve it!


* Read the full Linguistic DNA review here. The e-edition of Bruni & Pettegree’s volume became open access in 2018.

LDNA’s first year: Reflections from RA Seth Mehl

In wrapping up the first year of LDNA, I’ve taken a moment to consider some of the over-arching questions that have occupied much of my creative and critical faculties so far. What follows is a personal reflection on some issues that I’ve found especially exciting and engaging.

Semantics and concepts

The Linguistic DNA project sets out to identify ‘semantic and conceptual change’ in Early Modern English texts, with attention to variation too, particularly in the form of semantic and conceptual variation across text types. The first questions, for me, then, were what exactly constitutes semantics and what we mean when we say concept. These are, in part, abstract questions, but they must also be defined in terms of practical operations for computational linguistics. Put differently, if semantics and concepts are not defined in terms of features that can be identified automatically by computer, then the definitions are not terribly useful for us.

My first attempt at approaching semantics and concepts for the project began with synonymy, then built up to onomasiological relationships, and then defined concepts as networks of onomasiological relationships. Following Kris Heylen’s visit, I realised just how similar this approach was to the most recent QLVL work. My next stab at approaching these terms moved towards an idea of encyclopaedic meaning inspired in part by the ‘encyclopaedic semantics’ of Cognitive Linguistics, and related to sets of words in contexts of use. This approach seemed coherent and effective. We have since come to define concepts, for our purposes, as discursive, operating at a level larger than syntactic relations, phrases, clauses, or sentences, but smaller than an entire text (and therefore dissimilar from topic modelling).

Operations

Given that the project started without definitions of semantics and concept, it follows that the operations for identifying them had not been laid out either. As a corpus semanticist, the natural start for me was to sort through corpus methods for automatic semantic analysis, including collocation analysis, second-order collocations, and vector space models. We continue to explore those methods by sorting through various parameters and variables for each. Most importantly, we are working to analyse our data in terms of linguistically meaningful probabilities. That is, we are thinking about the co-occurrence of words not simply as data points that might arise randomly, but as linguistic choices that are rarely, if ever, random. This requires us to consider how often linguistic events such as lexical co-occurrences actually arise, given the opportunity for them to arise. If we hope to use computational tools to learn about language, then we must certainly ensure that our computational approaches incorporate what we know about language, randomness, and probability.
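One familiar way of comparing how often co-occurrences actually arise with how often they would arise if words combined at random is pointwise mutual information. The sketch below is a minimal Python illustration of that general idea, not LDNA’s processor; the window-based input structure is an assumption.

```python
import math
from collections import Counter

def pmi(pair_count, count_a, count_b, n_windows):
    """log2 of the observed co-occurrence probability over the probability
    expected if the two words occurred independently of one another."""
    p_ab = pair_count / n_windows
    p_a = count_a / n_windows
    p_b = count_b / n_windows
    return math.log2(p_ab / (p_a * p_b))

def collocates(windows, node, min_pair=5):
    """windows: iterable of token lists (e.g. fixed-width context windows)."""
    n = 0
    word_counts = Counter()   # number of windows containing each word
    pair_counts = Counter()   # number of windows containing both node and word
    for window in windows:
        n += 1
        seen = set(window)
        word_counts.update(seen)
        if node in seen:
            pair_counts.update(seen - {node})
    scored = ((w, pmi(c, word_counts[node], word_counts[w], n))
              for w, c in pair_counts.items() if c >= min_pair)
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```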

Equally important was the recognition that although we are using corpus methods, we are not working with corpora, or at least not with corpora as per standard definitions. I define a corpus as a linguistic data-set sampled to represent a particular population of language users or of language in use. Corpus linguists examine language samples in order to draw conclusions about the populations they represent. EEBO and ECCO are, crucially, not sampled to represent populations—they are essentially arbitrary data sets, collected on the basis of convenience, of texts’ survival through history, and of scholarly interest and bias, among other variables. It is not at all clear that EEBO and ECCO can be used to draw rigorous conclusions about broader populations. Within the project, we often refer to EEBO and ECCO as ‘universes of printed discourse’, which renders them a sort of population in themselves. From that perspective, we can conclude a great deal about EEBO and ECCO, and the texts they contain, but it is tenuous at best to relate those conclusions to a broader population of language use. This is something that we must continually bear in mind.

Rather than seeing the LDNA processor as a tool for representing linguistic trends across populations, I have recently found it more useful to think of our processor primarily as a tool to aid in information retrieval: it is useful for identifying texts where particular discursive concepts appear. Our tools are therefore expected to be useful for conducting case studies of particular texts and sets of texts that exemplify particular concepts. In a related way, we use the metaphor of a topological map where texts and groups of texts exemplifying concepts rise up like hills from the landscape of the data. The processor allows us to map that topography and then ‘zoom in’ on particular hills for closer examination. This has been a useful metaphor for me in maintaining a sense of the project’s ultimate aims.
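A toy version of that information-retrieval framing might look like the sketch below (Python, purely illustrative; the concept word-set and the texts structure are assumptions): rank texts by how densely a concept’s words occur in them, so that the resulting ‘hills’ can then be read closely.

```python
def rank_texts_by_concept(texts, concept_terms):
    """texts: dict mapping text id -> list of lower-cased tokens."""
    scores = {}
    for text_id, tokens in texts.items():
        if not tokens:
            continue
        hits = sum(1 for tok in tokens if tok in concept_terms)
        scores[text_id] = hits / len(tokens)   # density rather than raw count
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# e.g. rank_texts_by_concept(texts, {"civil", "civic", "publick"})  # hypothetical word-set
```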

All of these topics represent ongoing developments for LDNA, and one of the great pleasures of the project has been the engaging discussions with colleagues about these issues over the last year.

A Theoretical Background to Distributional Methods (pt. 2 of 2)

Introduction

In the previous post, I presented the theoretical and philosophical underpinnings of distributional methods in corpus semantics. In this post, I touch on the practical background that has shaped these methods.

Means of analysis

The emergence of contemporary distributional methods occurs alongside the emergence of Statistical Natural Language Processing (NLP) in the 1990s. Statistical NLP relies on probabilistic methods to represent language, annotate terms in texts, or perform a number of additional tasks such as topic identification or information retrieval. By analysing what actually happens in huge numbers of texts, statistical NLP researchers not only describe naturally occurring language, but also model it and make predictions about it. Corpus semantics is crucially linked to that intellectual development in applied science; specifically, contemporary work with proximity measures and distributional methods in corpus semantics often employs the same computational tools and techniques employed in statistical NLP. The tools are shared, and the underlying stance is shared that a statistical and probabilistic account of language is meaningful. Arguably, other fields in the social sciences (such as psychology), and in the life sciences (such as evolutionary biology), have also been shaped by the rise in statistical and probabilistic methods of representation. Such methods represent an epistemology (and perhaps a discourse) that affects the types of knowledge that are sought and the types of observations that are made in a field.

Other links: Psycholinguistics and Discourse Analysis

The theoretical perspectives outlined above also link corpus semantics, proximity measures, and distributional methods to a larger theoretical framework that includes psycholinguistics and discourse analysis. Frequency of words in use, and frequency of co-occurrence in use, are hypothesised to be crucial to human learning and processing of lexical semantics. In very general terms, if we hear or read a word frequently, we’re likely to learn that word more readily, and once we’ve learned it, we’re likely to mentally process it more quickly. As noted above, corpora contain valuable frequency data for words in use in specific contexts. Today, corpora are often used as a counterpoint or complement to psycholinguistic research, and many researchers have attempted to model psycholinguistic processes using computational methods, including distributional semantics.

There has been a tremendous rise recently in discourse analysis using corpora, and its roots go back at least as far as Sinclair and Stubbs. Discourse analysis itself emerges largely from continental philosophical traditions, particularly Foucault’s definition of discourses as ‘practices which systematically form the objects of which they speak’. These practices are often linguistic, and are studied via linguistic acts, language in use in particular contexts. Such research connects the ontology of language as use with the ontology of meaning as encompassing all of the real-world contexts, topics, etc., that surround a term or a set of terms in use. Corpora allow researchers to ask: ‘Given that speakers or writers are discussing a given term, what other terms do the speakers or writers also discuss, and how do such discussions (as practices or acts) define the objects of which they speak?’

In conclusion

In order to make sense of proximity measures and distributional methods, it is important to grasp the underlying practicalities outlined above, and the broader theoretical framework to which these methods relate (discussed in a previous post). The idea that a word is known by the company it keeps is by no means an a priori fact, but is premised on a framework of linguistics that developed during the 20th century in relation to concurrent developments in philosophy, technology, and the sciences in general.

A theoretical background to distributional methods (pt. 1 of 2)

Introduction

When discussing proximity data and distributional methods in corpus semantics, it is common for linguists to refer to Firth’s famous “dictum”, ‘you shall know a word by the company it keeps!’ In this post, I look a bit more closely at the theoretical traditions from which this approach to semantics in contexts of use has arisen, and the theoretical links between this approach and other current work in linguistics. (For a synopsis of proximity data and distributional methods, see previous posts here, here, and here.)

Language as Use

Proximity data and distributional evidence can only be observed in records of language use, like corpora. The idea of investigating language in use reflects an ontology of language—the idea that language is language in use. If that basic definition is accepted, then the linguist’s job is to investigate language in use, and corpora constitute an excellent source of concrete evidence for language in use in specific contexts. This prospect is central to perhaps the greatest rift in 20th century linguistics: between, on the one hand, generative linguists who argued against evidence of use (as a distraction from the mental system of language), and, on the other hand, most other linguists, including those in pragmatics, sociolinguistics, Cognitive Linguistics, and corpus linguistics, who see language in use as the central object of study.

Dirk Geeraerts, in Theories of Lexical Semantics, provides a useful, concise summary of the theoretical background to distributional semantics using corpora. An explicit valuation of language in use can be traced through the work of the linguistic anthropologist Bronislaw Malinowski, who argued in the 1930s that language should only be investigated, and could only be understood, in contexts of use. Malinowski was an influence on Firth, who in turn influenced the next generation of British corpus linguists, including Michael Halliday and John Sinclair. Firth himself was already arguing in the 1930s that ‘the complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously’. Just a bit later, Wittgenstein famously asserted in Philosophical Investigations that linguistic meaning is inseparable from use, an assertion quoted by Firth and echoed by the philosopher of language John Austin, who was seminal in the development of linguistic pragmatics. Austin approached language as speech acts, instances of use in complex, real-world contexts, which could only be understood as such. The focus on language in use can subsequently be seen throughout later 20th-century developments in the fields of pragmatics and corpus research, as well as in sociolinguistics. Thus, some of the early theoretical work that facilitated the rise of corpus linguistics, and of distributional methods, can first be seen in the spheres of philosophy and even anthropology.

Meaning as Contingent, Meaning as Encyclopedic

In order to argue that lexical co-occurrence in use is a source of information about meaning, we must also accept a particular definition of meaning. Traditionally, it was argued that there is a neat distinction between constant meaning and contingent meaning. Constant meaning was viewed as the meaning related to the word itself, while contingent meaning was viewed as not related to the word itself, but instead related to broader contexts of use, including the surrounding words, the medium of communication, real-world knowledge, connotations, implications, and so on. Contingent meaning was by definition contributed by context; context is exactly what is examined in proximity measures and distributional methods. So although distributional methods are today generally employed to investigate semantics, they are in fact used to investigate an element of meaning that was traditionally considered peripheral, rather than central, to semantics.

In relation to this emphasis on contingent meaning, corpus linguistics has developed alongside the theory of encyclopedic semantics. In encyclopedic semantics, it is argued that any dividing line between constant and contingent meaning is arbitrary. Thus, corpus semanticists who use proximity measures and distributional approaches do not often argue that they are investigating contingent meaning. Instead, they may argue that they are investigating semantics, and that semantics in its contemporary (encyclopedic) sense is a much broader thing than in its more traditional sense.

Distributional methods therefore represent not only an ontology of language as use, but also an ontology of semantics as including what was traditionally known as contingent meaning.

To be continued…

Having discussed the theoretical and philosophical underpinnings of distributional methods here, I will go on to discuss the practical background of these methods in the next blog post.

Distributional Semantics I: What might distribution tell us about word meaning?


In a previous post, I asked ‘What is the link between corpus data showing lexical usage, on the one hand, and lexical semantics or concepts, on the other?’ In this post, I’d like to further that discussion by addressing one component of it: how we observe lexical semantics (or word meaning) via distributional data in texts. That is, how do we know what we know about semantics from distributional data?

Linguists use proximity data from corpora to analyse everything from social implications of discourse, to politeness in pragmatics, to synonymy and hyponymy. Such data is also used by researchers in statistical natural language processing (NLP) for information retrieval, topic identification, and machine learning, among other things. Different researchers tend to use such data towards different ends: for some NLP researchers, it is enough to engineer a tool that produces satisfactory outputs, regardless of its implications for linguistic theory. For sociolinguists and discourse analysts, the process is often one of identifying social or behavioural trends as represented in language use (cf. Baker et al. 2013, Baker 2006). Despite the popularity of studies into meaning and corpora, the question of precisely what sorts of meaning can or can’t be indicated by such data remains remarkably under-discussed.

So, what aspects of meaning, and of word meaning in particular, might be indicated by proximity data?

Many introductory books on corpus semantics would seem to suggest that if you want to know what kinds of word meaning can be indicated by proximity data and distributional patterns, examining a list of co-occurring words, or words that occur in similar contexts, is a good start. Often, the next step (according to the same books) is to look closely at the words in context, and then to perform a statistical analysis on the set of co-occurrences. The problem arises in the last step. All too often, the results are interpreted impressionistically: which significant co-occurrences are readily interpretable in relation to your research questions? You may see some fascinating and impressive things, or you may not, and it’s too easy to disregard outputs that don’t seem relevant on the surface.

An operation like that described above lacks rigour in multiple ways. To disregard outputs that aren’t obviously relevant is to ignore what is likely to be some of the most valuable information in any corpus study (or in any scientific experiment). In addition, the method skips the important step of accounting for the precise elements of meaning in question, and how (or indeed whether) those elements might be observed in the outputs.

In Early Modern English, an analysis of proximity data might (hypothetically) show a significant similarity between the terms abode and residence. Such pairs are straightforward and exciting: we can readily see that we have automatically identified near-synonyms.
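For illustration, a bare-bones distributional comparison might be sketched as follows (Python; the corpus structure and window size are assumptions, and a real study would weight the counts, for instance with PMI, rather than use raw frequencies).

```python
import math
from collections import Counter

def cooccurrence_vector(sentences, target, window=4):
    """Counts of words occurring within `window` tokens of the target word."""
    vec = Counter()
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok == target:
                lo, hi = max(0, i - window), i + window + 1
                vec.update(t for t in tokens[lo:hi] if t != target)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[t] * v[t] for t in set(u) & set(v))
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

# cosine(cooccurrence_vector(sents, "abode"), cooccurrence_vector(sents, "residence"))
# A high score signals similar contexts of use; it does not, by itself, establish synonymy.
```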

Often, researchers are looking to identify synonymy. But that’s not all: researchers might also be after hyponymy, co-hyponymy, antonymy, meronymy, auto-hyponymy, polysemy, or conceptual or discursive relations. In addition, as Geeraerts (2010: 178) points out, we might want to find out specific details about what a noun referent looks like, for example. Can we retrieve any of that information (reliably or consistently) from distributional data, i.e. from co-occurrences in texts?

Examples like abode and residence aren’t the norm. We also see examples like build and residence. What is the meaning relation here? Action and undergoer? A conceptual field related to building residences? Something else entirely?

And what about other pairs of terms with no clear semantic relation whatsoever? Do we disregard them? Impressionistically, it’s easy to pick out the instances of synonymy, or even relationships like Action/Undergoer or Agent/Patient, and to ignore the huge number of semantically unrelated collocates (or collocates with less obvious relations). But that’s not a terribly rigorous method.

By definition, we know that in proximity data we are observing words that co-occur. That leaves us to test what kinds of semantic relations are actually indicated, quantitatively, by co-occurrence, as sketched below. This moves us from the vague statement that words are known by the company they keep towards a scientific account of the relationship between co-occurrence and semantic relations. In the next post (coming soon), I report on exactly that.
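A minimal sketch of that quantitative step, assuming the top collocate pairs for a sample of node words have already been hand-annotated with relation labels (the labels and data structure here are hypothetical):

```python
from collections import Counter

def relation_profile(annotated_pairs):
    """annotated_pairs: iterable of (node, collocate, relation_label) triples,
    e.g. ("abode", "residence", "synonymy")."""
    counts = Counter(label for _, _, label in annotated_pairs)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.most_common()}
```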

References

Baker, P. (2006) Using Corpora in Discourse Analysis. London: Continuum.

Baker, P., Gabrielatos, C. and McEnery, T. (2013) Discourse Analysis and Media Attitudes: The Representation of Islam in the British Press. Cambridge: Cambridge University Press.

Geeraerts, D. (2010) Theories of Lexical Semantics. Oxford: Oxford University Press.