LDNA organised two panels at the 2016 Digital Humanities Congress (DHC; Sheffield, 8th-10th September). Both focused on text analytics, with the first adopting the theme ‘Between numbers and words’, and the second ‘Identifying complex meanings in historical texts’. Fraser reports:
What does EEBO represent? Part II: Corpus linguistics and representativeness
What exactly does EEBO represent? Is it representative?
Often, the question of whether a corpus or data set is representative is answered first by describing what the corpus does and does not contain. What does EEBO contain? As Iona Hine has explained here, EEBO contains Early Modern English, but it is in some ways much broader than that, and in others much more limited. EEBO contains texts in many languages besides English, printed in the British Isles (and beyond) between 1476 and 1700. But EEBO is also limited: it contains only print, whereas Early Modern English was also hand-written and spoken, across a large number of varieties.
Given that EEBO contains Early Modern print, does EEBO represent Early Modern print? In order to address this question meaningfully, it’s crucial first to define representativeness.
In corpus linguistics, as in other data sciences and in statistics, representativeness is a relationship that holds between a sample and a population. A sample represents a larger population if the sample was obtained rigorously and systematically in relation to a well-defined population. If the sample is not representative in this way, it is an arbitrary sample or a convenience sample – i.e. it was not obtained rigorously and systematically in relation to a well-defined population. Representativeness allows us to examine the sample and then draw conclusions about the population. This is a fundamental element of inferential statistics, which is used in fields from epidemiology to corpus linguistics.
Was EEBO sampled systematically and rigorously in relation to a well-defined population? Not at all. EEBO was sampled arbitrarily, by convenience – first, including only texts that have (arbitrarily) survived; then including texts that were (arbitrarily) available for scanning and transcription; and, finally, including those texts that were (arbitrarily) of interest to scholars involved with EEBO at the time. Could we, perhaps, argue that EEBO represents Early Modern print that survived until the 21st century, was available for scanning and transcription, and (in many cases) was of interest to scholars involved with the project at the time? I think we would have to concede that EEBO wasn’t sampled systematically and rigorously in relation to that definition, and that the arbitrary elements of that population render it ill-defined.
So, what does EEBO represent? Nothing at all.
It’s difficult, therefore, to test research questions using inferential statistics. For example, we might be interested in asking: Do preferences for the near-synonyms civil, public, and civic change over time in Early Modern print? We can pursue such a question in a straightforward way, looking at frequencies of each word over time, in context, to see if there are changes in use, with each word rising or falling in frequency. In fact, we can quite reliably discern what happens to these preferences within EEBO. But our question, as stated, was about Early Modern print. It is the quantitative step from the sample (EEBO) to the population (Early Modern print) that is problematic. Suppose that we do find a shifting preference for each of these words over time. Because EEBO doesn’t represent the population of Early Modern print in any clear way, we can’t rely on statistics to conclude that this is in fact a correlation between preferences and time – or whether it is, instead, an artefact of the arbitrariness of the sampling. The observation might be due to any number of textual or sociolinguistic variables that were left undefined in our arbitrary sample – including variation in topics, or genres, or authorial style, or even authors’ gender, age, education, or geographic profile.
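To make the descriptive step concrete, here is a minimal sketch of how one might track such preferences within EEBO itself. The corpus format (a list of year and token-list pairs) and the per-10,000-token normalisation are my own assumptions for illustration, not the project's actual pipeline.

```python
from collections import Counter, defaultdict

def relative_frequencies(documents, targets):
    """Per-decade relative frequencies (per 10,000 tokens) of target words.

    `documents` is a list of (year, tokens) pairs; tokens are assumed
    to be lower-cased already. Returns {decade: {word: freq_per_10k}}.
    """
    counts = defaultdict(Counter)   # decade -> Counter of target words
    totals = defaultdict(int)       # decade -> total token count
    for year, tokens in documents:
        decade = (year // 10) * 10
        totals[decade] += len(tokens)
        for tok in tokens:
            if tok in targets:
                counts[decade][tok] += 1
    return {
        decade: {w: 10000 * counts[decade][w] / totals[decade] for w in targets}
        for decade in sorted(totals)
    }

# Toy documents standing in for EEBO texts.
docs = [
    (1601, "the civil war and the public good".split()),
    (1605, "a public matter of civil life".split()),
    (1651, "the civic duty and civic virtue of the public".split()),
]
freqs = relative_frequencies(docs, {"civil", "public", "civic"})
```

The output describes only the sample in hand; nothing in this computation licenses an inference from those frequencies to Early Modern print as a whole.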
It is as though we were testing children’s medication on an arbitrary group of people who happened to be walking past the hospital on a given day. That’s clearly a problem. We want to be sure that children’s medication was tested on children – but not simply children, because we also want to be sure that it isn’t tested on children arbitrarily sampled, for example, from an elite after-school athletics programme for 9-year-olds that happens to be adjacent to the hospital. We want the medication to be tested on a systematic cross-section of children, or on a group of children that we know is composed of more and less healthy kids across a defined age range, so that we can draw conclusions about all children, based on our sample. If we use a statistical analysis of EEBO (an arbitrary sample) to draw conclusions about Early Modern print (a population), it’s as though we’re using an arbitrary sample of available kids to prove that a medication is safe for the population of all kids. (Linguistics is a lot safer than epidemiology.)
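The analogy can be made concrete with a toy simulation. The population below is entirely invented (the genre labels and feature rates are assumptions for illustration): a sample drawn at random tracks the population's true rate, while a convenience sample that over-represents one genre does not.

```python
import random

random.seed(42)

# Invented population: 9,000 'sermons' in which a feature occurs in ~5% of
# texts, and 1,000 'news' texts in which it occurs in ~40%.
population = (
    [("sermon", random.random() < 0.05) for _ in range(9000)]
    + [("news", random.random() < 0.40) for _ in range(1000)]
)

def feature_rate(sample):
    """Proportion of texts in the sample exhibiting the feature."""
    return sum(has_feature for _, has_feature in sample) / len(sample)

true_rate = feature_rate(population)

# Systematic sample: 500 texts drawn at random from the whole population.
random_sample = random.sample(population, 500)

# Convenience sample: whatever happened to be near the hospital door --
# here, 400 news texts and 100 sermons.
news = [t for t in population if t[0] == "news"]
sermons = [t for t in population if t[0] == "sermon"]
convenience_sample = news[:400] + sermons[:100]
```

With this setup the random sample's estimate lands close to the population rate, while the convenience sample's estimate is wildly inflated: no statistic computed on the convenience sample, however carefully, tells us about the population.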
If one were interested in reliably representing extant Early Modern print, one might design a representative sample in various ways. It would be possible to systematically identify genres or topics or even text lengths, and ensure that all were sampled. If we took on such a project, we might want to ensure sampling all genders, education levels, and so on (indeed, historical English corpora such as the Corpus of English Dialogues, or ARCHER, are systematically sampled in clear ways). We would need to take decisions about proportionality – if we’re interested in comparing the writing of men and women, for example, we might want large, equal samples of each group. But if we wanted proportional representation across the entire population of writers, we might include a majority of men, with a small proportion of women – reflecting the bias in Early Modern publishing. Or, we might go further and attempt to represent not the bias in Early Modern publication, but instead the bias in Early Modern reception, attempting to represent how many readers actually read women’s works compared to men’s works (though such metadata isn’t readily available, and obtaining it would be a project in itself). Each of these decisions might be appropriate for different purposes.
So, what are we to do? LDNA hasn’t thrown stats out the window, nor have we thrown EEBO out the window. But we are careful to remember that our statistics are describing EEBO rather than indicating conclusions about a broader population. And we haven’t stopped there – we will draw conclusions about Early Modern print, but not via statistics, and not simply via the sample that is EEBO. Instead, we will draw such conclusions as close readers, linguists, philologists, and historians. We will use qualitative tools and historical, social, cultural, political, and economic insights about Early Modern history, in systematic and rigorous ways. Our intention is to read texts and contexts, and to evaluate those contexts in relation to our own knowledge about history, society, and culture. In other words, we are taking a principled interpretive leap from EEBO to Early Modern print. That leap is necessary, because there’s no inherent representative connection between the two.
From Spring to Summer: LDNA on the road
For the past couple of months, our rolling horizon has looked increasingly full of activity. This new blogpost provides a brief update on where we’ve been and where we’re going. We’ll be aiming to give more thorough reports on some of these activities after the events.
Where we’ve been
In May, Susan, Iona and Mike travelled to Utrecht, at the invitation of Joris van Eijnatten and Jaap Verheul. Together with colleagues from Sheffield’s History Department, we presented the different strands of Digital Humanities work ongoing at Sheffield. We learned much from our exchanges with Utrecht’s AsymEnc and Translantis research programs, and enjoyed shared intellectual probing of visualisations of change across time. We look forward to continued engagement with each other’s work.
A week later, Seth and Justyna participated in This&THATCamp at the University of Sussex (pictured), with LDNA emerging second in a popular poll of topics for discussion at this un-conference-style event. Productive conversations across the two days covered data visualisation, data manipulation, text analytics, digital humanities and even data sonification. We hope to hear more from Julie Weeds and others when the LDNA team return to Brighton in September.
Next week, we’ll be calling on colleagues at the HRI to talk us through their experience visualising complex humanities data. Richard Ward (Digital Panopticon) and Dirk Rohman (Migration of Faith) have agreed to walk us through their decision-making processes, and talk through the role of different visualisations in exploring, analysing, and explaining current findings.
Where we’re going
The LDNA team are also gearing up for a summer of presentations:
- Justyna Robinson will be representing LDNA at Sociolinguistics Symposium (Murcia, 15-18 June), as well as sharing the latest analysis from her longitudinal study of semantic variation focused on polysemous adjectives in South Yorkshire speech. Catch LDNA in the general poster session on Friday (17th), and Justyna’s paper at 3pm on Thursday. #SS21
- Susan Fitzmaurice is in Saarland, as first guest speaker at the Historical Corpus Linguistics event hosted by the IDeaL research centre, also on Thursday (16th June) at 2:15pm. Her paper is subtitled “Discursive semantics and the quest for the automatic identification of concepts and conceptual change in English 1500-1800”. #IDeaL
- In July, the Glasgow LDNA team are Krakow-bound for DH2016 (11-16 July). The LDNA poster, part of the Semantic Interpretations group, is currently allocated to Booth 58 during the Wednesday evening poster session. Draft programme.
- Later in July, Iona heads to SHARP 2016 in Paris (18-22). This year, the bi-lingual Society are focusing on “Languages of the Book”, with Iona’s contribution drawing on her doctoral research (subtitle: European Borrowings in 16th and 17th Century English Translations of “the Book of Books”) and giving attention to the role of other languages in concept formation in early modern English (a special concern for LDNA’s work with EEBO-TCP).
- In August, Iona is one of several Sheffield early modernists bound for the Sixteenth Century Society Conference in Bruges. In addition to a paper in panel 241, “The Vagaries of Translation in the Early Modern World” (Saturday 20th, 10:30am), Iona will also be hosting a unique LDNA poster session at the book exhibit. (Details to follow)
- The following week (22-26 August), Seth, Justyna and Susan will be at ICEHL 19 in Essen. Seth and Susan will be talking LDNA semantics from 2pm on Tuesday 23rd.
Back in the UK, on 5 September, LDNA (and the University of Sussex) host our second methodological workshop, focused on data visualisation and linguistic change. Invitations to a select group of speakers have gone out, and we’re looking forward to a hands-on workshop using project data. Members of our network who would like to participate are invited to get in touch.
And back in Sheffield, LDNA is playing a key role in the 2016 Digital Humanities Congress, 8-10 September, hosting two panel sessions dedicated to textual analytics. Our co-speakers include contacts from Varieng and CRASSH. Early bird registration ends 30th June.
Conference report: Diachronic corpora and genre in Nottingham
On Friday 8 April 2016, Susan Fitzmaurice and Seth Mehl attended Diachronic corpora, genre, and language change at the University of Nottingham, where Seth gave a paper entitled Automatic genre identification in EEBO-TCP: A multidisciplinary perspective on problems and prospects. The event featured researchers from around the globe, exploring issues in historical data sets; the nature of genre and text types; and modelling diachronic change.
The day’s plenary speeches were engaging and insightful: Bethany Gray spoke about academic writing as a locus of linguistic change, in contrast to the common expectation that change originates in spoken language. This is particularly relevant for those of us working with older historical data, for which written language is our only evidence of change. Thomas Gloning described the Deutsche Textarchiv, and in particular the recent addition to that corpus of the Dingler Corpus, an essential record of written scientific German representing 1820 to 1932. Gloning presented the useful definition of text types or genres as ‘traditions of communicative action’. In analysing such text types, or traditions, it is possible to map syntax and lexis to text functions and topics, though Gloning cautions that some of the most important elements of such mapping are not currently achievable by machines. This is a careful, valuable perspective and approach, which relates to our own (as discussed below).
Other research papers included a presentation by Fabrizio Esposito who, like the Linguistic DNA project, is using distributional semantic methods. His work looks at recent change in White House Press Briefings. Bryan Jurish presented DiaCollo, a powerful tool for analysing and visualising collocation patterns as they change over time in very large data sets. Vaclav Brezina analysed lexical meaning in EEBO-TCP by measuring differences in collocation patterns across overlapping, sliding diachronic windows.
What did LDNA contribute?
LDNA is asking whether specific concepts emerge uniquely in particular genres, and whether and how those concepts are then adopted and adapted in other genres. Genre is a fuzzy concept, representing categories of texts. Such categories are characterised by formal features such as print layout, phonetics, morphosyntax, lexis, and semantics; and functional features such as purpose of composition, reader expectations, and social and cultural contexts. It is productive to distinguish approaches to genre in different contexts. For Early Modern Studies, categories may be inherited in the canon, and questioned and explored in relation to literature, history, or philosophical or cultural studies; corpus linguistics often seeks a scientifically reproducible approach to genre and aims to learn about language and variation; while Natural Language Processing (NLP) often aims to engineer tools for solving specific tasks. At the Nottingham conference, Seth illustrated his remarks by reflecting on Ted Underwood’s work automatically identifying genres in HathiTrust texts via supervised machine learning. He then laid out the project’s plan of investigating genre (or text types) by categorising Early Modern texts using the outputs of the LDNA processor, alongside other formal text features. This relates to Gloning’s aforementioned assertion that text topic and function might be mapped onto syntax and lexis; in our case, it is a combined mapping of discursive topics or conceptual fields, lexis, morphosyntax, and additional formal features such as the presence of foreign words or the density of punctuation or parts of speech that will allow us to group texts into categories in a relatively data-driven way.
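As a sketch of what grouping texts by formal features could look like in the simplest case: the two features below (punctuation density and the proportion of tokens on a foreign-word list) are deliberately crude stand-ins for the richer feature set described above, and the Latin word list is hypothetical.

```python
import string

def formal_features(text, foreign_words):
    """Return a crude formal-feature vector for a text:
    (punctuation density per character, foreign-token proportion)."""
    tokens = text.split()
    punct_density = sum(ch in string.punctuation for ch in text) / max(len(text), 1)
    foreign_prop = sum(
        tok.strip(string.punctuation).lower() in foreign_words for tok in tokens
    ) / max(len(tokens), 1)
    return (punct_density, foreign_prop)

# Hypothetical mini word list of Latin tokens.
latin = {"et", "cetera", "ergo", "deus"}

sermon = "Deus, ergo, is first; et cetera follows."
pamphlet = "The news from town is plain and short"

f_sermon = formal_features(sermon, latin)
f_pamphlet = formal_features(pamphlet, latin)
```

Vectors like these could then be fed to any standard clustering routine; the point of the sketch is only that formal features, once made countable, let texts group themselves without genre labels being imposed in advance.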
The conference was very well organised by Richard J. Whitt, with a lovely lunch and dinner in which attendees shared ideas and dug further into linguistic issues. Susan and Seth were delighted to participate.
LDNA’s first year: Reflections from RA Seth Mehl
In wrapping up the first year of LDNA, I’ve taken a moment to consider some of the over-arching questions that have occupied much of my creative and critical faculties so far. What follows is a personal reflection on some issues that I’ve found especially exciting and engaging.
Semantics and concepts
The Linguistic DNA project sets out to identify ‘semantic and conceptual change’ in Early Modern English texts, with attention to variation too, particularly in the form of semantic and conceptual variation across text types. The first questions, for me, then, were what exactly constitutes semantics and what we mean when we say concept. These are, in part, abstract questions, but they must also be defined in terms of practical operations for computational linguistics. Put differently, if semantics and concepts are not defined in terms of features that can be identified automatically by computer, then the definitions are not terribly useful for us.
My first attempt at approaching semantics and concepts for the project began with synonymy, then built up to onomasiological relationships, and then defined concepts as networks of onomasiological relationships. Following Kris Heylen’s visit, I realised just how similar this approach was to the most recent QLVL work. My next stab at approaching these terms moved towards an idea of encyclopaedic meaning inspired in part by the ‘encyclopaedic semantics’ of Cognitive Linguistics, and related to sets of words in contexts of use. This approach seemed coherent and effective. We have since come to define concepts, for our purposes, as discursive, operating at a level larger than syntactic relations, phrases, clauses, or sentences, but smaller than an entire text (and therefore dissimilar from topic modelling).
Given that the project started without a definition of semantics and concept, it follows that the operationalisation of identifying those terms had not been laid out either. As a corpus semanticist, the natural start for me was to sort through corpus methods for automatic semantic analysis, including collocation analysis, second-order collocations, and vector space models. We continue to explore those methods by sorting through various parameters and variables for each. Most importantly, we are working to analyse our data in terms of linguistically meaningful probabilities. That is, we are thinking about the co-occurrence of words not simply as data points that might arise randomly, but as linguistic choices that are rarely, if ever, random. This requires us to consider how often linguistic events such as lexical co-occurrences actually arise, given the opportunity for them to arise. If we hope to use computational tools to learn about language, then we must certainly ensure that our computational approaches incorporate what we know about language, randomness, and probability.
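One concrete way to treat co-occurrence probabilistically, sketched here in Python, is pointwise mutual information (PMI): it compares how often two words actually co-occur with how often independence would predict. This is offered as an illustration of the general idea, not as the project's actual processor.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(sentences):
    """Pointwise mutual information over sentence-level co-occurrence:
    PMI(a, b) = log2( p(a, b) / (p(a) * p(b)) ), with probabilities
    estimated as the share of sentences containing the word(s)."""
    n = len(sentences)
    word_counts = Counter()
    pair_counts = Counter()
    for sent in sentences:
        types = sorted(set(sent))       # count each word once per sentence
        word_counts.update(types)
        pair_counts.update(combinations(types, 2))
    return {
        (a, b): math.log2((c / n) / ((word_counts[a] / n) * (word_counts[b] / n)))
        for (a, b), c in pair_counts.items()
    }

sentences = [
    ["strong", "tea"],
    ["strong", "tea"],
    ["strong", "coffee"],
    ["weak", "tea"],
]
scores = pmi_scores(sentences)
```

In this toy data, 'strong' and 'coffee' co-occur more often than chance predicts (positive PMI), while 'strong' and 'tea', despite co-occurring twice, score slightly negative because both words are individually frequent: exactly the kind of distinction that raw co-occurrence counts miss.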
Equally important was the recognition that although we are using corpus methods, we are not working with corpora, or at least not with corpora as per standard definitions. I define a corpus as a linguistic data-set sampled to represent a particular population of language users or of language in use. Corpus linguists examine language samples in order to draw conclusions about the populations they represent. EEBO and ECCO are, crucially, not sampled to represent populations—they are essentially arbitrary data sets, collected on the basis of convenience, of texts’ survival through history, and of scholarly interest and bias, among other variables. It is not at all clear that EEBO and ECCO can be used to draw rigorous conclusions about broader populations. Within the project, we often refer to EEBO and ECCO as ‘universes of printed discourse’, which renders them a sort of population in themselves. From that perspective, we can conclude a great deal about EEBO and ECCO, and the texts they contain, but it is tenuous at best to relate those conclusions to a broader population of language use. This is something that we must continually bear in mind.
Rather than seeing the LDNA processor as a tool for representing linguistic trends across populations, I have recently found it more useful to think of our processor primarily as a tool to aid in information retrieval: it is useful for identifying texts where particular discursive concepts appear. Our tools are therefore expected to be useful for conducting case studies of particular texts and sets of texts that exemplify particular concepts. In a related way, we use the metaphor of a topological map where texts and groups of texts exemplifying concepts rise up like hills from the landscape of the data. The processor allows us to map that topography and then ‘zoom in’ on particular hills for closer examination. This has been a useful metaphor for me in maintaining a sense of the project’s ultimate aims.
All of these topics represent ongoing developments for LDNA, and one of the great pleasures of the project has been the engaging discussions with colleagues about these issues over the last year.
A Theoretical Background to Distributional Methods (pt. 2 of 2)
In the previous post, I presented the theoretical and philosophical underpinnings of distributional methods in corpus semantics. In this post, I touch on the practical background that has shaped these methods.
Means of analysis
The emergence of contemporary distributional methods occurs alongside the emergence of Statistical Natural Language Processing (NLP) in the 1990s. Statistical NLP relies on probabilistic methods to represent language, annotate terms in texts, or perform a number of additional tasks such as topic identification or information retrieval. By analysing what actually happens in huge numbers of texts, statistical NLP researchers not only describe naturally occurring language, but also model it and make predictions about it. Corpus semantics is crucially linked to that intellectual development in applied science; specifically, contemporary work with proximity measures and distributional methods in corpus semantics often employs the same computational tools and techniques employed in statistical NLP. The tools are shared, and the underlying stance is shared that a statistical and probabilistic account of language is meaningful. Arguably, other fields in the social sciences (such as psychology), and in the life sciences (such as evolutionary biology), have also been shaped by the rise in statistical and probabilistic methods of representation. Such methods represent an epistemology (and perhaps a discourse) that affects the types of knowledge that are sought and the types of observations that are made in a field.
Other links: Psycholinguistics and Discourse Analysis
The theoretical perspectives outlined above also link corpus semantics, proximity measures, and distributional methods to a larger theoretical framework that includes psycholinguistics and discourse analysis. Frequency of words in use, and frequency of co-occurrence in use, are hypothesised as crucial in human learning and processing of lexical semantics. In very general terms, if we hear or read a word frequently, we’re likely to learn that word more readily and once we’ve learned it, we’re likely to mentally process it more quickly. As noted above, corpora contain valuable frequency data for words in use in specific contexts. Today, corpora are often used as a counterpoint or complement to psycholinguistic research, and many researchers have attempted to model psycholinguistic processes using computational processes including distributional semantics.
There has been a tremendous rise recently in discourse analysis using corpora, and its roots go back at least as far as Sinclair and Stubbs. Discourse analysis itself emerges largely from continental philosophical traditions, particularly Foucault’s definition of discourses as ‘practices which systematically form the objects of which they speak’. These practices are often linguistic, and are studied via linguistic acts, language in use in particular contexts. Such research connects the ontology of language as use with the ontology of meaning as encompassing all of the real-world contexts, topics, etc., that surround a term or a set of terms in use. Corpora allow researchers to ask: ‘Given that speakers or writers are discussing a given term, what other terms do the speakers or writers also discuss, and how do such discussions (as practices or acts) define the objects of which they speak?’
In order to make sense of proximity measures and distributional methods, it is important to grasp the underlying practicalities outlined above, and the broader theoretical framework to which these methods relate (discussed in a previous post). The idea that a word is known by the company it keeps is by no means an a priori fact, but is premised on a framework of linguistics that developed during the 20th century in relation to concurrent developments in philosophy, technology, and the sciences in general.
A theoretical background to distributional methods (pt. 1 of 2)
When discussing proximity data and distributional methods in corpus semantics, it is common for linguists to refer to Firth’s famous “dictum”, ‘you shall know a word by the company it keeps!’ In this post, I look a bit more closely at the theoretical traditions from which this approach to semantics in contexts of use has arisen, and the theoretical links between this approach and other current work in linguistics. (For a synopsis of proximity data and distributional methods, see previous posts here, here, and here.)
Language as Use
Proximity data and distributional evidence can only be observed in records of language use, like corpora. The idea of investigating language in use reflects an ontology of language—the idea that language is language in use. If that basic definition is accepted, then the linguist’s job is to investigate language in use, and corpora constitute an excellent source of concrete evidence for language in use in specific contexts. This prospect is central to perhaps the greatest rift in 20th century linguistics: between, on the one hand, generative linguists who argued against evidence of use (as a distraction from the mental system of language), and, on the other hand, most other linguists, including those in pragmatics, sociolinguistics, Cognitive Linguistics, and corpus linguistics, who see language in use as the central object of study.
Dirk Geeraerts, in Theories of Lexical Semantics, provides a useful, concise summary of the theoretical background to distributional semantics using corpora. Explicitly, a valuation of language in use can be traced through the work of linguistic anthropologist Bronislaw Malinowsky, who argued in the 1930s that language should only be investigated, and could only be understood, in contexts of use. Malinowsky was an influence on Firth, who in turn influenced the next generation of British corpus linguists, including Michael Halliday and John Sinclair. Firth himself was already arguing in the 1930s that ‘the complete meaning of a word is always contextual, and no study of meaning apart from context can be taken seriously’. Just a bit later, Wittgenstein famously asserted in Philosophical Investigations that linguistic meaning is inseparable from use, an assertion quoted by Firth, and echoed by the philosopher of language John Austin, who was seminal in the development of linguistic pragmatics. Austin approached language as speech acts, instances of use in complex, real-world contexts, that could only be understood as such. The focus on language in use can subsequently be seen throughout later 20th-century developments in the fields of pragmatics and corpus research, as well as in sociolinguistics. Thus, some of the early theoretical work that facilitated the rise of corpus linguistics, and distributional methods, can first be seen in the spheres of philosophy and even anthropology.
Meaning as Contingent, Meaning as Encyclopedic
In order to argue that lexical co-occurrence in use is a source of information about meaning, we must also accept a particular definition of meaning. Traditionally, it was argued that there is a neat distinction between constant meaning and contingent meaning. Constant meaning was viewed as the meaning related to the word itself, while contingent meaning was viewed as not related to the word itself, but instead related to broader contexts of use, including the surrounding words, the medium of communication, real-world knowledge, connotations, implications, and so on. Contingent meaning was by definition contributed by context; context is exactly what is examined in proximity measures and distributional methods. So distributional methods are today generally employed to investigate semantics, but they are in fact used to investigate an element of meaning that was often not traditionally considered the central element of semantics, but instead a peripheral element.
In relation to this emphasis on contingent meaning, corpus linguistics has developed alongside the theory of encyclopedic semantics. In encyclopedic semantics, it is argued that any dividing line between constant and contingent meaning is arbitrary. Thus, corpus semanticists who use proximity measures and distributional approaches do not often argue that they are investigating contingent meaning. Instead, they may argue that they are investigating semantics, and that semantics in its contemporary (encyclopedic) sense is a much broader thing than in its more traditional sense.
Distributional methods therefore represent not only an ontology of language as use, but also an ontology of semantics as including what was traditionally known as contingent meaning.
To be continued…
Having discussed the theoretical and philosophical underpinnings of distributional methods here, I will go on to discuss the practical background of these methods in the next blog post.
Distributional Semantics II: What does distribution tell us about semantic relations?
In a previous post, I outlined a range of meanings that have been discussed in conjunction with distributional analysis. The Linguistic DNA team is assessing what exactly it can determine about semantics based on distributional analysis: from encyclopaedic meaning to specific semantic relations. In my opinion, the idea that distributional data indicates ‘semantics’ has generally been a relatively vague one: what exactly about ‘semantics’ is indicated? In this post, I’d like to clarify what distribution can tell us about semantic relations in particular, including synonymy, hyponymy, and co-hyponymy.
In the Natural Language Processing (NLP) sphere, numerous studies have tested the effectiveness of distributional data in identifying semantic relations. Turney and Pantel (2010) provide a useful survey of such studies, many of which involve machine learning, and computer performance on synonymy tests including those found on English language exams. Examples of success on synonymy tests have employed windows of anything from +/-4 words up to +/-150 words, but such studies tend not to test various approaches against each other, and they rarely dissect the notion of synonymy, much less co-hyponymy or other semantic relations.
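A minimal sketch of the window-based approach these studies build on: count co-occurrences within +/-N tokens to form context vectors, then compare words by cosine similarity. The toy sentence and the window size of 4 are arbitrary choices for illustration.

```python
import math
from collections import Counter, defaultdict

def context_vectors(tokens, window=4):
    """Map each word to a bag-of-words count of its +/-window neighbours."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[w] * v[w] for w in set(u) & set(v))
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

tokens = "the king ruled the land and the monarch ruled the sea".split()
vecs = context_vectors(tokens, window=4)
```

Even on this tiny example, 'king' and 'monarch' come out more similar to each other than either is to 'sea', because they share contexts ('the', 'ruled', 'land'); the studies surveyed by Turney and Pantel do this over vastly larger data, typically with association weighting rather than raw counts.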
Only a few studies have tested distributional methods as indicators of specific semantic relations. The Quantitative Lexicology and Variational Linguistics (QLVL) team at KU Leuven has addressed this problem in several papers. For example, Peirsman et al. (2007) looked at evidence for synonymy, hyponymy, and co-hyponymy in proximity data for Dutch. (A hyponym is a word whose meaning is a member of a larger category – for example, a crow and a robin are both types of bird, so crow and robin are both hyponyms of bird, and crow and robin are co-hyponyms of each other, but they are not synonyms of each other.) Peirsman et al. looked at raw proximity measures as well as proximity measures that incorporate syntactic dependency information. Their findings demonstrate that in Dutch, synonymy and hyponymy are more readily indicated by proximity analyses that include syntactic dependency. On the other hand, they show that co-hyponymy is most effectively evidenced by raw proximity measures that do not include syntactic information. This is a startling result, with fascinating implications for linguistic theory. Why should ignoring syntactic information provide better measures of co-hyponymy? Might English be similar? How about Early Modern English?
I think it is important to note that in Peirsman et al. (ibid.), only 6.3% of words that share similar distributional characteristics with a given word, or node, are synonymous with that node, and only 4.0% are hyponyms of that node. Put differently, about 94% of the words identified by distributional analysis aren't synonyms, and around 70% of the words elicited by these measures are not semantically related to the node at all. Experienced corpus semanticists will not be surprised by this. But what happens to the majority of words, which aren't related in any clear way? A computer algorithm will output all significant co-occurrences, and the co-occurrences that are not intuitively meaningful are often quietly ignored by the researcher. It seems to me that if we are going to ignore such outputs, we must do so explicitly and with complete transparency. But this raises bigger questions: if we trust our methods, why should we ignore counterintuitive outputs? Or are these methods valuable simply as reproducible heuristics? I would argue that we should be transparent about our perspective on our own methods.
Also from QLVL, Heylen et al. (2008a) test which types of syntactic dependency relation are most effective at indicating synonymy in Dutch nouns, finding that Subject and Object relations most consistently indicate synonymy, but that adjective modification can give the best (though less consistent) indication of synonymy. In fact, adjective modification alone can outperform a combined method using adjective modification and Subject/Object relations. Again, the findings are startling and fascinating: why would the consideration of Subject/Object relations actually hinder the effective use of adjective modification as evidence of synonymy? The answer is not entirely clear. In a comparable study, Van der Plas and Bouma (2005) found Direct Object relations and adjective modification to be the most effective relations for identifying synonymy in Dutch; unlike Heylen et al. (2008a), however, they found that combining dependency relations improved synonym identification.
Is proximity data more effective in determining the semantics of highly frequent words? Heylen et al. (2008b) showed that in Dutch, high frequency nouns are more likely to collocate within +/-3 words with nouns that have a close semantic similarity, in particular synonyms and hyponyms. Low frequency nouns are less likely to do so. In addition, in Dutch, syntactic information is the best route to identifying synonymy and hyponymy overall, but raw proximity information is in fact slightly better at retrieving synonyms for medium-frequency nouns. This finding, then, elaborates on the finding in Peirsman et al. (2007; above).
How about word class? Peirsman et al. (2008) suggest, among other things, that in Dutch, a window of +/-2 words best identifies semantic similarity for nouns, while +/-4 to 7 words is most effective for verbs.
For Linguistic DNA, it is important to know exactly what we can and can’t expect to determine based on distributional analysis. We plan to employ distributional analysis using a range of proximity windows as well as syntactic information. The team will continue to report on this question as we move forward.
*Castle Arenberg, in the photo above, is part of KU Leuven, home of QLVL and many of the studies cited in this post. (Credit: Juhanson. Licence: CC BY-SA 3.0.)
Heylen, Kris; Peirsman, Yves; Geeraerts, Dirk. 2008a. Automatic synonymy extraction: A Comparison of Syntactic Context Models. In Verberne, Suzan; van Halteren, Hans; Coppen, Peter-Arno (eds), Computational linguistics in the Netherlands 2007. Amsterdam: Rodopi, 101-16.
Heylen, Kris; Peirsman, Yves; Geeraerts, Dirk; Speelman, Dirk. 2008b. Modelling word similarity: An evaluation of automatic synonymy extraction algorithms. In Calzolari, Nicoletta; Choukri, Khalid; Maegaard, Bente; Mariani, Joseph; Odijk, Jan; Piperidis, Stelios; Tapias, Daniel (eds), Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08). Marrakech: European Language Resources Association, 3243-49.
Peirsman, Yves; Heylen, Kris; Speelman, Dirk. 2007. Finding semantically related words in Dutch. Co-occurrences versus syntactic contexts. In Proceedings of the 2007 Workshop on Contextual Information in Semantic Space Models: Beyond Words and Documents, 9-16.
Peirsman, Yves; Heylen, Kris; Geeraerts, Dirk. 2008. Size matters: tight and loose context definitions in English word space models. In Proceedings of the ESSLLI Workshop on Distributional Lexical Semantics, 34-41.
Turney, Peter D. and Patrick Pantel. 2010. From Frequency to Meaning: Vector Space Models of Semantics. Journal of Artificial Intelligence Research 37, 141-188.
van der Plas, Lonneke and Gosse Bouma. 2005. Syntactic Contexts for Finding Semantically Similar Words. In Proceedings of CLIN 04.
Naomi Tadmor: Semantic analysis of keywords in context
On 30 October, Prof. Naomi Tadmor led a workshop at the University of Sheffield, hosted by the Sheffield Centre for Early Modern Studies. In what follows, I briefly summarise Tadmor’s presentation, and then provide some reflections related to my own work, and to Linguistic DNA.
The key concluding points that Tadmor advanced are, I think, important for any work with historical texts, and thus crucial to historical research:
- Understanding historical language (including word meaning) is necessary for understanding historical texts
- To understand historical language we must analyse it in context.
- Analysing historical language in context requires close reading.
Whether we identify as historians, linguists, corpus linguists, literary scholars, or otherwise, we would do well to keep these points in mind.
Tadmor’s take on historical keywords
Tadmor’s specific arguments in the master class focused on kinship terms. In Early Modern English (EModE), there was a broad array of referents for kinship terms such as brother, mother, father, sister, and associated terms like family and friend, which are not likely to be intuitive to a speaker of Present Day English (PDE). Evidence shows, for example, that family often referred to all of the individuals living in a household, including servants, to the possible exclusion of biological relations living outside of the household. The paper Tadmor asked us to read in advance (first published in 1996), supplemented with other examples at the masterclass, provides extensive illustrations of the nuance of family and other kinship terms.
In EModE, there was also a narrow range of semantic or pragmatic implications related to kinship terms: these meanings generally involved social expectations, social networks, or social capital. So, father could refer to ‘biological father’ or ‘father-in-law’ (or even ‘King’), and implied a relationship of social expectation (rather than, for example, a relationship of affection or intimacy, as might be implied in PDE).
By identifying both the array of referents and the implications or senses conveyed by these kinship terms, Tadmor provides a thorough illustration of the terms’ lexical semantics. We can see this method as being motivated by historical questions (about the nature of Early Modern relationships); driven in its first stage by lexicology (insofar as it begins by asking about words, their referents, and senses); and then, in a final stage, employing lexicological knowledge to analyse texts and further address the initial historical questions. Tadmor avoids circularity by using one data set (in her 1996 paper) to identify a hypothesis regarding lexical semantics, and another data set to test her hypothesis. What do these observations about lexical semantics tell us about history? As Tadmor notes, it is by identifying these meanings that we can begin to understand categories of social actions and relationships, as well as motivations for those actions and relationships. Perhaps more fundamentally, it is only by understanding semantics in historical texts that we can begin to understand the texts meaningfully.
A Corpus Linguist’s take on Tadmor’s methods
Reflecting on Tadmor’s talk, I’m reminded of the utility of the terms semasiology and onomasiology. In semantic research, semasiology is an approach which takes a term as its object of inquiry, and proceeds to identify the meanings of that word. Onomasiology is an approach which begins with a meaning, and then identifies the various terms for expressing it. Tadmor’s method is largely semasiological, insofar as it looks at the meanings of the term family and other kinship terms. This approach begins in a relatively straightforward way: find all of the instances of the word (or lemma), and you can then identify its various senses. The next step is more difficult: how do you distinguish its senses? In linguistics, a range of methods is available, with varying degrees of rigour and reproducibility, and it is important that these methods be outlined clearly. Tadmor’s study is also onomasiological, as she compares the different ways (often within a single text) of referring to a given member of the household family. This approach is less straightforward: how do you identify each time a member of the family is referred to? Again, a range of methods is available, each with its own advantages and disadvantages. A clear statement and justification of the choice of method renders any study more rigorous. In my experience, thinking in terms of onomasiology and semasiology is useful in developing a systematic and rigorous study.
Semasiology and onomasiology allow us to distinguish types of study and approaches to meaning, which can in turn help render our methods more explicit and clear. Similarly, distinguishing editorially between a word (e.g. family) and a meaning (e.g. ‘family’) is useful for clarity. Indeed, thinking methodologically in terms of semasiology and onomasiology encourages editorial clarity regarding terms and meanings. In Tadmor’s 1996 paper, double quotes (e.g. “family”) are used at various points to refer to either the word family or the meaning ‘family’. At times, such a paper could be rendered clearer, it seems to me, by adopting consistent editorial conventions like those used in linguistics (e.g. quotes or all caps for meanings, italics for terms). The distinction between a term and a meaning is by nature not always clear or certain: that difficulty is all the more reason for journals to adhere to rigorously defined editorial conventions.
From the distinction between terms and concepts, we can move to the distinction between senses and referents. It is important to be explicit both about changes in referent and changes in sense, when discussing semasiological change. For example, as historians and linguists, we must be sure that when we identify changes in a word’s referents (e.g. father referring to ‘father-in-law’), we also identify whether there are changes in its sense (e.g. ‘a relationship of social expectation’ or ‘a relationship of affection and intimacy’). When Thomas Turner refers to his father-in-law as father, he seems to be using the term, as identified by Tadmor, in its Early Modern sense implying ‘a relationship of social expectation’ rather than in the possible PDE sense implying ‘a relationship of affection and intimacy’. The terms referent and sense allow for this distinction, and are useful in practice when conducting this kind of semantic analysis.
Of course, if a term becomes polysemous, it can be applied to a new range of referents, with a new sense, or even with new implicatures or connotations. For example, we can imagine (perhaps counterfactually) a historical development in which family might have come to refer to cohabitants who were not blood relations. At the same time, in referring to those cohabitants who were not blood relations, family might have ceased to imply any kind of social expectation, social network, or social capital. That is, it’s possible for both the referent and the sense to change. In this case, as Tadmor has shown, that doesn’t seem to be what’s happened, but it’s important to investigate such possible polysemies.
Future possibilities: Corpus linguistics
As a corpus linguist, I’d be interested in investigating Tadmor’s semantic findings via a quantitative onomasiological study, looking more closely at selection probabilities. Such a study could ask research questions like:
- Given that an Early Modern writer is expressing ‘nuclear family’, what is the probability of using term a, b, etc., in various contexts?
- Given that a writer is expressing ‘household-family’, what is the probability of using term a, b, etc., in various contexts?
- Given that a writer is expressing ‘spouse’s father’ or ‘brother’s sister’, etc., what is the probability of using term a, b, etc., in various contexts?
These onomasiological research questions (unlike semasiological ones) allow us to investigate logical probabilities of selection processes. This renders statistical analyses more robust. Changes in probabilities of selection over time are a useful illustration of onomasiological change, which is an essential part of semantic change.
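As a sketch of what the core computation of such a study might look like: given a corpus in which occurrences have been annotated for the concept being expressed, selection probabilities reduce to relative frequencies of terms per concept. The counts and concept labels below are entirely invented, purely to illustrate the shape of the calculation.

```python
from collections import Counter

# Hypothetical annotated counts: how often each term was used where an
# annotator judged the intended meaning to be the given concept.
onomasiological_counts = {
    "household-family": Counter({"family": 40, "household": 25, "house": 10}),
    "nuclear family":   Counter({"family": 15, "kin": 5}),
}

def selection_probabilities(counts, concept):
    """P(term | concept): the relative frequency of each term
    among all expressions of the given concept."""
    total = sum(counts[concept].values())
    return {term: n / total for term, n in counts[concept].items()}

probs = selection_probabilities(onomasiological_counts, "household-family")
print(probs)
```

Comparing such probability distributions across sub-periods or registers would then illustrate onomasiological change directly.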
And for Linguistic DNA?
For Linguistic DNA, I see (at least) two major questions related to Tadmor’s work:
- Can automated distributional analysis uncover the types of phenomena that Tadmor has uncovered for family?
- What is a concept for Tadmor, and how can her work inform our notion of a concept?
In response to the first question, it is certainly possible that distributional analysis can reflect changing referents (such as ‘father-in-law’ referred to as father). Hypothetically, the distribution of father with a broad array of referents might entail a broad array of lexical co-occurrences. In practice, however, this might be very, very difficult to discern. Hence Tadmor’s call for close reading. It is perhaps more likely that the sense (as opposed to referent) of father as ‘a relationship involving social expectations’ might be reflected in co-occurrence data: hypothetically, father might co-occur with words related to social expectation and obligation. We have evidence that semantically related words tend to constitute only about 30% of significant co-occurrences. Optimistically, it might be that the remaining 70% of words do suggest semantic relationships, if we know how to interpret them—in this case, maybe some co-occurrences with family would suggest the referents or implications discussed here. Pessimistically, it might be that if only 30% of co-occurring words are semantically related, then there would be an even lower probability of finding co-occurring words that reveal such fine semantic or pragmatic nuances as these. Thanks to Tadmor’s work, Linguistic DNA might be able to use family as a test case for what can be revealed by distributional analysis.
What is a concept? Tadmor (1996) doesn’t define concept, and sometimes switches quickly, for example, between discussing the concept ‘family’ and the word family, which can be tricky to follow. At times, concept for Tadmor seems to be similar to definition—a gloss for a term. At other times, concept seems to be broader, suggesting something perhaps with psycholinguistic reality, a sort of notion or idea held in the mind that might relate to an array of terms. Or, concept seems to relate to discourses, to shared social understandings that are shaped by language use. Linguistic DNA is paying close attention to operationalising and/or defining concept in its approach to conceptual and semantic change in EModE. Tadmor’s work points in the same direction that interests us, and the vagueness of concept which Tadmor engages with is vagueness that we are engaging with as well.
Distributional Semantics I: What might distribution tell us about word meaning?
In a previous post, I asked ‘What is the link between corpus data showing lexical usage, on the one hand, and lexical semantics or concepts, on the other?’ In this post, I’d like to forward that discussion by addressing one component of it: how we observe lexical semantics (or word meaning) via distributional data in texts. That is, how do we know what we know about semantics from distributional data?
Linguists use proximity data from corpora to analyse everything from social implications of discourse, to politeness in pragmatics, to synonymy and hyponymy. Such data is also used by researchers in statistical natural language processing (NLP) for information retrieval, topic identification, and machine learning, among other things. Different researchers tend to use such data towards different ends: for some NLP researchers, it is enough to engineer a tool that produces satisfactory outputs, regardless of its implications for linguistic theory. For sociolinguists and discourse analysts, the process is often one of identifying social or behavioural trends as represented in language use (cf. Baker et al. 2013, Baker 2006). Despite the popularity of studies into meaning and corpora, the question of precisely what sorts of meaning can or can’t be indicated by such data remains remarkably under-discussed.
So, what aspects of meaning, and of word meaning in particular, might be indicated by proximity data?
Many introductory books on corpus semantics would seem to suggest that if you want to know what kinds of word meaning can be indicated by proximity data and distributional patterns, examining a list of co-occurring words, or words that occur in similar contexts, is a good start. Often, the next step (according to the same books) is to look closely at the words in context, and then to perform a statistical analysis on the set of co-occurrences. The problem arises in the last step. All too often, the results are interpreted impressionistically: which significant co-occurrences are readily interpretable in relation to your research questions? You may see some fascinating and impressive things, or you may not, and it’s too easy to disregard outputs that don’t seem relevant on the surface.
An operation like that described above lacks rigour in multiple ways. To disregard outputs that aren’t obviously relevant is to ignore what is likely to be some of the most valuable information in any corpus study (or in any scientific experiment). In addition, the method skips the important step of accounting for the precise elements of meaning in question, and how (or indeed whether) those elements might be observed in the outputs.
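The mechanical first steps of such an operation (counting a node's co-occurrences within a window and scoring each pair with an association measure such as pointwise mutual information) are straightforward to automate; the rigour problem lies in what we then do with the full list. Below is a minimal Python sketch, on an invented toy corpus, which deliberately returns every scored collocate rather than a hand-picked subset:

```python
from collections import Counter
from math import log2

def pmi_collocates(tokens, node, window=4):
    """Return ALL collocates of `node` within +/- `window` positions,
    scored by pointwise mutual information, sorted high to low."""
    word_freq = Counter(tokens)
    cooc = Counter()
    for i, w in enumerate(tokens):
        if w != node:
            continue
        cooc.update(tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window])
    total_ctx = sum(cooc.values())
    if total_ctx == 0:
        return []
    scores = {
        # PMI: log2 of P(word | near node) over P(word) in the corpus
        c: log2((k / total_ctx) / (word_freq[c] / len(tokens)))
        for c, k in cooc.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Invented toy corpus; a real study would use a tokenised corpus sample.
text = ("the crow sat in the tree . the robin sat in the tree . "
        "the crow ate a worm . the robin ate a worm .").split()
for word, score in pmi_collocates(text, "crow", window=2):
    print(word, round(score, 2))
```

The point of returning the whole ranked list is exactly the transparency argued for above: low-scoring or uninterpretable collocates stay visible rather than being quietly dropped.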
In Early Modern English, an analysis of proximity data might (hypothetically) show a significant similarity between the terms abode and residence. Such pairs are straightforward and exciting: we can readily see that we have automatically identified near-synonyms.
Often, researchers are looking to identify synonymy. But that’s not all: researchers might also be after hyponymy, co-hyponymy, antonymy, meronymy, auto-hyponymy, polysemy, or conceptual or discursive relations. In addition, as Geeraerts (2010: 178) points out, we might want to find out specific details about what a noun’s referent looks like, for example. Can we retrieve any of that information (reliably or consistently) from distributional data, i.e. from co-occurrences in texts?
Examples like abode and residence aren’t the norm. We also see examples like build and residence. What is the meaning relation here? Action and undergoer? A conceptual field related to building residences? Something else entirely?
And what about other pairs of terms with no clear semantic relation whatsoever? Do we disregard them? Impressionistically, it’s easy to pick out the instances of synonymy, or even relationships like Action/Undergoer or Agent/Patient, and to ignore the huge number of semantically unrelated collocates (or collocates with less obvious relations). But that’s not a terribly rigorous method.
By definition, we know that in proximity data we are observing words that co-occur. It remains, then, to test what kinds of semantic relations are actually indicated, quantitatively, by co-occurrence. This moves us from the vague statement that words are known by the company they keep towards a scientific account of the relationship between co-occurrence and semantic relations. In the next post (coming soon), I report on exactly that.
Baker, P. (2006) Using Corpora in Discourse Analysis. London: Continuum.
Baker, P., Gabrielatos, C. and McEnery, T. (2013) Discourse Analysis and Media Attitudes: The Representation of Islam in the British Press. Cambridge: Cambridge University Press.
Geeraerts, D. (2010) Theories of Lexical Semantics. Oxford: Oxford University Press.