When the Linguistic DNA project was first conceived, we aimed to incorporate more than 200 000 items from Eighteenth Century Collections Online (ECCO). Comparing findings for one portion of ECCO that has been digitised in different ways, this 2016 blogpost details why that ambition proved impractical. The public database uses ECCO-TCP as its main eighteenth-century source.
Linguistic DNA Co-Investigator Justyna Robinson attended the Sociolinguistics Symposium at the University of Murcia, Spain, 15-18 June. This year’s conference theme was ‘attitudes and prestige’, and the event included over 1,000 presentations. Justyna represented LDNA with a poster in the general poster session entitled ‘Linguistic DNA: Modelling concepts and semantic change in English, 1500-1800’. Below, she reflects on her experience:
For the LDNA project, one of the really important panel sessions was the one organised by Terttu Nevalainen and Marijke van der Wal, entitled Historical sociolinguistics: Dispelling myths about the past. The session included papers which aimed at revisiting a range of assumptions about the past and the study of the past that are not supported by historical sociolinguistic research. Particularly important for LDNA were the methodological papers, which interrogated the methods of historical linguistic research. For example, in ‘People, work, values: Tracing societal change through linguistic shifts’, Minna Palander-Collin, Anni Sairio, Minna Nevala, and Brendan Humphries (University of Helsinki) explored social changes in Britain between 1750 and 1900 by analysing keywords within the conceptual domains of PEOPLE, WORK, and VALUES. Questions that emerged from the discussion of this research included whether social shifts can be identified in keywords. This quickly led to asking what concepts are and what kind of relationship exists between keywords and concepts. Although the answer to this question wasn’t decided, there was a unanimous desire to explore the question further. Another observation from Palander-Collin et al.’s talk was that certain concepts can linger on in language when in practice the real-life referents designated by the concepts may be long gone. Miriam Meyerhoff added that this issue was also observed in New Zealand data, i.e. researchers looking at Maori keywords found that references to certain plants lingered on in narratives of a community, well after the time these plants were used. In this discussion the audience continued to reference the LDNA project as well. It was great to hear that more and more people know about LDNA and are following our progress.
The LDNA poster presentation was set up in a beautiful setting in one of the cloisters at Murcia University. The poster attracted a lot of attention. In it, we presented first findings from using positive and negative PMI values to model discursive concepts around the word soldier in the window of +/-100 words. Having set such a large proximity window, we did not initially know whether what we would find would be interesting and useful in our quest to determine what concepts are. One conclusion from this analysis was that large proximity windows still yield meaningful information and clear semantic domains emerge that are important in grasping the discursive concepts around soldier. Another methodological finding of this research is the value of using negative PMI values in improving our understanding of what concepts are. Thus, soldier shows a notably rare association with a group of items that are semantically cohesive. These include religious terms, such as sin and church. One may ask whether this systematic weak correlation may indicate the end of a disappearing concept or the beginning of the development of a new concept. These questions will soon be answered by looking at our data diachronically.
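To make the measure concrete, here is a minimal Python sketch of PMI within a symmetric proximity window. It is an illustration only, not the LDNA processor: the function name `window_pmi`, the toy token list, and the expected-frequency estimate (node frequency × collocate frequency × span ÷ corpus size) are assumptions chosen for brevity. A positive score means a pair co-occurs more often than chance would predict, and a negative score means less often.

```python
import math
from collections import Counter

def window_pmi(tokens, node, window=100):
    """PMI (log base 2) between `node` and each word seen within +/-window tokens of it."""
    n = len(tokens)
    freq = Counter(tokens)
    observed = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo, hi = max(0, i - window), min(n, i + window + 1)
        for j in range(lo, hi):
            if j != i:
                observed[tokens[j]] += 1
    span = 2 * window  # number of context slots around each node occurrence
    scores = {}
    for word, count in observed.items():
        if word == node:
            continue
        # Co-occurrences expected if words were scattered at random.
        expected = freq[node] * freq[word] * span / n
        scores[word] = math.log2(count / expected)
    # Words never seen near the node get no score (the log of zero is undefined).
    return scores

# Invented toy text, with a tiny window so the arithmetic is easy to follow.
tokens = "soldier army a sin b sin c sin d soldier army".split()
scores = window_pmi(tokens, "soldier", window=2)
```

In this toy text `army` receives a positive score and `sin` a negative one, because `sin` is frequent overall but appears near `soldier` only once.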
Abstracts from the conference are available from the conference website.
The workshop day, titled ‘Semantic Spaces at the Intersection of NLP, Physics, and Cognitive Science’, was part of a larger Quantum Physics and Logic (QPL) conference held at the University of Strathclyde. The workshop focussed on computational approaches to modelling semantics and semantic relations in language. The day was divided into three parts: the first session was concerned with the application of principles derived from physics and formal logic to the expression of linguistic phenomena; the middle section segued this into consideration of Natural Language Processing (NLP); whilst the final section covered cognitive science and cognitive linguistics’ views of semantics. My interest in attending this was to get an idea of the approach which ‘hard science’ is taking to aspects of semantics which overlap with the research of the Linguistic DNA project, as well as to see if there were anything that we might be able to apply to our own work.
A subject which struck a chord was the discussion of vector space modelling, which is near the top of our list of topics to be implemented as we approach the point where we move from identifying word pairs to establishing clusters of related words. The subject was touched on in several of the papers, with particular relevance to the final paper of the day, in which Stephen McGregor described work done by himself and colleagues to locate ‘subspaces’ within vector space models which delineate an analogical relationship between different words. Beginning with an SAT-style statement that ‘dog is to cat as puppy is to kitten’, the paper used PMI measurements as a basis on which to plot these words in vector space, and then examined the geometrical relationship of the points to demonstrate how it might be possible to define a subspace within the vector space and thus automatically identify the positions of analogical partner words or concepts.
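As a rough illustration of the geometry involved, the sketch below uses the simplest reading of analogy in vector space: the vector-offset (parallelogram) test. This is a stand-in and not the subspace method described in the paper. The three-dimensional vectors are invented for the example (real ones would be built from PMI scores), and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms

def complete_analogy(a, b, c, vectors):
    """Return the word d that best completes 'a is to b as c is to d'."""
    # Move from c by the same offset that leads from a to b.
    target = [cv - av + bv for av, bv, cv in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# Invented toy vectors; dimensions loosely mean (canine, feline, young).
vectors = {
    "dog": [1.0, 0.0, 0.0],
    "cat": [0.0, 1.0, 0.0],
    "puppy": [1.0, 0.0, 1.0],
    "kitten": [0.0, 1.0, 1.0],
    "foal": [0.5, 0.5, 1.0],
}
answer = complete_analogy("dog", "cat", "puppy", vectors)
```

With these toy values `answer` is `kitten`, since the offset from `dog` to `cat` applied to `puppy` lands exactly on the `kitten` vector.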
The NLP section of the workshop was dominated by Categorical Compositional Distributional semantics and the ways in which researchers using this approach are mapping the emergence of meaning from syntactic structure. The morning’s physics papers had discussed in some detail the application of formal logic expressions to sentence semantics, describing, for example, the way in which a transitive verb combines with subject and object nouns to ‘output’ the meaning of the sentence. These papers applied this theoretical approach to specific sentence elements, such as Dimitri Kartsaklis’ analysis of coordination and Mehrnoosh Sadrzadeh’s study of quantifiers. To me, these papers chimed with work Seth has been doing, considering the importance of handling different parts of speech in different ways during processing; they made clear the flaws in the so-called ‘bag-of-words’ approach to computational linguistics and highlighted that, in the long run, consideration of syntax should be an important part of the kind of computational semantics we’re undertaking.
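The idea of a transitive verb ‘outputting’ a sentence meaning can be sketched in a few lines of Python. In the categorical compositional approach, a transitive verb is commonly modelled as a three-way tensor that is contracted with the subject and object vectors. The two-dimensional spaces, the basis labels, and the values below are invented for illustration and are not taken from any of the papers presented.

```python
def transitive_sentence(subject, verb, obj):
    """Contract a verb tensor with subject and object vectors.

    verb[i][j] is a sentence-space vector for subject basis i and object basis j,
    so the result is s[k] = sum over i, j of subject[i] * verb[i][j][k] * obj[j].
    """
    size = len(verb[0][0])
    sentence = [0.0] * size
    for i, s_weight in enumerate(subject):
        for j, o_weight in enumerate(obj):
            for k in range(size):
                sentence[k] += s_weight * verb[i][j][k] * o_weight
    return sentence

# Toy noun space with basis (animal, food); toy sentence space (plausible, implausible).
dogs = [1.0, 0.0]
meat = [0.0, 1.0]
eats = [
    [[0.0, 1.0], [1.0, 0.0]],  # animal subject: eating an animal / eating food
    [[0.0, 1.0], [0.0, 1.0]],  # food subject: implausible whatever the object
]
dogs_eat_meat = transitive_sentence(dogs, eats, meat)
meat_eats_dogs = transitive_sentence(meat, eats, dogs)
```

The two sentences contain the same words but receive different vectors, which is exactly the distinction a bag-of-words model cannot make.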
Also of special interest was Peter Gärdenfors’ consideration of domains as components of word meanings. In the main, the point was illustrated through consideration of nouns (although touching on other parts of speech), asking whether it might be helpful to think of words as fundamentally dependent on domains such as place, shape, and temperature (so that, for example, ‘round’ maintains some connection to the sense of a curve in physical space even when not used as a noun, whilst most ‘verbed’ nouns retain important connections to their parent’s referent). Whilst bearing mostly indirect applicability to current LDNA work, this discussion is important food for thought, especially for its potential impact on the encyclopedic aspect of a word’s semantics in context.
The workshop provided a thought-provoking day to a relative outsider, offering an important viewpoint on the other approaches to semantics which are being pioneered outside of arts faculties, an awareness which can only strengthen our own work. I’d like to thank the organisers and contributors to the workshop for a hugely interesting and intellectually engaging day.
For the past couple of months, our rolling horizon has looked increasingly full of activity. This new blogpost provides a brief update on where we’ve been and where we’re going. We’ll be aiming to give more thorough reports on some of these activities after the events.
Where we’ve been
In May, Susan, Iona and Mike travelled to Utrecht, at the invitation of Joris van Eijnatten and Jaap Verheul. Together with colleagues from Sheffield’s History Department, we presented the different strands of Digital Humanities work ongoing at Sheffield. We learned much from our exchanges with Utrecht’s AsymEnc and Translantis research programs, and enjoyed shared intellectual probing of visualisations of change across time. We look forward to continued engagement with each other’s work.
A week later, Seth and Justyna participated in This&THATCamp at the University of Sussex (pictured), with LDNA emerging second in a popular poll of topics for discussion at this un-conference-style event. Productive conversations across the two days covered data visualisation, data manipulation, text analytics, digital humanities and even data sonification. We hope to hear more from Julie Weeds and others when the LDNA team return to Brighton in September.
Next week, we’ll be calling on colleagues at the HRI to talk us through their experience visualising complex humanities data. Richard Ward (Digital Panopticon) and Dirk Rohman (Migration of Faith) have agreed to walk us through their decision-making processes, and talk through the role of different visualisations in exploring, analysing, and explaining current findings.
Where we’re going
The LDNA team are also gearing up for a summer of presentations:
- Justyna Robinson will be representing LDNA at Sociolinguistics Symposium (Murcia, 15-18 June), as well as sharing the latest analysis from her longitudinal study of semantic variation focused on polysemous adjectives in South Yorkshire speech. Catch LDNA in the general poster session on Friday (17th), and Justyna’s paper at 3pm on Thursday. #SS21
- Susan Fitzmaurice is in Saarland, as first guest speaker at the Historical Corpus Linguistics event hosted by the IDeaL research centre, also on Thursday (16th June) at 2:15pm. Her paper is subtitled “Discursive semantics and the quest for the automatic identification of concepts and conceptual change in English 1500-1800”. #IDeaL
- In July, the Glasgow LDNA team are Krakow-bound for DH2016 (11-16 July). The LDNA poster, part of the Semantic Interpretations group, is currently allocated to Booth 58 during the Wednesday evening poster session. Draft programme.
- Later in July, Iona heads to SHARP 2016 in Paris (18-22). This year, the bi-lingual Society are focusing on “Languages of the Book”, with Iona’s contribution drawing on her doctoral research (subtitle: European Borrowings in 16th and 17th Century English Translations of “the Book of Books”) and giving attention to the role of other languages in concept formation in early modern English (a special concern for LDNA’s work with EEBO-TCP).
- In August, Iona is one of several Sheffield early modernists bound for the Sixteenth Century Society Conference in Bruges. In addition to a paper in panel 241, “The Vagaries of Translation in the Early Modern World” (Saturday 20th, 10:30am), Iona will also be hosting a unique LDNA poster session at the book exhibit. (Details to follow)
- The following week (22-26 August), Seth, Justyna and Susan will be at ICEHL 19 in Essen. Seth and Susan will be talking LDNA semantics from 2pm on Tuesday 23rd.
Back in the UK, on 5 September, LDNA (and the University of Sussex) host our second methodological workshop, focused on data visualisation and linguistic change. Invitations to a select group of speakers have gone out, and we’re looking forward to a hands-on workshop using project data. Members of our network who would like to participate are invited to get in touch.
And back in Sheffield, LDNA is playing a key role in the 2016 Digital Humanities Congress, 8-10 September, hosting two panel sessions dedicated to textual analytics. Our co-speakers include contacts from Varieng and CRASSH. Early bird registration ends 30th June.
Back in 2012, HRI Digital ran a project, with the departments of English, History, and Sociological Studies, looking at participatory search design. The project took as its focus a subset of George Thomason’s 17th-century newsbooks, transcribing every issue of Mercurius Politicus plus the full selection of newsbooks published in 1649 (from the images available through ProQuest’s Early English Books Online). Building the interactive interface, the Newsbooks project focused on how researchers interact with (and want to interact with) such historical texts. Thus, for example, search results may feature texts published at the same point in time. A problem not resolved in the original phase was variant spellings, and the humanities investigators held onto concerns about (in)accuracy in the transcriptions.
The tools tried out for Linguistic DNA have provided a fresh mechanism to improve the Newsbooks’ searchability. Sheffield MA student Amy Jackson recently completed a 100-hour work placement investigating how a MorphAdorned version of the Newsbooks could inform questions about the accuracy of transcriptions, and how a statistically-organised representation of the language data (an early output of LDNA’s processor) affects understanding of the content and context of Thomason’s collection.
My main task during my placement has been to find errors within the newsbooks, both printing and transcription errors, in order to improve the searchability of the newsbooks. I’ve been using methods such as checking hapax legomena (words that only occur once within a text or collection of texts) and Pointwise Mutual Information (PMI).
Note from the editors: PMI measures word associations by comparing observed cooccurrences with what might be expected in a random wordset (based on the same data).
—Expect a blog post on this soon!
The hurried composition of the newsbooks causes problems for searchability. It seems those printing the newsbooks were less concerned with accuracy than those who were printing books. This can be seen in several examples that I have found while searching through the hapax legomena. For example, on one occasion ‘transmitted’ is printed as ‘trasmitte4’ with a ‘4’ being used as a substitute for a missing ‘d’ (see image above). Elsewhere the number ‘8’ is used as a substitute for a capital ‘S’, printing ‘Sea’ as ‘8ea’. Such printing decisions present a specialised problem for searches because they are unusual. Knowing this characteristic (replacing letters with numbers) means one can look at modifying search rules to improve the ‘success’ in finding relevant information.
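The kind of check described here can be sketched in Python. This is a minimal illustration under stated assumptions, not the tooling used on the placement: the token list is invented (apart from the two attested forms ‘trasmitte4’ and ‘8ea’), and the helper names are hypothetical. It lists words that occur once and flags those that mix letters and digits.

```python
import re
from collections import Counter

def hapax_legomena(tokens):
    """Return the words that occur exactly once, in sorted order."""
    counts = Counter(tokens)
    return sorted(word for word, count in counts.items() if count == 1)

def mixes_letters_and_digits(word):
    """True for forms such as 'trasmitte4' or '8ea', but not for plain numbers."""
    return bool(re.search(r"[A-Za-z]", word)) and bool(re.search(r"\d", word))

# Invented sample line containing the two printing errors quoted above.
tokens = ["the", "news", "was", "trasmitte4", "by", "8ea", "and", "the", "news", "of", "1649"]
hapaxes = hapax_legomena(tokens)
flagged = [word for word in hapaxes if mixes_letters_and_digits(word)]
```

Here `flagged` contains ‘8ea’ and ‘trasmitte4’ but not the date ‘1649’, so a search rule could try letter substitutions only on the flagged forms.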
High PMI values can also be used to find unusual words or word pairs that aren’t errors. While I was searching through the high PMI values I came across the word ‘King-chopper’ – used as an insult to refer to Colonel John ‘Tinker’ Fox who was falsely rumoured to be one of King Charles I’s executioners in 1649. The Man in the Moon, the newsbook in which the reference appears, was printed by John Crouch. Crouch was a Royalist journalist who was arrested and imprisoned for printing The Man in the Moon after the King’s death.
Mid-range PMI values are useful for understanding how language was used in the newsbooks. ‘Loyal’ often co-occurs with words such as ‘crown’, ‘royalist’, ‘sovereign’, ‘majesty’, ‘Charles’, ‘usurp’, and ‘treason’. This implies that the word ‘loyal’ is mainly being used by Royalist newsbooks in 1649 rather than Parliamentarian newsbooks. If I had more time I would look more closely at the differences in the language used by Royalist and Parliamentarian newsbooks.
PMI and hapax legomena have been useful for finding errors within the newsbooks but they have mainly provided an interesting way for me to interact with the texts. The PMI data often encouraged me to research the newsbooks and the people who printed them further and hapax legomena have provided useful insights into how the newsbooks were printed in 1649.
On Friday 8 April 2016, Susan Fitzmaurice and Seth Mehl attended Diachronic corpora, genre, and language change at the University of Nottingham, where Seth gave a paper entitled Automatic genre identification in EEBO-TCP: A multidisciplinary perspective on problems and prospects. The event featured researchers from around the globe, exploring issues in historical data sets; the nature of genre and text types; and modelling diachronic change.
The day’s plenary speeches were engaging and insightful: Bethany Gray spoke about academic writing as a locus of linguistic change, in contrast to the common expectation that change originates in spoken language. This is particularly relevant for those of us working with older historical data, for whom written language is the only evidence for change. Thomas Gloning described the Deutsches Textarchiv, and in particular the recent addition to that corpus of the Dingler Corpus, an essential record of written scientific German representing 1820 to 1932. Gloning presented the useful definition of text types or genres as ‘traditions of communicative action’. In analysing such text types, or traditions, it is possible to map syntax and lexis to text functions and topics, though Gloning cautioned that some of the most important elements of such mapping are not currently achievable by machines. This is a careful, valuable perspective and approach, which relates to our own (as discussed below).
Other research papers included a presentation by Fabrizio Esposito who, like the Linguistic DNA project, is using distributional semantic methods. His work looks at recent change in White House Press Briefings. Bryan Jurish presented DiaCollo, a powerful tool for analysing and visualising collocation patterns as they change over time in very large data sets. Vaclav Brezina analysed lexical meaning in EEBO-TCP by measuring differences in collocation patterns across overlapping, sliding diachronic windows.
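For readers unfamiliar with overlapping, sliding diachronic windows, the following Python sketch shows one way to generate them. The window width, step, and date range are arbitrary assumptions for illustration and are not the settings used in the work presented.

```python
def sliding_windows(start, end, width, step):
    """Yield (first_year, last_year) pairs that fit inside start..end inclusive."""
    first = start
    while first + width - 1 <= end:
        yield (first, first + width - 1)
        first += step

def windows_for_year(year, windows):
    """A text dated `year` contributes to every window that contains it."""
    return [w for w in windows if w[0] <= year <= w[1]]

# Assumed settings: 20-year windows moving forward 10 years at a time.
windows = list(sliding_windows(1500, 1549, width=20, step=10))
containing_1515 = windows_for_year(1515, windows)
```

Because the step is smaller than the width, a text from 1515 falls into two windows, which smooths the comparison of collocation patterns from one window to the next.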
What did LDNA contribute?
LDNA is asking whether specific concepts emerge uniquely in particular genres, and whether and how those concepts are then adopted and adapted in other genres. Genre is a fuzzy concept, representing categories of texts. Such categories are characterised by formal features such as print layout, phonetics, morphosyntax, lexis, and semantics; and functional features such as purpose of composition, reader expectations, and social and cultural contexts. It is productive to distinguish approaches to genre in different contexts. For Early Modern Studies, categories may be inherited in the canon, and questioned and explored in relation to literature, history, or philosophical or cultural studies; corpus linguistics often seeks a scientifically reproducible approach to genre and aims to learn about language and variation; while Natural Language Processing (NLP) often aims to engineer tools for solving specific tasks. At the Nottingham conference, Seth illustrated his remarks by reflecting on Ted Underwood’s work automatically identifying genres in HathiTrust texts via supervised machine learning. He then laid out the project’s plan of investigating genre (or text types) by categorising Early Modern texts using the outputs of the LDNA processor, alongside other formal text features. This relates to Gloning’s aforementioned assertion that text topic and function might be mapped onto syntax and lexis; in our case, it is a combined mapping of discursive topics or conceptual fields, lexis, morphosyntax, and additional formal features such as the presence of foreign words or the density of punctuation or parts of speech that will allow us to group texts into categories in a relatively data-driven way.
The conference was very well organised by Richard J. Whitt, with a lovely lunch and dinner in which attendees shared ideas and dug further into linguistic issues. Susan and Seth were delighted to participate.
In wrapping up the first year of LDNA, I’ve taken a moment to consider some of the over-arching questions that have occupied much of my creative and critical faculties so far. What follows is a personal reflection on some issues that I’ve found especially exciting and engaging.
Semantics and concepts
The Linguistic DNA project sets out to identify ‘semantic and conceptual change’ in Early Modern English texts, with attention to variation too, particularly in the form of semantic and conceptual variation across text types. The first questions, for me, then, were what exactly constitutes semantics and what we mean when we say concept. These are, in part, abstract questions, but they must also be defined in terms of practical operations for computational linguistics. Put differently, if semantics and concepts are not defined in terms of features that can be identified automatically by computer, then the definitions are not terribly useful for us.
My first attempt at approaching semantics and concepts for the project began with synonymy, then built up to onomasiological relationships, and then defined concepts as networks of onomasiological relationships. Following Kris Heylen’s visit, I realised just how similar this approach was to the most recent QLVL work. My next stab at approaching these terms moved towards an idea of encyclopaedic meaning inspired in part by the ‘encyclopaedic semantics’ of Cognitive Linguistics, and related to sets of words in contexts of use. This approach seemed coherent and effective. We have since come to define concepts, for our purposes, as discursive, operating at a level larger than syntactic relations, phrases, clauses, or sentences, but smaller than an entire text (and therefore dissimilar from topic modelling).
Given that the project started without a definition of semantics and concept, it follows that the operationalisation of identifying those terms had not been laid out either. As a corpus semanticist, the natural start for me was to sort through corpus methods for automatic semantic analysis, including collocation analysis, second-order collocations, and vector space models. We continue to explore those methods by sorting through various parameters and variables for each. Most importantly, we are working to analyse our data in terms of linguistically meaningful probabilities. That is, we are thinking about the co-occurrence of words not simply as data points that might arise randomly, but as linguistic choices that are rarely, if ever, random. This requires us to consider how often linguistic events such as lexical co-occurrences actually arise, given the opportunity for them to arise. If we hope to use computational tools to learn about language, then we must certainly ensure that our computational approaches incorporate what we know about language, randomness, and probability.
Equally important was the recognition that although we are using corpus methods, we are not working with corpora, or at least not with corpora as per standard definitions. I define a corpus as a linguistic data-set sampled to represent a particular population of language users or of language in use. Corpus linguists examine language samples in order to draw conclusions about the populations they represent. EEBO and ECCO are, crucially, not sampled to represent populations—they are essentially arbitrary data sets, collected on the basis of convenience, of texts’ survival through history, and of scholarly interest and bias, among other variables. It is not at all clear that EEBO and ECCO can be used to draw rigorous conclusions about broader populations. Within the project, we often refer to EEBO and ECCO as ‘universes of printed discourse’, which renders them a sort of population in themselves. From that perspective, we can conclude a great deal about EEBO and ECCO, and the texts they contain, but it is tenuous at best to relate those conclusions to a broader population of language use. This is something that we must continually bear in mind.
Rather than seeing the LDNA processor as a tool for representing linguistic trends across populations, I have recently found it more useful to think of our processor primarily as a tool to aid in information retrieval: it is useful for identifying texts where particular discursive concepts appear. Our tools are therefore expected to be useful for conducting case studies of particular texts and sets of texts that exemplify particular concepts. In a related way, we use the metaphor of a topographical map where texts and groups of texts exemplifying concepts rise up like hills from the landscape of the data. The processor allows us to map that topography and then ‘zoom in’ on particular hills for closer examination. This has been a useful metaphor for me in maintaining a sense of the project’s ultimate aims.
All of these topics represent ongoing developments for LDNA, and one of the great pleasures of the project has been the engaging discussions with colleagues about these issues over the last year.
In 2016, Dr Kris Heylen (KU Leuven) spent a week in Sheffield as an HRI Visiting Fellow, demonstrating techniques for studying change in “lexical concepts” and encouraging the Linguistic DNA team to articulate the distinctive features of the “discursive concept”.
Earlier this month, the Linguistic DNA project hosted Dr Kris Heylen of KU Leuven as a visiting fellow (funded by the HRI Visiting European Fellow scheme). Kris is a member of the Quantitative Lexicology and Variational Linguistics (QLVL) research group at KU Leuven, which has conducted unique research into the significance of how words cooccur across different ‘windows’ of text (reported by Seth in an earlier blogpost). Within his role, Kris has had a particular focus on the value of visualisation as a means to explore cooccurrence data and it was this expertise from which the Linguistic DNA project wished to learn.
Kris and his colleagues have worked extensively on how concepts are expressed in language, with case studies in both Dutch and English, drawing on data from the 1990s and 2000s. This approach is broadly sympathetic to our work in Linguistic DNA, though we take an interest in a higher level of conceptual manifestation (“discursive concepts”), whereas the Leuven team are interested in so-called “lexical concepts”.
In an open lecture on Tracking Conceptual Change, Kris gave two examples of how the Leuven techniques (under the umbrella of “distributional semantics”) can be applied to show variation in language use, according to context (e.g. types of newspaper) and over time. A first case study explored the notion of a ‘person with an immigration background’ looking at how this was expressed in high and low brow Dutch-language newspapers in the period from 1999 to 2005. The investigation began with the word allochtoon, and identified (through vector analysis) migrant as the nearest synonym in use. Querying the newspaper data across time exposed the seasonality of media discourse about immigration (high in spring and autumn, low during parliamentary breaks or holidays). It was also possible to document a decrease in ‘market share’ of allochtoon compared with migrant, and—using hierarchical cluster analysis—to show how each term was distributed across different areas of discourse (comparing discussion of legal and labour-market issues, for example). A second comparison examined adjectives of ‘positive evaluation’, using the Corpus of Historical American English (COHA, 1860-present). Organising each year’s data as a scatter plot in semantic space, the path of an adjective could be traced in relation to others—moving closer to or apart from similar words. The path of terrific from ‘frightening’ to ‘great’ provided a vivid example of change through the 1950s and 1960s.
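The ‘market share’ comparison is simple arithmetic, and can be sketched in Python as below. The yearly counts are invented placeholders, not figures from the Leuven study, and the function name is hypothetical. The share of a term is its frequency divided by the combined frequency of the competing terms.

```python
def market_share(counts_by_year, term, rival):
    """Share of `term` among all uses of `term` and `rival`, per year."""
    shares = {}
    for year, counts in sorted(counts_by_year.items()):
        total = counts[term] + counts[rival]
        shares[year] = counts[term] / total if total else None
    return shares

# Invented toy counts, NOT results from the newspaper data discussed above.
counts_by_year = {
    1999: {"allochtoon": 80, "migrant": 20},
    2005: {"allochtoon": 60, "migrant": 40},
}
shares = market_share(counts_by_year, "allochtoon", "migrant")
```

Working with shares rather than raw counts keeps the comparison meaningful even when the overall volume of immigration coverage rises and falls with the news cycle.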
During his visit, Kris explored some of the first outputs from the Linguistic DNA processor, material printed in the British Isles (or in English) in two years, 1649 and 1699, transcribed for the Text Creation Partnership, and further processed with the MorphAdorner tool developed by Martin Mueller and Philip Burns at Northwestern. Having run this through additional processes developed at Leuven, Kris led a workshop for Sheffield postgraduate and early career researchers and members of the LDNA team in which we learned different techniques for visualising the distribution of heretics and schismatics in the seventeenth century.
The lecture audience and workshop participants were drawn from fields including English Literature, History, Computer Science, East Asian Studies, and the School of Languages and Cultures. Prompted partly by the distribution of the Linguistic DNA team (located in Sussex and Glasgow as well as Sheffield), both lecture and workshop were livestreamed over the internet, extending our audiences to Birmingham, Bradford, and Cambridge. We’re exceedingly grateful for the technical support that made this possible.
Time was also set aside to discuss the potential for future collaboration with Kris and others at Leuven, including participation of the QLVL team in LDNA’s next methodological workshop (University of Sussex, September 2016) and other opportunities to build on our complementary fields of expertise.
In February 2016, Linguistic DNA hosted Dr Kris Heylen as an HRI Visiting Fellow, strengthening our links with KU Leuven’s Quantitative Lexicology and Variational Linguistics research group. This post outlines the scheduled public events.
Kris is a researcher based in KU Leuven’s Quantitative Lexicology and Variational Linguistics research group. His research focuses on the statistical modelling of lexical semantics and lexical variation, and more specifically the introduction of distributional semantic models into lexicological research. Alongside his fundamental research on lexical semantics, he also has a strong interest in exploring the use of quantitative, corpus-based methods in applied linguistic research with projects in legal translation, vocabulary learning and medical terminology.
During his stay in Sheffield, Kris will be working alongside the Linguistic DNA team, playing with some of our data, and sharing his experience of visualizing semantic change across time, as well as talking about future research collaborations with others on campus. There will be several opportunities for others to meet with Kris and hear about his work, including a lecture and workshop (details below). Both events are free to attend.
Lecture: 3 March
On Thursday 3rd March at 5pm, Kris will give an open lecture entitled:
Tracking Conceptual Change:
A Visualization of Diachronic Distributional Semantics
ABSTRACT (Kris writes):
In this talk, I will present an overview of statistical and corpus-based studies of lexical variation and semantic change, carried out at the research group Quantitative Lexicology and Variational Linguistics (QLVL) in recent years. As a starting point, I’ll take the framework developed in Geeraerts et al. (1994) to describe the interaction between concepts’ variable lexical expression (onomasiology) and lexemes’ variable meaning (semasiology). Next, I will discuss how we adapted distributional semantic models, as originally developed in computational linguistics (see Turney and Pantel 2010 for an overview), to the linguistic analysis of lexical variation and change.
With two case studies, one on the concept of immigrant in Dutch and one on positive evaluative adjectives in English (great, superb, terrific, etc.), I’ll illustrate how we have used visualisation techniques to investigate diachronic change in both the construal and the lexical expression of concepts.
All are welcome to attend this guest lecture which takes place at the Humanities Research Institute (34 Gell Street). It is also possible to come for dinner after the lecture, though places may be limited and those interested are asked to get in touch with Linguistic DNA beforehand (by Tuesday 1st March).
Workshop: 7 March
On Monday 7th March, Kris will run an open workshop on visualizing language, sharing his own experiments with Linguistic DNA data. Participation is open to students and staff, but numbers are limited and advance registration is required. To find out more, please email Linguistic DNA (deadline: 4pm, Friday 4th March). Those at the University of Sheffield can reserve a place at the workshop using Doodle Poll.
Anyone who would like the opportunity to meet with Kris to discuss research collaborations should get in touch with him via Linguistic DNA as soon as possible so that arrangements can be made.
In the previous post, I presented the theoretical and philosophical underpinnings of distributional methods in corpus semantics. In this post, I touch on the practical background that has shaped these methods.
Means of analysis
The emergence of contemporary distributional methods coincided with the emergence of Statistical Natural Language Processing (NLP) in the 1990s. Statistical NLP relies on probabilistic methods to represent language, annotate terms in texts, or perform additional tasks such as topic identification or information retrieval. By analysing what actually happens in huge numbers of texts, statistical NLP researchers not only describe naturally occurring language, but also model it and make predictions about it. Corpus semantics is crucially linked to that intellectual development in applied science; specifically, contemporary work with proximity measures and distributional methods in corpus semantics often uses the same computational tools and techniques as statistical NLP. Both the tools and the underlying stance are shared: that a statistical and probabilistic account of language is meaningful. Arguably, other fields in the social sciences (such as psychology), and in the life sciences (such as evolutionary biology), have also been shaped by the rise in statistical and probabilistic methods of representation. Such methods represent an epistemology (and perhaps a discourse) that affects the types of knowledge that are sought and the types of observations that are made in a field.
Other links: Psycholinguistics and Discourse Analysis
The theoretical perspectives outlined above also link corpus semantics, proximity measures, and distributional methods to a larger theoretical framework that includes psycholinguistics and discourse analysis. Frequency of words in use, and frequency of co-occurrence in use, are hypothesised to be crucial in human learning and processing of lexical semantics. In very general terms, if we hear or read a word frequently, we’re likely to learn that word more readily and, once we’ve learned it, to mentally process it more quickly. As noted above, corpora contain valuable frequency data for words in use in specific contexts. Today, corpora are often used as a counterpoint or complement to psycholinguistic research, and many researchers have attempted to model psycholinguistic processes using computational processes including distributional semantics.
There has been a tremendous rise recently in discourse analysis using corpora, and its roots go back at least as far as Sinclair and Stubbs. Discourse analysis itself emerges largely from continental philosophical traditions, particularly Foucault’s definition of discourses as ‘practices which systematically form the objects of which they speak’. These practices are often linguistic, and are studied via linguistic acts, language in use in particular contexts. Such research connects the ontology of language as use with the ontology of meaning as encompassing all of the real-world contexts, topics, etc., that surround a term or a set of terms in use. Corpora allow researchers to ask: ‘Given that speakers or writers are discussing a given term, what other terms do the speakers or writers also discuss, and how do such discussions (as practices or acts) define the objects of which they speak?’
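To make that question concrete, here is a minimal sketch in Python of how one might count the terms that occur near a given term. It is an illustration only, not the Linguistic DNA project's own processing: the function name `collocates`, the window size of two words, and the toy sentence are all assumptions made for the example.

```python
from collections import Counter


def collocates(tokens, node, window=2):
    """Count the words found within `window` tokens of each occurrence of `node`.

    Hypothetical helper for illustration; real corpus work would also need
    tokenisation, spelling normalisation, and a much larger text collection.
    """
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the node word itself
                counts[tokens[j]] += 1
    return counts


# Toy text (invented for this example), split on whitespace.
toy_text = ("the king kept his court in the city and the court kept "
            "the law of the king while the city kept its market")
toy_tokens = toy_text.split()

court_collocates = collocates(toy_tokens, "court", window=2)
print(court_collocates.most_common())
```

Raw counts like these tend to be dominated by very frequent words such as 'the', which is one reason researchers usually go on to apply a statistical association measure rather than stopping at simple frequency.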
In order to make sense of proximity measures and distributional methods, it is important to grasp the underlying practicalities outlined above, and the broader theoretical framework to which these methods relate (discussed in a previous post). The idea that a word is known by the company it keeps is by no means an a priori fact, but is premised on a framework of linguistics that developed during the 20th century in relation to concurrent developments in philosophy, technology, and the sciences in general.
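As a rough illustration of what 'the company it keeps' can mean in practice, the sketch below represents each word by counts of its neighbouring words and compares two words by the cosine of the angle between those count vectors. This is one common simplification, assumed here for the sake of example; the names `context_vector` and `cosine`, the window size, and the toy sentences are all hypothetical.

```python
import math
from collections import Counter


def context_vector(tokens, target, window=2):
    """Represent `target` by counts of the words found within `window` tokens of it."""
    vec = Counter()
    for i, tok in enumerate(tokens):
        if tok == target:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    vec[tokens[j]] += 1
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy sentences (invented for this example).
sample_tokens = ("she drank hot tea today he drank hot coffee today "
                 "they read the book quietly").split()

tea = context_vector(sample_tokens, "tea")
coffee = context_vector(sample_tokens, "coffee")
book = context_vector(sample_tokens, "book")

sim_tea_coffee = cosine(tea, coffee)
sim_tea_book = cosine(tea, book)
print(sim_tea_coffee, sim_tea_book)
```

In this toy text, 'tea' and 'coffee' share most of their neighbours and so score as similar, while 'tea' and 'book' share none. Whether such similarity amounts to similarity of meaning is exactly the kind of premise the paragraph above describes as a framework rather than an a priori fact.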