LDNA organised two panels at the 2016 Digital Humanities Congress (DHC; Sheffield, 8th–10th September). Both focused on text analytics, with the first adopting the theme ‘Between numbers and words’, and the second ‘Identifying complex meanings in historical texts’. Fraser reports:
Conference reflections jointly written with Justyna Robinson
Four members of the LDNA team—Marc Alexander, Justyna Robinson, Brian Aitken, and Fraser Dallachy—attended this year’s Digital Humanities (DH) conference in Kraków, Poland. With over 800 attendees, the conference is an excellent opportunity to exchange ideas, learn of new areas of potential interest, and network with academics from around the world. The team presented a version of the project’s poster at the event (attached to this post), giving an overview of the project, the technical steps which have been taken so far, and introducing the research themes.
Digital methods of textual analysis are an important subject for the DH attendees, and there were several papers outlining approaches and results from such research. One of the most relevant of these for us was the paper by Glenn Roe et al. on identification of re-used text in Eighteenth Century Collections Online (ECCO). After eliminating re-printings of texts, this project used a specially developed tool which found repeated passages, indicating where an author had re-used their own or another’s words. The results are available and searchable on their website. In the same session, a team led by Monica Berti at Leipzig described a method of identifying and labelling fragments of text quoted from ancient Greek authors. These projects represent something like a parallel research track to ours, tracing the history of ideas through replication of passages rather than through more abstract word clusters. Early English Books Online (EEBO) also received some attention, with Daniel James Powell giving an overview of its history and importance to digital research on historical texts.
Discussion with other attendees at the poster session was especially productive, and resulted in several strong leads for the team to follow up. A subject which was mentioned to us repeatedly was that of topic modelling. Multiple panels were dedicated to the use of these methods to extract information about the contents of texts, an approach which LDNA has considered employing. The team at Saarland studying the Royal Society Corpus (with whom LDNA is already in contact) use topic modelling to study the development of scientific concepts and terminology. Their results were encouraging, allowing them to identify word groupings which represent scientific disciplines such as physiology, mechanical engineering, and metallurgy. Following these topics through time showed that the number of topics increases whilst their vocabulary becomes more specialised. Although LDNA has reservations about how useful topic modelling is for our purposes, the work being conducted at Saarland refines and implements its methodology in a way which we would seek to learn from if we do choose to pursue it further.
Visualising big data is of central interest to the LDNA project, especially in the context of the upcoming LDNA Visualisation Workshop. With this in mind, we paid particular attention to projects that presented new and interesting ways of seeing large data. A number of presentations focused on network visualisations. These often link metadata, for example reconstructing the social networks of royal societies or academies from letter correspondence. An interesting visualisation of unstructured linguistic data was presented by the EPFL team: Vincent Buntinx, Cyril Bornet, and Frédéric Kaplan visualised lexical usage in 200 years of newspapers on a circle, with the radial dimension representing the number of years a word has been in use, and the circumferential dimension showing the period during which it was used.
Stylometrics, with its interest in being able to identify and measure aspects of language which contribute to the impression of authorial style, produced some interesting papers. One of the common themes for stylometrics and other DH strands of research is the way concepts are operationalised. The varied approaches to concepts taken by DH researchers were noticeable, for example, whether each noun can be considered to be a concept, or a concept can be defined as “a functional thing”. This suggests that the work on concept identification undertaken by the LDNA team will be of interest to the wider DH community. Also amongst the stylometric papers was a look at historical language change by Maciej Eder and Rafal Górski which used bootstrap consensus network analysis on part of speech (POS) tagged texts to contrast syntax and sentence structure between time periods. The paper used multidimensional scaling (MDS) to reduce POS tagged texts to a single value which could then be plotted against time, allowing them to show that a gradual change in the MDS results can be discerned between the earliest and latest texts. The paper both highlighted how useful a visualisation can be for identifying a change, and how difficult it can be to quantify exactly what the visualisation shows.
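The MDS step described above can be sketched in outline: each text becomes a vector of part-of-speech frequencies, and multidimensional scaling collapses those vectors to a single coordinate that can then be plotted against date. The POS counts below are invented for illustration; this is a sketch of the general technique, not Eder and Górski's actual pipeline.

```python
# Sketch of reducing POS-tagged texts to one plottable value via MDS.
# Rows are texts ordered by date; columns are invented counts of
# NOUN, VERB, ADJ, PRON tags.
import numpy as np
from sklearn.manifold import MDS

pos_counts = np.array([
    [120, 90, 30, 60],   # earliest text
    [125, 85, 35, 55],
    [140, 80, 45, 40],
    [150, 75, 50, 30],   # latest text
], dtype=float)

# normalise to relative frequencies so text length does not dominate
profiles = pos_counts / pos_counts.sum(axis=1, keepdims=True)

mds = MDS(n_components=1, random_state=0, dissimilarity="euclidean")
coords = mds.fit_transform(profiles).ravel()
print(coords)  # one coordinate per text, plottable against date
```

A gradual drift in these coordinates from earliest to latest texts is the kind of signal the paper's visualisation made visible, and, as noted above, the hard part is saying precisely what such a drift means linguistically.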
However, on a different but very important note, a strong theme of the conference was that of diversity, with a thread of panels discussing the different ways in which this subject is applicable to the digital humanities. From a personal point of view, I think LDNA has a strong awareness of both the scope and the limitations of our interests and approaches (although we can never afford to be complacent). We’ve considered what our textual resources represent, and the RAs are soon to explore this subject from different angles in future blog posts. EEBO and other text collections are more expansive, inclusive, and diverse than prior research has been able to access, and this feels like a part of an enormously positive movement in academia to open up more and more data for new kinds of study. As extensive as our resources are, however, they still have limitations reflecting the (mostly Western, mostly white, mostly male, mostly middle-to-upper class) societal groups who were able to read, write, and print the words which ended up in these collections. The resources open to academia are continually growing, and hopefully this expanding diversity will open up ever more of the world’s knowledge to ever more of its population. Whilst the discussions at this conference have made clear that there is a long way to go in fully embracing diversity in the digital humanities, there are indications that the situation is improving, and it is incumbent upon us all to ensure that this continues.
For another view of the conference, Brian Aitken, Digital Humanities Research Officer at Glasgow, has written about his own experience on his blog.
Earlier in the year (2016), we issued a special call for papers, inviting others to join LDNA panel sessions at the Sheffield Digital Humanities Congress. We were delighted by the responses, and further delighted that the full DHC programme includes plenty of other material relevant to our text analytics interests – and a noticeable body of book-historical input too.
As a special privilege for those who follow the LDNA blog, here are two bonus abstracts outlining our conception of each LDNA panel:
TA 1: Between numbers and words
Session 4, Friday 9 September
ft. Hine, Shute, Siirtola et al.
Digitisation of texts facilitates kinds of statistical analysis that were previously difficult and perhaps impossible for humans to carry out. This series of papers explores the interface between statistics and close reading, teasing out how these modes of textual analysis can be applied jointly to explore and analyse the material, lexical and semantic form of constituent texts. We discuss the use of quantitative analysis to reassess hypotheses about the work of compositors in fifteenth-century printing. We scrutinise a blueprint for moving between statistical data and words-in-context within collections too big for human reading (with special attention to concept formation). Lastly, we demonstrate how one newly-enhanced visualisation tool assists exploratory analysis to generate insights about genre and social variables in digital text collections including early modern correspondence and international Englishes.
TA 2: Identifying complex meanings in historical texts
Session 7, Friday 9 September
ft. Mehl, Recchia, Makela, et al.
With recent advances in computational tools and techniques, researchers are moving closer to the goal of identifying and describing complex meanings—semantic, discursive, social, and otherwise—in historical texts. This session approaches that goal from multiple angles. We discuss semantic meaning in terms of distributional semantic techniques, which connect the study of meaning in the humanities with the quantitative study of language in computational linguistics. We discuss discursive meaning via topic modelling techniques, and also explore the theoretical space between distributional semantics and topic modelling. Finally, we discuss social and historical meanings by looking at possibilities for analysing extra-linguistic contexts alongside linguistic data, within carefully annotated, structured data sets.
If that has whetted your appetite, you will find full abstracts for each paper – and for every paper in the Congress – on the main DHC site.
In 2016, Dr Kris Heylen (KU Leuven) spent a week in Sheffield as a HRI Visiting Fellow, demonstrating techniques for studying change in “lexical concepts” and encouraging the Linguistic DNA team to articulate the distinctive features of the “discursive concept”.
Earlier this month, the Linguistic DNA project hosted Dr Kris Heylen of KU Leuven as a visiting fellow (funded by the HRI Visiting European Fellow scheme). Kris is a member of the Quantitative Lexicology and Variational Linguistics (QLVL) research group at KU Leuven, which has conducted unique research into the significance of how words cooccur across different ‘windows’ of text (reported by Seth in an earlier blogpost). Within his role, Kris has had a particular focus on the value of visualisation as a means to explore cooccurrence data and it was this expertise from which the Linguistic DNA project wished to learn.
Kris and his colleagues have worked extensively on how concepts are expressed in language, with case studies in both Dutch and English, drawing on data from the 1990s and 2000s. This approach is broadly sympathetic to our work in Linguistic DNA, though we take an interest in a higher level of conceptual manifestation (“discursive concepts”), whereas the Leuven team are interested in so-called “lexical concepts”.
In an open lecture on Tracking Conceptual Change, Kris gave two examples of how the Leuven techniques (under the umbrella of “distributional semantics”) can be applied to show variation in language use, according to context (e.g. types of newspaper) and over time. The first case study explored the notion of a ‘person with an immigration background’, looking at how this was expressed in highbrow and lowbrow Dutch-language newspapers in the period from 1999 to 2005. The investigation began with the word allochtoon, and identified (through vector analysis) migrant as its nearest synonym in use. Querying the newspaper data across time exposed the seasonality of media discourse about immigration (high in spring and autumn, low during parliamentary breaks and holidays). It was also possible to document a decrease in the ‘market share’ of allochtoon compared with migrant, and – using hierarchical cluster analysis – to show how each term was distributed across different areas of discourse (comparing discussion of legal and labour-market issues, for example). A second comparison examined adjectives of ‘positive evaluation’, using the Corpus of Historical American English (COHA, 1860–present). Organising each year’s data as a scatter plot in semantic space, the path of an adjective could be traced in relation to others, moving closer to or further from similar words. The path of terrific from ‘frightening’ to ‘great’ provided a vivid example of change through the 1950s and 1960s.
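The nearest-synonym step in that first case study can be illustrated in miniature: represent each word by its co-occurrence counts and rank neighbours by cosine similarity. The vectors below are invented for illustration; the Leuven models are of course trained on large newspaper corpora, not four hand-picked context dimensions.

```python
# Toy illustration of finding a word's nearest synonym in use via
# cosine similarity over co-occurrence vectors (invented counts).
import numpy as np

vectors = {
    # co-occurrence counts with context words: (policy, work, border, art)
    "allochtoon": np.array([8.0, 5.0, 9.0, 1.0]),
    "migrant":    np.array([7.0, 6.0, 8.0, 0.0]),
    "painter":    np.array([0.0, 3.0, 0.0, 9.0]),
}

def cosine(u, v):
    """Cosine similarity: 1.0 for identical directions, ~0 for unrelated."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target = "allochtoon"
neighbours = sorted(
    ((cosine(vectors[target], v), w) for w, v in vectors.items() if w != target),
    reverse=True,
)
print(neighbours[0][1])  # the nearest neighbour of 'allochtoon'
```

On these toy counts the method picks out migrant rather than painter, mirroring the logic of the vector analysis described in the lecture.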
During his visit, Kris explored some of the first outputs from the Linguistic DNA processor: material printed in the British Isles (or in English) in two years, 1649 and 1699, transcribed for the Text Creation Partnership, and further processed with the MorphAdorner tool developed by Martin Mueller and Philip Burns at Northwestern. Having run this through additional processes developed at Leuven, Kris led a workshop for Sheffield postgraduate and early career researchers and members of the LDNA team in which we learned different techniques for visualising the distribution of heretics and schismatics in the seventeenth century.
The lecture audience and workshop participants were drawn from fields including English Literature, History, Computer Science, East Asian Studies, and the School of Languages and Cultures. Prompted partly by the distribution of the Linguistic DNA team (located in Sussex and Glasgow as well as Sheffield), both lecture and workshop were livestreamed over the internet, extending our audiences to Birmingham, Bradford, and Cambridge. We’re exceedingly grateful for the technical support that made this possible.
Time was also set aside to discuss the potential for future collaboration with Kris and others at Leuven, including participation of the QLVL team in LDNA’s next methodological workshop (University of Sussex, September 2016) and other opportunities to build on our complementary fields of expertise.
In February 2016, Linguistic DNA hosted Dr Kris Heylen as an HRI Visiting Fellow, strengthening our links with KU Leuven’s Quantitative Lexicology and Variational Linguistics research group. This post outlines the scheduled public events.
Kris is a researcher based in KU Leuven’s Quantitative Lexicology and Variational Linguistics research group. His research focuses on the statistical modelling of lexical semantics and lexical variation, and more specifically the introduction of distributional semantic models into lexicological research. Alongside his fundamental research on lexical semantics, he also has a strong interest in exploring the use of quantitative, corpus-based methods in applied linguistic research, with projects in legal translation, vocabulary learning and medical terminology.
During his stay in Sheffield, Kris will be working alongside the Linguistic DNA team, playing with some of our data, and sharing his experience of visualizing semantic change across time, as well as talking about future research collaborations with others on campus. There will be several opportunities for others to meet with Kris and hear about his work, including a lecture and workshop (details below). Both events are free to attend.
Lecture: 3 March
On Thursday 3rd March at 5pm, Kris will give an open lecture entitled:
Tracking Conceptual Change:
A Visualization of Diachronic Distributional Semantics
ABSTRACT (Kris writes):
In this talk, I will present an overview of statistical and corpus-based studies of lexical variation and semantic change, carried out at the research group Quantitative Lexicology and Variational Linguistics (QLVL) in recent years. As a starting point, I’ll take the framework developed in Geeraerts et al. (1994) to describe the interaction between concepts’ variable lexical expression (onomasiology) and lexemes’ variable meaning (semasiology). Next, I will discuss how we adapted distributional semantic models, as originally developed in computational linguistics (see Turney and Pantel 2010 for an overview), to the linguistic analysis of lexical variation and change.
With two case studies, one on the concept of immigrant in Dutch and one on positive evaluative adjectives in English (great, superb, terrific, etc.), I’ll illustrate how we have used visualisation techniques to investigate diachronic change in both the construal and the lexical expression of concepts.
All are welcome to attend this guest lecture which takes place at the Humanities Research Institute (34 Gell Street). It is also possible to come for dinner after the lecture, though places may be limited and those interested are asked to get in touch with Linguistic DNA beforehand (by Tuesday 1st March).
Workshop: 7 March
On Monday 7th March, Kris will run an open workshop on visualizing language, sharing his own experiments with Linguistic DNA data. Participation is open to students and staff, but numbers are limited and advance registration is required. To find out more, please email Linguistic DNA (deadline: 4pm, Friday 4th March). Those at the University of Sheffield can reserve a place at the workshop using Doodle Poll.
Anyone who would like the opportunity to meet with Kris to discuss research collaborations should get in touch with him via Linguistic DNA as soon as possible so that arrangements can be made.
The supra-lexical approach to the process of concept recognition that I’ve described depends upon an encyclopaedic perspective on semantics (cf. Geeraerts, 2010: 222–3). This is fitting as ‘encyclopaedic semantics is an implicit precursor to or foundation of most distributional semantics or collocation studies’ (Mehl, p.c.). However, such studies do not typically pause to model or theorise before conducting analysis of concepts and semantics as expressed lexically. In other words, semasiological (and onomasiological) studies work on the premise of ready-made or at least ready-lexicalised concepts, and proceed from there. This means that although they depend upon the prior application of encyclopaedic semantics, they themselves do not need to model or theorise this semantics because it belongs to the cultural messiness that yields the lexical expressions that they then proceed to analyse.
For LDNA, concepts are not discrete or componential lexical semantic meanings; neither are they abstract or ideal. Instead, they consist of associations of lexical/phrasal/constructional semantic and pragmatic meanings in use.
This encyclopaedic perspective suggests the following operationalisation of a concept for LDNA:
- Concepts resemble encyclopaedic meanings (which are temporally and culturally situated chunks of knowledge about the world expressed in a distributed way) rather than discrete or componential meanings. [This coincides with non-modular theories of mind, which adopt a psychological approach to concepts.]
- Concepts can be expressed in texts by (typically a combination of) words, phrases, constructions, or even by implicatures or invited inferences (and possibly by textual absences).
- Concepts are traceable in texts primarily via significant syntagmatic (associative) relations (of words/phrases/constructions/meanings) and secondarily via significant paradigmatic (alternate) relations (of words/phrases/constructions/meanings).
- A concept in a given historical moment might not be encapsulated in any observed word, phrase, or construction, but might instead only be observable via a complete set of words, phrases, or constructions in syntagmatic or paradigmatic relation to each other.
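One hedged way to operationalise the ‘significant syntagmatic relations’ in the list above is pointwise mutual information (PMI) over co-occurrence within a window. The corpus, word pairs, and window size below are illustrative only, not the LDNA processor’s actual method.

```python
# Sketch: score syntagmatic association between words by PMI over
# window co-occurrence. Sentences are invented; a real run would use
# a large corpus and a significance threshold.
import math
from collections import Counter

sentences = [
    "the heretics were cast out of the church",
    "the schismatics broke away from the church",
    "the harvest was gathered before the frost",
]

window = 10  # wide enough to span these short sentences
word_freq = Counter()
pair_freq = Counter()
for s in sentences:
    tokens = s.split()
    word_freq.update(tokens)
    for i, w in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            pair_freq[tuple(sorted((w, v)))] += 1

n = sum(word_freq.values())

def pmi(w, v):
    """PMI: how much more often w and v co-occur than chance predicts."""
    joint = pair_freq[tuple(sorted((w, v)))] / n
    return math.log2(joint / ((word_freq[w] / n) * (word_freq[v] / n)))

print(round(pmi("heretics", "church"), 2))
```

Paradigmatic (alternate) relations would then be inferred at a second stage, e.g. by noticing that heretics and schismatics share similar syntagmatic profiles.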
It is worth noting, however, that concept recognition is particularly difficult for the automatic processes built into LDNA because it ordinarily depends upon the cultural literacy possessed by a reader. This is a quality which we cannot incorporate as a process, but which we can take into account by testing distant reading against close reading.
As well as being encyclopaedic, our approach is also experiential, in that the conceptual structure of early modern discourse reflects the way early modern people experienced the world around them. That discourse presents a particular subjective view of the world, with a hierarchical network of preferences that emerges as a network of concepts in discourse. In this way we also assume that concept organisation is perspectival in nature.
Concluding remarks: Testing and tracking conceptual change across time and style
All being well, if we succeed in visualising the results of an iterative and developing set of procedures to inspect the data from these large corpora, we hope to be able to discern and locate the emergence of concepts in the universe of early modern English print. A number of questions arise about where and how these will show up.
For instance, following our hypothesis, will we see the cementation of a concept in the persistent co-occurrence in particular contexts of candidate conjuncts (both binomials and alternates), bigrams, and ultimately, ‘keywords’? (e.g. ‘man of business’ → ‘businessman’ in late Modern English newspapers)
And, as part of the notion of context, it is worth considering the role of discourse genre in the emergence of a concept and in conceptual change. For instance, if it is the case that a concept emerges, not as a keyword, but in the form of an association of expressions that functions as a loose paraphrase, is this kind of process more likely to occur in a specific discourse genre than in general discourse? In other words, is it possible that technical or specialist discourses will be the locus of new concepts, concepts which might diffuse gradually into public and more general ones? (e.g. dogma, law, science → newspapers, narrative, etc.)
What we hope to do is to make our approach manifest and our results visual. For instance, the emergence of a concept might be envisaged as clusters of texts rising up on a terrain representing a certain feature. We should also bear in mind that features might not simply change gradually over time, rising and falling across the terrain; there might instead be islands of certain features that appear in distant time periods and disparate genres and sub-genres. All of that can be identified by the computer, but we have to make sense of it as close readers afterwards.
Geeraerts, Dirk. 2010. Theories of Lexical Semantics. Oxford: OUP.
This blog post features the second of three extracts from Susan Fitzmaurice’s paper on “Concepts and Conceptual Change in Linguistic DNA”. (See previous post.)
Before tackling the problem of actually defining the content of a concept ‘from below’, we need to imagine ourselves into the position of being able to recognize the emergence of material that is a candidate for being considered a concept. Let’s briefly consider the question of ‘when is a concept’; in other words, how will we recognize something that is relevant, resonant and important in historical, cultural and political terms for our periods of interest?
In a manner that is not trivial, we want our research process to perform the discovery work of an innocent reader, a reader who approaches a universe of discourse without an agenda, but with a will to discover what the text yields up as worthy of notice. This innocent reader is of course an ideal reader; as humans are pattern finders, pattern matchers and meaning makers, it is virtually impossible to imagine a process that is truly ab initio. A situation in which the reader is not primed to notice specific features, characteristics or meanings by the cotext or broader context is rare indeed.
The aim is for our processes to imitate the intuitive, intelligent scanning that human readers perform as they survey the universe of discourse in which they are interested (literary and historical documents). We assume that readers gradually begin to notice patterns, perhaps prominent combinations or associations, patterns that appear in juxtaposition in some places and in connection in others (Divjak & Gries, 2012). The key process is noticing in the text the formation of ideas that gather cohesion and content in linguistic expression. We hypothesize that in the process of noticing, the reader begins to attribute increasing weight to the meanings they locate in the text. One model for this hypothesis is the experience of the foreign-language learner who reads a text with her attention drawn to the expressions she recognises and can construe.
The principal problem posed by our project is therefore to extract from the discourse stuff that we might be able to discern as potential concepts. In other words, we aim to identify a concept from the discourse inwards by inspecting the language instead of defining a concept from its content outward (i.e. starting with a term and discerning its meaning). If we move from the discourse inwards, the meanings that we attribute weight to may be implicit and distributed across a stretch of text, in a text window.
That is, the meanings we notice as relevant might not be encapsulated in individual lexical items or character strings within a simple syntactic frame. This recognition requires that we resist the temptation to treat a word or a character string as coterminous with a concept. Indeed, the more we associate relevance with, say, the frequency of a particular word or character string in a sub-corpus, the less likely we are to be able to look beyond the word as an index of a concept. To remain open and receptive in the process of candidate concept recognition, we need to expand the range of the things we inspect on the one hand and the scope of the context we read on the other.
The linguistic material that will be relevant to the identification of a concept will consist of a combination or set of expressions in association that occur in a concentrated fashion in a stretch of text. Importantly, this material may consist of lexical items, phrases, sentences, and may be conveyed metaphorically as well as literally, and likely pragmatically (by implicature and invited inference) as well as semantically. If the linguistic elaboration (definition, paraphrase, implication) of a concept precedes the lexicalization of a concept, it is reasonable to assume that the appearance of regularly and frequently occurring expressions in degrees of proximity within a window will aid the identification of a concept.
The scope of the context in which a concept appears is likely to be greater than the phrase or sentence that is the context for the keyword that we customarily consider in collocation studies. This context is akin to the modern notion of the paragraph, or, the unit of discourse which conventionally treats a topic or subject with the commentary that makes up the content of the paragraph. The stretch of text relevant for the identification of conceptual material may thus amount to a paragraph, a page, or a short text.
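A minimal sketch of this paragraph-scope reading: rather than collecting a keyword’s sentence-level collocates, record which expressions recur together within paragraph-sized windows. The paragraphs and expression list below are invented; this illustrates the idea, not the project’s implementation.

```python
# Sketch: paragraph-scope co-occurrence. Pairs of expressions that
# recur together across multiple paragraphs are candidate conceptual
# associations, without privileging any single keyword.
from collections import Counter
from itertools import combinations

paragraphs = [
    "the king was deposed and the people debated how power should be held",
    "power they said belongs to the people not to any king or crown",
    "the harvest failed and grain was scarce in the northern counties",
]

expressions = {"king", "people", "power", "crown", "harvest", "grain"}

cooc = Counter()
for p in paragraphs:
    present = sorted(expressions & set(p.split()))
    for pair in combinations(present, 2):
        cooc[pair] += 1

# pairs recurring across paragraphs are candidate conceptual associations
recurring = [pair for pair, c in cooc.items() if c >= 2]
print(recurring)
```

Scaling the window up from sentence to paragraph (or page, or short text) is exactly the widening of context that the passage above argues is needed for conceptual material to become visible.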
The linguistic structure of a concept has been shown to be built both paradigmatically (via synonymy) and syntagmatically (via lexical associations, syntax, paraphrase). For our purposes, given that the task entails picking up clues to the construction of concepts from the linguistic material in the context, where ‘context’ is defined quite broadly, syntagmatic relations such as paraphrase, vagueness and association are more likely to be salient than paradigmatic relations, and certainly more so than predictable relations like antonymy and polysemy.
See the final post in this Manifesto series.
Divjak, Dagmar & Gries, Stefan Th. (eds). 2012. Frequency effects in language learning and processing (Vol. 1). Berlin: De Gruyter.
As those who follow our Twitter account will know, Linguistic DNA’s principal investigator, Susan Fitzmaurice, was among the invited speakers at the recent symposium on Digital Humanities & Conceptual Change (organised by Mikko Tolonen, at the University of Helsinki). It was an opportunity to set out the distinctive approach being taken by our project and the theoretical understanding of concepts that underpins it. What follows is the first of three blog posts based on extracts from the paper, aka the Linguistic DNA ‘manifesto’. Susan writes:
Linguistic DNA’s goal is to understand the ways in which the concepts (or paradigmatic terms) that define modernity emerge in the universe of Early Modern discourse. The methodology we are committed to developing, testing and using, i.e. the bottom-up querying of a universe of printed discourse in English, demands that we take a fresh look at the notion of a concept and its content. So how will we operationalise a concept, and how will we recognise a concept in the data?
Defining the content of a concept from above
Historians and semanticists alike tend to start by identifying a set of key concepts and pursue their investigation by using a paradigmatic approach. For semanticists, this entails identifying a ‘concept’ in onomasiological terms as a bundle of (near-)synonyms that refer to aspects of the semantic space occupied by a concept in order to chart conceptual change in different periods and variation in different lects.
Historians, too, have identified key concepts through keywords or paradigmatic terms, which they then explore through historiography and the inspection of historical documents, seeking the evidence that underpins the emergence of particular terms and the forces and circumstances in which these change (Reinhart Koselleck’s Begriffsgeschichte or Quentin Skinner’s competing discourses). Semanticists and historians alike tend to approach concepts in a primarily semasiological way, for example, Anna Wierzbicka (2010) focuses on the history of evidence, and Naomi Tadmor (1996) uses ‘kin’ as a starting point for exploring concepts based on the meanings of particular words.
Philosophers of science, who are interested in the nature of conceptual change as driven or motivated by scientific inquiry and technological advances, may see concepts and conceptual change differently. For example, Ingo Brigandt (2010) argues that a scientific concept consists of a definition, its ‘inferential role’ or ‘reference potential’ and the epistemic goal pursued by the term’s use, in order to account for the rationality of semantic change in a concept. So the change in the meaning of ‘gene’ – from the classical gene of the 1910s and 1920s, concerned with inheritance, to the molecular gene of the 1960s and 1970s, concerned with characteristics – can be shown to be motivated by the changing nature of the explanatory task required of the term ‘gene’. In such a case, the goal is to explain the way in which the scientific task changes the meaning associated with the terms, rather than exploring the change itself. Thus Brigandt tries to make it explicit that
‘apart from a changing meaning (inferential role) [the concept also has] an epistemic goal which is tied to a concept’s use and which is the property setting the standards for which changes in meaning are rational’ (2010: 24).
His understanding of the pragmatics-driven structure of a concept is a useful basis for the construction of conceptual change as involving polysemy through the processes of invited inference and conversational implicature (cf. Traugott & Dasher, 2002; Fitzmaurice, 2015).
In text-mining and information retrieval work in biomedical language processing, as reported in Genome Biology, concept recognition is used to extract information about gene names from the literature. William Baumgartner et al. (2008) argue that
‘Concepts differ from character strings in that they are grounded in well-defined knowledge resources. Concept recognition provides the key piece of information missing from a string of text—an unambiguous semantic representation of what the characters denote’ (2008: S4).
Admittedly, this is a very narrow definition, but given the range of different forms and expressions that a gene or protein might have in the text, the notion of concept recognition needs to go well beyond the character string and ‘identification of mentions in text’. So they developed ‘mention regularization’ procedures and disambiguation techniques as a basis for concept recognition involving ‘the more complex task of identifying and extracting protein interaction relations’ (Baumgartner et al. 2008: S7-15).
In LDNA, we are interested in investigating what people (in particular periods) would have considered to be emerging and important cultural and political concepts in their own time by exploring their texts. This task involves, not identifying a set of concepts in advance and mining the literature of the period to ascertain the impact made by those concepts, but querying the literature to see what emerges as important. Therefore, our approach is neither semasiological, whereby we track the progress and historical fortunes of a particular term, such as marriage, democracy or evidence, nor is it onomasiological, whereby we inspect the paradigmatic content of a more abstract, yet given, notion such as TRUTH or POLITY, etc. We have to take a further step back, to consider the kind of analysis that precedes the implementation of either a semasiological or an onomasiological study of the lexical material we might construct as a concept (e.g. as indicated by a keyword).
See the next post in this Manifesto series.