
Of concepts and kings: curating a collection using EEBO-TCP

In Spring 2018, MA student Sophie Whittle dedicated 100 hours of hard graft to filling blanks in the Terms metadata. This work follows on from Winnie Smith’s work last year, identifying gaps in the Text Creation Partnership’s metadata. Sophie reflects on her experience and offers some tentative analysis focused on texts from the English Civil War and Commonwealth era:

As an MA linguistics student, I was elated to see an opportunity to apply for Linguistic DNA as part of the School of English’s work placement module. I am looking to apply for a PhD in the future, and I saw the project as a chance to undertake independent research. After completing a corpus-based dissertation about the semantic-syntactic development of the verb promise, I was looking forward to exploring historical texts in a different way—through the use of EEBO-TCP and LDNA computational methods.

After undertaking three years of linguistic study, I realised I had not delved into much historical work as my interests were very much theoretical at the time. Working on the placement has allowed me to regain historical interests which I had left back at GCSE and A-Level. During the summer prior to Masters study, I was in conversation about how King Charles I, who had intended to dine with the Governor of Hull (my home city), was stopped at Beverley Gate by Parliament. As the Governor had expressed allegiance to the Parliamentarians, he was named a traitor, but Charles was forced to return to York. This defining moment, not only for Hull but for Civil War history, piqued my interest. I was able to reflect on the conversation I’d had over summer whilst contributing to the inputting of empty metadata terms for LDNA and viewing documents of news and conflict from the Civil War. I wanted to understand the general public’s attitudes to the different sides of the Civil War in England by using LDNA’s conceptual modelling methods.

Towards the end of my placement, I wrote a proposal for a Civil War and Commonwealth collection to be included on the final LDNA interface. The large number of texts from this period (a search via ProQuest’s EEBO interface, eebo.chadwyck.com, brings back 35,008 records) is of interest to researchers from linguistics, literature, history, theology, etc. Modelling concepts from and across the period, by determining frequently co-occurring words as pairs or trios, makes it clear that there is a wealth of information to research. My proposal employed the following term definitions:

Title contains: king
Terms contains: civil war, or commonwealth
Date range: 1642 to 1660

By defining these parameters, the idea was that a user could analyse attitudes towards different sides in the Civil War, from both Royalist and Parliamentarian perspectives.

While designing my collection, I also wanted to see if I could mimic LDNA’s use of ‘windows’. I came across a four-text sample about the preserving of peace during the Civil War, with each text written from a different viewpoint. For instance, one of the texts discusses the New Model Army, a dissenting faction of the Parliamentarian side, as the ‘obstructors’ to peace in the kingdom (A25836). The army was largely independent and held radical Puritan views during the Civil War. By contrast, a different text suggests peace was only possible if King Charles I prospered, rejecting Parliamentarianism (A25857). I used the node words ‘King’ and ‘Parliament’ as a starting point and counted ten words either side of the node words (W20). I then calculated the frequency of the words with the most tokens, and came up with the following table (the frequency is displayed as a percentage):

Table showing cooccurrences with the nouns king and parliament across sampled texts.

Sampled cooccurrences with ‘king’ and ‘parliament’ for a window of 20 words.
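
The window count described above can be sketched in a few lines of Python. This is an illustration only, not the method used on the placement (which was done by hand): the function names, the toy sentence, and the choice to express each percentage as a share of all words found in the windows are assumptions.

```python
from collections import Counter


def window_cooccurrences(tokens, node, span=10):
    """Count the words found within `span` tokens either side of each `node` hit."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` words before the node
        right = tokens[i + 1:i + 1 + span]     # up to `span` words after the node
        counts.update(left + right)
    return counts


def as_percentages(counts):
    """Express each count as a share of all words seen in the windows (assumed denominator)."""
    total = sum(counts.values())
    return {word: 100 * c / total for word, c in counts.items()} if total else {}


# Invented toy sentence (not taken from the sampled texts), lower-cased and split on spaces.
tokens = "the king desires peace and the parliament must vote for peace with the king".split()

# span=3 keeps the toy example readable; span=10 would correspond to a 20-word window.
king = window_cooccurrences(tokens, "king", span=3)
parliament = window_cooccurrences(tokens, "parliament", span=3)
```

With real texts, lemmatising the tokens first would bring the counts closer to the lemma-based figures discussed in the post.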

The percentages are not particularly high. This may be due to the size of the window, and increasing it might help. However, by identifying the ten most frequently co-occurring lemmas with the node words, a number of interesting results appear. For instance, peace co-occurs more frequently alongside King than Parliament, suggesting something about the bias of the texts. Further close-reading might indicate why peace was continuously associated with Royalism (or not, which emphasises the importance of close-reading!).

Additionally, the reason for vote co-occurring alongside Parliament might seem obvious at first. Yet analysing this within its context provides a different story. Most of the co-occurrences appear in the text slandering the New Model Army. In this text, the author believes that Parliament are voting in response to pressures from the NMA, and are therefore deprived of their privileges as a democratic union. A list of evidence is provided by the author to explain how Parliament have been stripped of their privileges (during the ‘tumult of the Apprentices’, when apprentices were freed from their masters and asked to join the NMA in ‘a state of confusion’). In the author’s view, the NMA forced Parliament to go against their morals by undoing their previous work.

I also gained access to trio output data from Susan and her work on Newsbooks, to see if I could find something similar. The top concepts for ‘King’ and ‘Parliament’ are ‘king – lord – parliament’ and ‘parliament – state – council’ respectively, which are expected from the genre of texts. By looking at slightly less frequent trios, there are some more intriguing items. The following trios complement the data I had found manually from the four-text sample: ‘king – people – liberty’ with a PMI of 3.09 and ‘parliament – present – authority’ with a PMI of 4.09. (Both PMIs show that the observed trios occur more often than expected by chance.) There is so much data to explore here, highlighting that LDNA should be an excellent resource for conducting quantitative study. As shown, it is important to analyse the findings within their contexts to specify how concepts are cemented in history. LDNA promotes the combination of distant reading (using statistical methods) and close reading. It was interesting to imagine the final interface with the addition of literary analysis.

Working with the TCP metadata has allowed me to explore concepts within the Civil War and Commonwealth period and finalise the work placement by writing a collection brief. As a linguistics student, it has been a real challenge to identify texts based on their literary genre. This pushed me out of my comfort zone. Being able to use my semantic skills to pull apart the meaning behind the conceptual findings has helped too. I am very grateful to have been given the opportunity to use my existing skills and gain new ones!

Featured image: Hull City Skyline. From an original photograph by John Bannon. Used under license CC 3.0.

Documenting categories in EEBO-TCP data

As part of a work placement with Linguistic DNA, University of Sheffield MA student Winnie Smith has been examining the metadata that accompanies the Text Creation Partnership transcriptions of Early English Books Online (EEBO-TCP). Released as a CSV file (here), it combines the various codes that are used to identify different formats of EEBO (and related products). It also includes basic information like author, title, and date of publication. The focus of Winnie’s work described here is the “terms” or “subject headings”, which represent earlier attempts at cataloguing the items in the collection. What can these tell us about what’s in our main dataset?

Winnie writes:

One of the main reasons I’d applied to do a placement with LDNA was the opportunity to work with historical text, so I was very happy to be given the chance to analyse EEBO-TCP metadata. Time disappears browsing EEBO (I never knew an agnus dei could be a little wax figurine as well as a mass movement), but I soon realised that analysing it in any depth is a pretty warts and all experience.

The initial target was to select a narrow date range, say 1610-1612, and try to identify broad genre and / or format categories for EEBO documents in that period (e.g. sermons, broadsides, or biographical works). The data sent back from this probe would then help LDNA in its mission to explore the “universe of printed discourse”.

However, getting the data turned out to be more complicated than it first appeared. I decided early that trying to determine semantic categories manually was both unsystematic and impractical. For 1610 alone, for example, there were 49 overtly Christian terms. How should they be grouped together? What about the classifications that might not apply to that year, but arise in other years?

In addition, the fact that the category labels in the metadata originated in the English Short Title Catalogue (ESTC), and had been provided in different ways by different people, meant that there were considerable differences in how texts were catalogued, leading to uncertainty about how they should be classified. There is also the potential for category clashes between modern labels and early modern concepts, such as those of non-Christian religions. For example, did thematic entries which mentioned “Jews” refer to Jewish religious texts, learned works, or anti-Semitic material which within an Early Modern frame might claim to be religious and / or scholarly?

Terms from below

I therefore switched to looking for semantic clusters which might emerge directly from the terms. This approach had definite benefits. By manually splitting and sorting the single “Terms” column (using semicolons as the delimiter), I was able to see categories I would have missed if I had continued to rely on manual semantic classification, e.g. “broadsides”:

Screengrab of an alphabetised table, featuring entries beginning with "Broadside".

Figure 1: “Broadsides” emerging as a category in alphabetically sorted TCP metadata.

I also noticed that there were different kinds of categories present in the terms: far from all groupings being topic-based (e.g. repentance), there were several kinds of classification—in the case of broadsides, by printed format.

However, alphabetical manual sorting was time-consuming, and not guaranteed to unite similar content. For example, the entry

Conspiracies -- Sermons -- Early works to 1800. (A07558)

appeared separately from

Sermons, English -- 17th century (A14860).

I decided to try and address that problem by further splitting, though this was complicated by different cataloguers punctuating in different ways. This made it difficult to split off information that wanted separating without separating information that didn’t. For example, all records use a double hyphen (--) to mark off separate units within a single classification. However, in the following record, a double hyphen appears within a single unit (James I) as well as between separate units, e.g. James I | King of England:

James -- I, -- King of England, 1566-1625. -- Triplici nodo, triplex cuneus -- Early works to 1800 (A20944)

After some pre-processing in Excel to try and make the punctuation and contents of the terms column more consistent (removing full stops; replacing double hyphens in front of regnal numbers; filtering out blanks or replacing them with dummy text), I loaded it into RStudio, full of analytical idealism.
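
The three pre-processing steps listed above were carried out in Excel. Purely as an illustration, they might look like this in Python; the function name and the regular expression are assumptions, and the rule for spotting regnal numbers (a run of I, V or X standing alone) is a guess at one workable heuristic rather than the rule actually applied.

```python
import re

# A double hyphen immediately followed by a stand-alone Roman numeral, e.g. "James -- I,".
# Heuristic only: it would miss numerals using other letters and is untested on the full metadata.
REGNAL = re.compile(r"\s*--\s*(?=[IVX]+\b)")


def preprocess_terms(raw):
    """Tidy one raw Terms cell before it is split into individual terms."""
    if raw is None or not raw.strip():
        return "not supplied"          # dummy text for blank cells
    text = raw.replace(".", "")        # remove full stops
    text = REGNAL.sub(" ", text)       # rejoin a name with its regnal number
    return text.strip()


record = ("James -- I, -- King of England, 1566-1625. -- "
          "Triplici nodo, triplex cuneus -- Early works to 1800")
cleaned = preprocess_terms(record)
```

After this step the remaining double hyphens in `cleaned` separate units only, so "James I" stays together when the string is split further.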

This was where the fun began.

The metadata contains widely differing numbers of term entries for different records. I had gravely underestimated the difficulties this would pose in R, which requires rows and columns to be of equal length. The issue was only compounded by the fact that the required number of new term columns (n) was also unknown. I eventually got round this by finding n first (56, as it turns out), then making all rows have n columns. This isn’t a particularly elegant coding solution, particularly because it uses multiple for-loops (which I know R frowns on) and these were slow for a dataset of over 53,000 rows. There are also issues about essentially counting a construct, X, as a proxy for term entries, where

X = something between one semicolon and another.

Observations

Caveats aside, this procedure enabled some interesting observations. The frequency of individual terms, for example, has a very long tail: 16997 items (63% of the total) occur only once. Sorting the terms alphabetically shows that there is still a great deal of noise, at least at the lower end:

Terms including a long string split by a hyphen (row 3), and the full form of 's-Hertogenbosch with its initial apostrophe.

Figure 2: A snapshot of alphabetically sorted individual “terms” and their frequencies.

This snapshot from the output table (Figure 2) shows that some rows, e.g. 3, have been split incorrectly, despite pre-processing to try and deal with different numbers of hyphens. Other potentially tricky cases are correct (e.g. row 5). For some rows (e.g. 14–15) hidden characters clearly affected the result, and it might be worth adding extra processing steps. (I didn’t, for example, remove brackets or capitalisation.)

However, the process worked and was worth it. Even accepting that treating X as a distinct unit is an approximation for catalogue terms, and that the results are imperfect, being able to see the full number and range of cataloguing terms used in the EEBO-TCP metadata (294 378 total term quasi-tokens representing 26 858 unique quasi-types) frees them up for further independent manipulation.
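
To make the quasi-token and quasi-type counts concrete, here is a small Python sketch that treats X exactly as defined earlier (whatever sits between one semicolon and the next) and tallies tokens, types and once-only items. It is not the R code used on the placement; the function name and the three sample cells are invented for illustration.

```python
from collections import Counter


def term_statistics(term_cells, sep=";"):
    """Split each Terms cell on `sep` and summarise the resulting units."""
    # No trimming or case-folding here, so " England" and "England" stay distinct,
    # which mirrors the kind of noise the post goes on to describe.
    units = [unit for cell in term_cells for unit in cell.split(sep) if unit]
    freq = Counter(units)
    singletons = sum(1 for count in freq.values() if count == 1)
    return {
        "tokens": len(units),            # quasi-tokens: every unit, repeats included
        "types": len(freq),              # quasi-types: distinct units
        "singletons": singletons,        # units that occur exactly once
        "top": freq.most_common(1),
    }


# Invented sample cells, not real EEBO-TCP records.
cells = [
    "England; Sermons; Early works to 1800",
    "Broadsides; England; Early works to 1800",
    "not supplied",
]
stats = term_statistics(cells)
```

In this toy sample "England" is counted as two different types because one copy carries a leading space.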

Snapshot includes: Early works to 1800, History, Great Britain, 17th century, England

Figure 3: Top 10 cataloguing terms in EEBO-TCP metadata (by frequency).

It was also instructive to see which expressions (with X ≈ terms) were most frequent in the dataset. Figure 3 (right) shows the top 10, while Figure 4 (below) visualises the relative distribution of the top 25.

Bar chart showing the quantity of records covered by the 25 most frequent terms.

Figure 4: The 25 most frequent cataloguing terms in EEBO-TCP metadata.

As the bar chart shows, by far the most frequent catalogue term is wholly uninformative: all works in EEBO-TCP are pre-1800, so “early literature to …” only confirms the selection criteria which apply to the dataset.

Perhaps more interesting is that the 6th most frequent item is “not supplied”; this shows how many records lack any kind of cataloguing information: “not supplied” was a dummy entry I inserted to remove blanks in pre-processing.

EDIT (7/7/17): We now realise the 2473 works from ECCO-TCP were inadvertently included in the term processing, inflating the reported blank count (“not supplied”). The true count for EEBO-TCP is 3102 items without terms, or ~6% of the total. Only 23 ECCO items have subject headings, with 2 containing the phrase “Early works to 1800”.
Keen-eyed readers may also have spotted “England” twice in the top terms, an artefact of stray spaces (following term-splitting). Removing this irregularity, we find 20,074 unique term-units; removing commas and parentheses leaves 19,864 distinct units, with the various biblical figures of Figure 2 (lines 14-16) now represented by one common “biblicalfigure”.
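
A minimal Python sketch of this kind of clean-up follows, assuming that trimming stray spaces and dropping commas and parentheses are the only normalisations applied. The exact steps behind the figures above are not documented here, and the function name and sample units are invented.

```python
import re
from collections import Counter


def normalise(unit):
    """Remove commas and parentheses, collapse runs of spaces, and trim the ends."""
    unit = re.sub(r"[(),]", "", unit)
    unit = re.sub(r"\s+", " ", unit)
    return unit.strip()


# Invented units showing the stray-space and punctuation variants.
units = ["England", " England", "Sermons, English", "Sermons English ", "(History)"]

before = Counter(units)
after = Counter(normalise(u) for u in units)
```

Here five apparent types collapse to three once the variants are merged, the same effect (on a much larger scale) as the drop in unique term-units reported above.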

This [nevertheless] shows that messiness extends to the upper as well as lower frequency limits of the dataset, though perhaps in a different sense: at the higher extreme, the issue appears to be less one of dealing with potentially large numbers of small inaccuracies, and more one of high-frequency results which are not all equally useful for semantic analysis.

However, one of the things I have learned on this placement is that no result is unwelcome. Unsurprisingly, EEBO-TCP was never designed for the type of digital text analysis LDNA conducts, meaning LDNA-type processes aren’t always intuitive to apply, but this does not detract from the benefits of working with EEBO, even (especially) if the results remind you to proceed with caution.

On a more personal level, the EEBO-TCP metadata is not the easiest thing to handle as a beginner (I did vainly wish it better, quite a lot), but it is extremely valuable for precisely that reason. Although I didn’t get much further than writing (cracking) the code to extract individual terms from the master dataset and beginning some summary statistics, I learned a vast amount about being pragmatic and patient with your data (and your programming environment) and about documenting your working. I also learned that the pain of struggling to make sense of messy, non-textbook data is in fact an inevitable part of the process I most enjoy: working with historical text which was never written for a machine to breeze through.

As I finish my placement, I hope my results will allow LDNA to continue analysing EEBO-TCP in more depth, and I’m looking forward to continuing the different kinds of pleasure that come from text analysis (from wonder to perseverance) into a PhD.

Note: For those who might wish to replicate Winnie’s work (noting the disclaimer about inelegant code), here is closer documentation of that process:

I used the following procedure:

a) find n

1) looping over each row of the dataset and calculating the number of term entries for each record
2) adding each result to a vector of term lengths (see code extract 1, below)
3) finding the largest number in this vector (see code extract 2, below)

b) create equal numbers of split terms for each record

4) looping back over the dataset, splitting each row, and making up any difference between the number of terms for that record and the maximum number of terms by adding missing values (NAs) so that each row was of equal length.

c) combine all the newly created term columns into a single column, removing the missing values.

Code extract 1 (above). Code extract 2 (below).
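
Steps a) to c) can be sketched as follows. This is an illustrative Python version, not Winnie’s R code: the function name, the use of `None` in place of R’s NA, and the sample cells are all assumptions made for the example.

```python
def rectangularise(cells, sep=";"):
    """Find the widest row, pad every row to that width, then flatten to one column."""
    # a) find n: count the term entries in each record, then take the largest count
    lengths = [len(cell.split(sep)) for cell in cells]       # steps 1 and 2
    n = max(lengths, default=0)                              # step 3

    # b) split each record and pad with missing values so every row has n columns
    rows = []
    for cell in cells:                                       # step 4
        parts = cell.split(sep)
        rows.append(parts + [None] * (n - len(parts)))

    # c) combine all term columns into a single column, dropping the missing values
    flat = [part for row in rows for part in row if part is not None]
    return n, rows, flat


# Invented sample cells standing in for the Terms column.
cells = ["A; B; C", "D", "E; F"]
n, rows, flat = rectangularise(cells)
```

In Python the padding in step b) is not strictly needed, since lists may have different lengths; it is kept here only to mirror the procedure as described.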

Quantity and quality: lessons from an MA work placement

Sheffield MA student Nadia Filippi reflects on her experience after 100 hours with the Linguistic DNA team at DHI | Sheffield:

As part of my MA studies in English Language and Linguistics, I had the opportunity to undertake a work placement of 100 hours at the University of Sheffield’s Digital Humanities Institute. The placement offered a good overview of the typical tasks and responsibilities of a researcher and was an excellent choice for me because I am interested in doing research and I am considering going on to PhD research.

When registering for the placement module, I only had basic knowledge of corpus linguistics. I was accustomed to qualitative research but wanted to discover quantitative methodologies and the possibilities that quantitative research can offer. Starting my placement, I was at a stage in my studies in which I was still looking for definite answers to all my questions about research. Moreover, I respected everything to do with numbers, but the idea of actually ‘doing statistics’ made me nervous. I consciously chose a placement to force myself out of my qualitative comfort zone.

My concerns resolved themselves during the placement. I had to familiarise myself with and use statistical software packages like SPSS and lost my initial fear. I began to understand how statistics could be used effectively to discuss questions and find information that qualitative research could not find in a timely manner, for example, which words frequently co-occur in a large dataset. Furthermore, I came to understand that doing research does not exclusively mean focusing narrowly on finding a clear answer to an initial research question. It is often more about refining the question, developing another one and accepting that there can be more than one right answer to it.

The power of the Digital Humanities Institute lies in quantitative analysis, engaging with statistical distribution, auditing datasets and computational methods. Yet, there is still qualitative work to do. For instance, I audited and reported on qualities of the YouTube dataset, wrote summaries of previous research and searched for suitable approaches or tools (e.g. a Part-Of-Speech tagger suited to social media data), by consulting published research from similar projects.

A YouTube Convert

It turned out that the placement as a whole, the experiences I had and the tasks I was given shaped my other studies. At the beginning of my placement, the Linguistic DNA team had just started providing support for the Militarization 2.0 project, in collaboration with the University of Leeds. I was immediately drawn in by this study of YouTube gaming discussion and it ultimately gave me an idea for my MA dissertation.

I had the chance to look through some of the 6.7m YouTube comments gathered by Nick Robinson and his team at the University of Leeds, and think through how they might be analysed for concept modelling.

Screenshot showing comments on Battlefield 1 official trailer, via YouTube (15 May 2017). https://www.youtube.com/watch?v=c7nRTF2SowQ

In exploring the comments, I had to consider the characteristics of commenters’ language and reflect on the research questions. Gaming language, for example, is filled with specialist abbreviations such as “CoD:ww2”, which stands for the game Call of Duty: WWII. Information about nationalities (“the Germans”) and militarised language (“disabled”, “destroyed”) may also be key to answering questions about how users’ remarks connect with video content. Close reading of excerpts helps to inform how the Sheffield team respond to the main interests of the parent project Militarization 2.0: if and how social media is militarized and what effect that has on our society and on individual citizens.

By attending meetings, I gained insights into the process and decision-making in a big research project. This included, for example:

preparing big data (should we standardise the spelling of the comments or not?)
practical obstacles, such as YouTube’s technical limitations (which prevent us from retrieving all the answers to a specific comment)
deciding which variables to include (time, author, number of likes)
time and scope (how can the resources available be matched to the aims and desired outcomes of a project?)

Knowing the kinds of challenges that such a project can face was helpful in planning my dissertation, which I will be writing over the summer. Prompted by the DHI’s YouTube work, my research will discuss the kind of language generated by exposure to military video game trailers and investigate if there is a difference between the language produced online and offline. In undertaking this research, I will work with my own corpus of YouTube comments as well as with focus groups. The qualitative aspect of my dissertation will allow me to explicitly address and discuss the violence in these game trailers within my focus groups.

Overall, the work placement has been one of the most valuable and enjoyable modules of my MA. I developed many new skills, academically as well as personally. I am more confident about quantitative approaches and numbers, as well as the importance of humanities research as a whole.

Top image shows Sheffield MA student Nadia Filippi at the Linguistic DNA and Militarization 2.0 stand at the 2017 Festival of Arts & Humanities Showcase, Sheffield. The showcase was “a fantastic opportunity to open a dialogue about humanities research and its impact with the public”.

What does EEBO represent? Part I: sixteenth-century English

Ahead of the 2016 Sixteenth Century Conference, Linguistic DNA Research Associate Iona Hine reflected on the limits of what probing EEBO can teach us about sixteenth-century English. This is the first of two posts addressing the common theme “What does EEBO represent?”

The 55 000 transcriptions that form EEBO-TCP are central to LDNA’s endeavour to study concepts and semantic change in early modern English. But do they really represent the “universe of English printed discourse”?

The easy answer is “no”. For several reasons:

As is well documented elsewhere, EEBO is not restricted to English-language texts (cf. e.g. Gadd). Significant bodies of Latin and French documents printed in Britain have been transcribed, and one can browse through a list of other languages identified using ProQuest’s advanced search functionality. To this extent, EEBO represents more than the “universe of English printed discourse”.

But it also represents a limited “universe”. EEBO can only represent what survived to be catalogued. Its full image records represent individual copies. And its transcriptions represent a further subset of the survivals. As the RA currently occupied with reviewing Lost Books (eds. Bruni & Pettegree),* I have a keen awareness of the complex patterns of survival and loss. A prestigious reference work, the must-buy for ambitious libraries, might have a limited print run and yet was almost guaranteed survival–however much it was actively consulted. A popular textbook, priced for individual ownership, would have much higher rates of attrition: dog-eared, out-of-date, disposable. Survival favours genres, and there will be gaps in the English EEBO can represent.

The best function of the “universe” tagline is its emphasis on print. We have limited access to the oral cultures of the past, though as Cathy Shrank’s current project and the Corpus of English Dialogues demonstrate, there are constructions of orality within EEBO. Equally, where correspondence was set in print, correspondence forms a part of EEBO-TCP. There is diversity within EEBO, but it is an artefact that relies on the prior act of printing (and bibliography, microfilm, digitisation, transcription, to be sure). It will never represent what was not printed (and this will mean particular underprivileged Englishes are minimally visible).

There is another dimension of representativeness that matters for LDNA. Drawing on techniques from corpus linguistics makes us aware that in the past, corpora (collections of texts produced in order to control the analysis of language-in-use) were compiled with considerable attention to the sampling and weighting of different text types. Those using them could be confident about what was in there (journalism? speech? novels?). Do we need that kind of familiarity to work confidently with EEBO-TCP? The question is great enough to warrant a separate post!

The points raised so far have focused on the whole of EEBO. There is an additional challenge when we consider how well EEBO can represent the sixteenth century. Of the ca. 55 000 texts in EEBO-TCP, only 4826 (less than 10 per cent) represent works printed between 1500 and 1599. If we operate with a broader definition, the ‘long sixteenth century’ and impose the limits of the Short Title Catalogue, the period 1470-1640 constitutes less than 25 per cent of EEBO-TCP (12 537 works). And some of those will be in Latin and French!

Of course, some sixteenth century items may be long texts–and the bulging document count of the 1640s is down to the transcription of several thousand short pamphlets and tracts–so that the true weighting of long-sixteenth-century-TCP may be more than the document figures indicate. Yet the statistics are sufficient to suggest we proceed with caution. While one could legitimately posit that the universe of English discourse was itself smaller in the sixteenth century–given the presence of Latin as scholarly lingua franca–it is equally the case that the evidence has had longer to go missing.

As a first post on the theme, this only touches the surface of the discussion about representativeness and limits. Other observations jostle for attention. (For example, diachronic analysis of EEBO material is often dependent on metadata that privileges the printing date, though that may be quite different from the date of composition. A sample investigation of translate’s associations immediately uncovered a fourteenth-century bible preface printed in the 1550s, exposed by the recurrence of Middle English forms “shulen” and “hadden”.) Articulating and exploring what EEBO represents is a task of some complexity. Thank goodness we’ve another 20 months to achieve it!

* Read the full Linguistic DNA review here. The e-edition of Bruni & Pettegree’s volume became open access in 2018.

Errors, searchability, and experiments with Thomason’s Newsbooks

Back in 2012, HRI Digital ran a project, with the departments of English, History, and Sociological Studies, looking at participatory search design. The project took as its focus a subset of George Thomason’s 17th-century newsbooks, transcribing every issue of Mercurius Politicus plus the full selection of newsbooks published in 1649 (from the images available through ProQuest’s Early English Books Online). Building the interactive interface, the Newsbooks project focused on how researchers interact with (and want to interact with) such historical texts. Thus, for example, search results may feature texts published at the same point in time. A problem not resolved in the original phase was variant spellings, and the humanities investigators held onto concerns about (in)accuracy in the transcriptions.

The tools tried out for Linguistic DNA have provided a fresh mechanism to improve the Newsbooks’ searchability. Sheffield MA student Amy Jackson recently completed a 100-hour work placement investigating how a MorphAdorned version of the Newsbooks could inform questions about the accuracy of transcriptions, and how a statistically-organised representation of the language data (an early output of LDNA’s processor) affects understanding of the content and context of Thomason’s collection.

Amy reports:

My main task during my placement has been to find errors within the newsbooks, both printing and transcription errors, in order to improve the searchability of the newsbooks. I’ve been using methods such as checking hapax legomena (words that only occur once within a text or collection of texts) and Pointwise Mutual Information (PMI).

Note from the editors: PMI measures word associations by comparing observed cooccurrences with what might be expected in a random wordset (based on the same data).
—Expect a blog post on this soon!
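
As a rough illustration of the idea, the sketch below computes PMI for word pairs in Python. It makes several assumptions that are not taken from the LDNA processor: probabilities are estimated as the share of windows containing a word or pair, the logarithm is base 2, and the four sample windows are invented.

```python
import math
from collections import Counter


def pmi_table(windows):
    """Pointwise Mutual Information for every word pair seen together in a window."""
    n = len(windows)
    word_counts = Counter()
    pair_counts = Counter()
    for window in windows:
        words = sorted(set(window))            # count each word once per window
        word_counts.update(words)
        for i, first in enumerate(words):
            for second in words[i + 1:]:
                pair_counts[(first, second)] += 1
    return {
        (a, b): math.log2((count / n) / ((word_counts[a] / n) * (word_counts[b] / n)))
        for (a, b), count in pair_counts.items()
    }


# Invented windows; "the" appears everywhere, "loyal" and "crown" only together.
windows = [
    {"the", "loyal", "crown"},
    {"the", "loyal", "crown"},
    {"the", "parliament", "vote"},
    {"the", "army"},
]
pmi = pmi_table(windows)
```

In this toy data the pair (crown, loyal) scores 1.0, meaning it is observed twice as often as chance would predict, while (loyal, the) scores 0 because a word that appears in every window carries no information about its neighbours.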

The hurried composition of the newsbooks causes problems for searchability. It seems those printing the newsbooks were less concerned with accuracy than those who were printing books. This can be seen in several examples that I have found while searching through the hapax legomena. For example, on one occasion ‘transmitted’ is printed as ‘trasmitte4’ with a ‘4’ being used as a substitute for a missing ‘d’ (see image above). Elsewhere the number ‘8’ is used as a substitute for a capital ‘S’, printing ‘Sea’ as ‘8ea’. Such printing decisions present a specialised problem for searches because they are unusual. Knowing this characteristic (replacing letters with numbers) means one can look at modifying search rules to improve the ‘success’ in finding relevant information.
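
One way to turn this observation into a search aid is sketched below in Python. It is hypothetical rather than a description of the Newsbooks interface: the function names are invented, and the substitution table contains only the two cases reported above, which would not hold in every context.

```python
import re
from collections import Counter

# A token made of letters and digits that contains at least one of each.
DIGIT_IN_WORD = re.compile(r"^(?=.*[A-Za-z])(?=.*\d)[A-Za-z\d]+$")

# Only the two substitutions mentioned in the post; any wider mapping would be guesswork.
SUBSTITUTIONS = {"4": "d", "8": "S"}


def suspicious_hapaxes(tokens):
    """Return tokens that occur once and mix letters with digits."""
    freq = Counter(tokens)
    return [tok for tok, count in freq.items() if count == 1 and DIGIT_IN_WORD.match(tok)]


def candidate_reading(token):
    """Propose a reading by swapping each known digit for its letter."""
    return "".join(SUBSTITUTIONS.get(ch, ch) for ch in token)


# Invented token list built around the two examples from the post.
tokens = ["the", "letters", "were", "trasmitte4", "by", "8ea", "the", "1649"]
flagged = suspicious_hapaxes(tokens)
readings = {tok: candidate_reading(tok) for tok in flagged}
```

Pure numbers such as "1649" are left alone. Note that "trasmitte4" becomes "trasmitted", which still lacks its "n", so a variant-spelling step would be needed on top of the digit rule.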

High PMI values can also be used to find unusual words or word pairs that aren’t errors. While I was searching through the high PMI values I came across the word ‘King-chopper’ – used as an insult to refer to Colonel John ‘Tinker’ Fox who was falsely rumoured to be one of King Charles I’s executioners in 1649. The Man in the Moon, the newsbook in which the reference appears, was printed by John Crouch. Crouch was a Royalist journalist who was arrested and imprisoned for printing The Man in the Moon after the King’s death.

Mid-range PMI values are useful for understanding how language was used in the newsbooks. ‘Loyal’ often co-occurs with words such as ‘crown’, ‘royalist’, ‘sovereign’, ‘majesty’, ‘Charles’, ‘usurp’, and ‘treason’. This implies that the word ‘loyal’ is mainly being used by Royalist newsbooks in 1649 rather than Parliamentarian newsbooks. If I had more time I would look more closely at the differences in the language used by Royalist and Parliamentarian newsbooks.

PMI and hapax legomena have been useful for finding errors within the newsbooks but they have mainly provided an interesting way for me to interact with the texts. The PMI data often encouraged me to research the newsbooks and the people who printed them further and hapax legomena have provided useful insights into how the newsbooks were printed in 1649.

Naomi Tadmor: Semantic analysis of keywords in context

On 30 October, Prof. Naomi Tadmor led a workshop at the University of Sheffield, hosted by the Sheffield Centre for Early Modern Studies. In what follows, I briefly summarise Tadmor’s presentation, and then provide some reflections related to my own work, and to Linguistic DNA.

The key concluding points that Tadmor put forward are, I think, important for any work with historical texts, and thus also crucial to historical research:

Understanding historical language (including word meaning) is necessary for understanding historical texts.
To understand historical language we must analyse it in context.
Analysing historical language in context requires close reading.

Whether we identify as historians, linguists, corpus linguists, literary scholars, or otherwise, we would do well to keep these points in mind.

Tadmor’s take on historical keywords

Tadmor’s specific arguments in the masterclass focused on kinship terms. In Early Modern English (EModE), there was a broad array of referents for kinship terms such as brother, mother, father, sister, and associated terms like family and friend, which are not likely to be intuitive to a speaker of Present Day English (PDE). Evidence shows, for example, that family often referred to all of the individuals living in a household, including servants, to the possible exclusion of biological relations living outside of the household. The paper Tadmor asked us to read in advance (first published in 1996), supplemented with other examples at the masterclass, provides extensive illustrations of the nuance of family and other kinship terms.

In EModE, there was also a narrow range of semantic or pragmatic implications related to kinship terms: these meanings generally involved social expectations, social networks, or social capital. So, father could refer to ‘biological father’ or ‘father-in-law’ (or even ‘King’), and implied a relationship of social expectation (rather than, for example, a relationship of affection or intimacy, as might be implied in PDE).

By identifying both the array of referents and the implications or senses conveyed by these kinship terms, Tadmor provides a thorough illustration of the terms’ lexical semantics. We can see this method as being motivated by historical questions (about the nature of Early Modern relationships); driven in its first stage by lexicology (insofar as it begins by asking about words, their referents, and senses); and then, in a final stage, employing lexicological knowledge to analyse texts and further address the initial historical questions. Tadmor avoids circularity by using one data set (in her 1996 paper) to identify a hypothesis regarding lexical semantics, and another data set to test her hypothesis. What do these observations about lexical semantics tell us about history? As Tadmor notes, it is by identifying these meanings that we can begin to understand categories of social actions and relationships, as well as motivations for those actions and relationships. Perhaps more fundamentally, it is only by understanding semantics in historical texts that we can begin to understand the texts meaningfully.

A Corpus Linguist’s take on Tadmor’s methods

Reflecting on Tadmor’s talk, I’m reminded of the utility of the terms semasiology and onomasiology. In semantic research, semasiology is an approach which takes a term as its object of inquiry, and proceeds to identify the meanings of that term. Onomasiology is an approach which begins with a meaning, and then identifies the various terms for expressing it. Tadmor’s method is largely semasiological, insofar as it looks at the meanings of the term family and other kinship terms. This approach begins in a relatively straightforward way: find all of the instances of the word (or lemma), and you can then identify its various senses. The next step is more difficult: how do you distinguish its senses? In linguistics, a range of methods is available, with varying degrees of rigour and reproducibility, and it is important that these methods be outlined clearly. Tadmor’s study is also onomasiological, as she compares the different ways (often within a single text) of referring to a given member of the household family. This approach is less straightforward: how do you identify each time a member of the family is referred to? Again, a range of methods is available, each with its own advantages and disadvantages. A clear statement and justification of the choice of method renders any study more rigorous. In my experience, thinking in terms of onomasiology and semasiology helps in designing a systematic and rigorous study.
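The first semasiological step, finding every instance of a word so that its senses can be read in context, is what a keyword-in-context (KWIC) concordance does. The sketch below is a minimal version under simple assumptions: it matches one exact word form, ignoring case, rather than a lemma or the variant spellings common in EModE. The function name kwic and the sample sentence are invented for illustration.

```python
import re


def kwic(text, keyword, width=30):
    """Return (left context, keyword, right context) for each hit.

    Assumption: only the exact word form is matched (case-insensitively);
    spelling variants and lemmatisation are out of scope for this sketch.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(0), right))
    return lines


sample = "My family dined at home. The whole Family was well."
for left, key, right in kwic(sample, "family", width=12):
    print(f"{left:>12} [{key}] {right}")
```

The concordance only gathers the evidence; distinguishing the senses in each line still depends on close reading.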

Semasiology and onomasiology allow us to distinguish types of study and approaches to meaning, which can in turn help render our methods more explicit and clear. Similarly, distinguishing editorially between a word (e.g. family) and a meaning (e.g. ‘family’) is useful for clarity, and thinking methodologically in terms of semasiology and onomasiology encourages exactly this kind of editorial precision. In Tadmor’s 1996 paper, double quotes (e.g. “family”) are used to refer to either the word family or the meaning ‘family’ at various points. Such a paper could at times be made clearer, it seems to me, by adopting consistent editorial conventions like those used in linguistics (e.g. quotes or all caps for meanings, italics for terms). The distinction between a term and a meaning is by nature not always clear or certain: that difficulty is all the more reason for journals to adhere to rigorously defined editorial conventions.

From the distinction between terms and concepts, we can move to the distinction between senses and referents. It is important to be explicit both about changes in referent and changes in sense, when discussing semasiological change. For example, as historians and linguists, we must be sure that when we identify changes in a word’s referents (e.g. father referring to ‘father-in-law’), we also identify whether there are changes in its sense (e.g. ‘a relationship of social expectation’ or ‘a relationship of affection and intimacy’). When Thomas Turner refers to his father-in-law as father, he seems to be using the term, as identified by Tadmor, in its Early Modern sense implying ‘a relationship of social expectation’ rather than in the possible PDE sense implying ‘a relationship of affection and intimacy’. The terms referent and sense allow for this distinction, and are useful in practice when conducting this kind of semantic analysis.

Of course, if a term becomes polysemous, it can be applied to a new range of referents, with a new sense, or even with new implicatures or connotations. For example, we can imagine (perhaps counterfactually) a historical development in which family might have come to refer to cohabitants who were not blood relations. At the same time, in referring to those cohabitants who were not blood relations, family might have ceased to imply any kind of social expectation, social network, or social capital. That is, it’s possible for both the referent and the sense to change. In this case, as Tadmor has shown, that doesn’t seem to be what’s happened, but it’s important to investigate such possible polysemies.

Future possibilities: Corpus linguistics

As a corpus linguist, I’d be interested in investigating Tadmor’s semantic findings via a quantitative onomasiological study, looking more closely at selection probabilities. Such a study could ask research questions like:

Given that an Early Modern writer is expressing ‘nuclear family’, what is the probability of using term a, b, etc., in various contexts?
Given that a writer is expressing ‘household-family’, what is the probability of using term a, b, etc., in various contexts?
Given that a writer is expressing ‘spouse’s father’ or ‘brother’s sister’, etc., what is the probability of using term a, b, etc., in various contexts?

These onomasiological research questions (unlike semasiological ones) allow us to investigate logical probabilities of selection processes. This renders statistical analyses more robust. Changes in probabilities of selection over time are a useful illustration of onomasiological change, which is an essential part of semantic change.
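To make the idea of a selection probability concrete: if each reference in a text has been annotated by hand with the meaning expressed and the term chosen, then the probability of a term given a meaning is that term's share of all the references to that meaning. The sketch below assumes such annotations already exist as (meaning, term) pairs. The function name selection_probabilities is hypothetical, and the data are invented purely to show the arithmetic; they are not findings about any Early Modern text.

```python
from collections import Counter, defaultdict


def selection_probabilities(annotations):
    """Estimate P(term | meaning) from hand-annotated (meaning, term) pairs."""
    counts = defaultdict(Counter)
    for meaning, term in annotations:
        counts[meaning][term] += 1

    probabilities = {}
    for meaning, term_counts in counts.items():
        total = sum(term_counts.values())
        probabilities[meaning] = {
            term: count / total for term, count in term_counts.items()
        }
    return probabilities


# Invented annotations for illustration only.
annotations = [
    ("spouse's father", "father"),
    ("spouse's father", "father"),
    ("spouse's father", "father-in-law"),
    ("household-family", "family"),
]
probs = selection_probabilities(annotations)
print(probs["spouse's father"])
```

Running the same calculation on texts from different periods, or on different contexts within one period, would show whether the probabilities of selection shift.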

And for Linguistic DNA?

For Linguistic DNA, I see (at least) two major questions related to Tadmor’s work:

Can automated distributional analysis uncover the types of phenomena that Tadmor has uncovered for family?
What is a concept for Tadmor, and how can her work inform our notion of a concept?

In response to the first question, it is certainly possible that distributional analysis can reflect changing referents (such as ‘father-in-law’ referred to as father). Hypothetically, the distribution of father with a broad array of referents might entail a broad array of lexical co-occurrences. In practice, however, this might be very, very difficult to discern. Hence Tadmor’s call for close reading. It is perhaps more likely that the sense (as opposed to referent) of father as ‘a relationship involving social expectations’ might be reflected in co-occurrence data: hypothetically, father might co-occur with words related to social expectation and obligation. We have evidence that semantically related words tend to constitute only about 30% of significant co-occurrences. Optimistically, it might be that the remaining 70% of words do suggest semantic relationships, if we know how to interpret them—in this case, maybe some co-occurrences with family would suggest the referents or implications discussed here. Pessimistically, it might be that if only 30% of co-occurring words are semantically related, then there would be an even lower probability of finding co-occurring words that reveal such fine semantic or pragmatic nuances as these. Thanks to Tadmor’s work, Linguistic DNA might be able to use family as a test case for what can be revealed by distributional analysis.
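A test case of this kind could start from something as simple as the sketch below, which counts the words that appear within a few words of father and then asks what share of them belong to a hand-made list of words for social expectation. Everything here is an assumption made for illustration: the window size, the function names, the toy sentence, and above all the word list, which a real study would need to justify. It is not a description of Linguistic DNA's own processing.

```python
from collections import Counter


def window_collocates(tokens, node, window=4):
    """Count the words occurring within `window` tokens of each `node` hit."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            for j in range(start, end):
                if j != i:
                    counts[tokens[j]] += 1
    return counts


def share_in_lexicon(collocates, lexicon):
    """Return the proportion of collocate tokens found in `lexicon`."""
    total = sum(collocates.values())
    if total == 0:
        return 0.0
    hits = sum(count for word, count in collocates.items() if word in lexicon)
    return hits / total


# Invented sentence and invented word list, for illustration only.
tokens = "my father expects duty and my father demands obedience".split()
collocates = window_collocates(tokens, "father", window=2)
print(share_in_lexicon(collocates, {"duty", "obedience"}))
```

Raw counts like these would normally be filtered by a significance or association measure before any share is interpreted.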

What is a concept? Tadmor (1996) doesn’t define concept, and sometimes switches quickly, for example, between discussing the concept ‘family’ and the word family, which can be tricky to follow. At times, concept for Tadmor seems to be similar to definition—a gloss for a term. At other times, concept seems to be broader, suggesting something perhaps with psycholinguistic reality, a sort of notion or idea held in the mind that might relate to an array of terms. Or, concept seems to relate to discourses, to shared social understandings that are shaped by language use. Linguistic DNA is paying close attention to operationalising and/or defining concept in its approach to conceptual and semantic change in EModE. Tadmor’s work points in the same direction that interests us, and the vagueness of concept which Tadmor engages with is vagueness that we are engaging with as well.