Tag Archives: TCP

Of concepts and kings: curating a collection using EEBO-TCP

In Spring 2018, MA student Sophie Whittle dedicated 100 hours of hard graft to filling blanks in the Terms metadata. This work follows on from Winnie Smith’s work last year, identifying gaps in the Text Creation Partnership’s metadata. Sophie reflects on her experience and offers some tentative analysis focused on texts from the English Civil War and Commonwealth era:

As an MA linguistics student, I was elated to see an opportunity to apply for Linguistic DNA as part of the School of English’s work placement module. I am looking to apply for a PhD in the future, and I saw the project as a chance to undertake independent research. After completing a corpus-based dissertation about the semantic-syntactic development of the verb promise, I was looking forward to exploring historical texts in a different way—through the use of EEBO-TCP and LDNA computational methods.

After undertaking three years of linguistic study, I realised I had not delved into much historical work as my interests were very much theoretical at the time. Working on the placement has allowed me to regain historical interests which I had left back at GCSE and A-Level. During the summer prior to Master’s study, I was in conversation about how King Charles I, who had intended to dine with the Governor of Hull (my home city), was stopped at Beverley Gate by Parliament. As the Governor had expressed allegiance to the Parliamentarians, he was named a traitor, but Charles was forced to return to York. This defining moment, not only for Hull but for Civil War history, piqued my interest. I was able to reflect on the conversation I’d had over summer whilst contributing to the inputting of empty metadata terms for LDNA and viewing documents of news and conflict from the Civil War. I wanted to understand the general public’s attitudes to the different sides of the Civil War in England by using LDNA’s conceptual modelling methods.

Towards the end of my placement, I wrote a proposal for a Civil War and Commonwealth collection to be included on the final LDNA interface. The large number of texts from this period (a search via ProQuest’s EEBO interface, eebo.chadwyck.com, brings back 35,008 records) is of interest to researchers from linguistics, literature, history, theology, etc. Modelling concepts from and across the period, by determining frequently co-occurring words as pairs or trios, makes it clear that there is a wealth of information to research. My proposal employed the following term definitions:

Title contains: king
Terms contains: civil war, or commonwealth
Date range: 1642 to 1660

By defining these parameters, the idea was that a user could analyse attitudes towards different sides in the Civil War, from both Royalist and Parliamentarian perspectives.
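As a rough illustration of how the three parameters combine, here is a minimal Python sketch. The `Record` fields, the sample identifiers and the whole-word reading of "contains: king" are assumptions made for the example; they are not the TCP metadata schema or the LDNA interface.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class Record:
    # Hypothetical metadata record; field names are illustrative only.
    tcp_id: str
    title: str
    terms: List[str]
    year: int


def in_collection(rec: Record, start: int = 1642, end: int = 1660) -> bool:
    """Return True if a record meets all three collection parameters."""
    # Assumption: "contains: king" means the whole word, so "kingdom" is excluded.
    title_ok = re.search(r"\bking\b", rec.title, flags=re.IGNORECASE) is not None
    # Terms must mention "civil war" or "commonwealth" (case-insensitive).
    terms_ok = any(
        "civil war" in t.lower() or "commonwealth" in t.lower() for t in rec.terms
    )
    # The date range is inclusive at both ends.
    date_ok = start <= rec.year <= end
    return title_ok and terms_ok and date_ok
```

Whether "king" should also match "kings" or "kingdom" is a design choice that would change the size of the collection considerably.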

While designing my collection, I also wanted to see if I could mimic LDNA’s use of ‘windows’. I came across a four-text sample about the preserving of peace during the Civil War, with each text written from a different viewpoint. For instance, one of the texts discusses the New Model Army, a dissenting faction of the Parliamentarian side, as the ‘obstructors’ to peace in the kingdom (A25836). The army was largely independent and held radical Puritan views during the Civil War. By contrast, a different text suggests peace was only possible if King Charles I prospered, rejecting Parliamentarianism (A25857). I used the node words ‘King’ and ‘Parliament’ as a starting point and counted ten words either side of each node word (a 20-word window, W20). I then calculated the frequency of the words with the most tokens, and came up with the following table (the frequency is displayed as a percentage):

Table (not reproduced here): sampled co-occurrences with ‘king’ and ‘parliament’ across the four texts, for a window of 20 words.
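The manual count described above could be approximated along the following lines. This is only a sketch: the tokeniser is deliberately crude, there is no lemmatisation, and the percentage is assumed to be each word's share of all tokens falling inside the windows, which may not be the denominator used for the original table.

```python
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and keep alphabetic runs only (no lemmatisation)."""
    return re.findall(r"[a-z]+", text.lower())


def window_cooccurrences(tokens, node, span=10):
    """Count the words found within `span` tokens either side of each `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - span):i]
        right = tokens[i + 1:i + 1 + span]
        counts.update(left + right)
    return counts


def as_percentages(counts):
    """Express each count as a share of all window tokens (assumed denominator)."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {word: 100 * c / total for word, c in counts.items()}
```

Calling `window_cooccurrences(tokenize(text), "king")` and then `as_percentages(...)` gives one column of such a table. Note that where two node words sit close together their windows overlap, so the same neighbouring word is counted once per window.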

The percentages are not particularly high, perhaps because of the size of the window; increasing it might help. However, by identifying the ten most frequently co-occurring lemmas with the node words, a number of interesting results appear. For instance, peace co-occurs more frequently alongside King than Parliament, suggesting something about the bias of the texts. Further close reading might indicate why peace was continuously associated with Royalism (or not, which emphasises the importance of close reading!).

Additionally, the reason for vote co-occurring alongside Parliament might seem obvious at first. Yet analysing this within its context tells a different story. Most of the co-occurrences appear in the text slandering the New Model Army. In this text, the author believes that Parliament are voting in response to pressures from the NMA, and are therefore void of their privileges as a democratic union. A list of evidence is provided by the author to explain how Parliament have been stripped of their privileges (during the ‘tumult of the Apprentices’, when apprentices were freed from their masters and asked to join the NMA in ‘a state of confusion’). In the author’s view, the NMA forced Parliament to go against their morals by undoing their previous work.

I also gained access to trio output data from Susan and her work on Newsbooks, to see if I could find something similar. The top concepts for ‘King’ and ‘Parliament’ are ‘king – lord – parliament’ and ‘parliament – state – council’ respectively, which are to be expected given the genre of the texts. Looking at slightly less frequent trios reveals some more intriguing items. The following trios complement the data I had found manually from the four-text sample: ‘king – people – liberty’ with a PMI of 3.09 and ‘parliament – present – authority’ with a PMI of 4.09. (Both PMIs are positive, showing that the observed trios occur more often than expected by chance.) There is so much data to explore here, highlighting that LDNA should be an excellent resource for conducting quantitative study. As shown, it is important to analyse the findings within their contexts to specify how concepts are cemented in history. LDNA promotes the combination of distant reading (using statistical methods) and close reading. It was interesting to imagine the final interface with the addition of literary analysis.
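For readers unfamiliar with PMI (pointwise mutual information), the sketch below shows one common way of extending it from pairs to trios: the log of the observed joint probability over the product of the individual probabilities. The post does not describe how LDNA calculates its scores, so the formula, the base-2 logarithm and the use of windows as the counting unit are all assumptions, and the numbers in the usage note are invented.

```python
import math


def pmi(joint_count, word_counts, total_windows):
    """Generalised PMI: log2 of observed joint probability over chance expectation.

    joint_count   -- windows in which all the words occur together
    word_counts   -- windows containing each individual word, one count per word
    total_windows -- total number of windows examined
    """
    p_joint = joint_count / total_windows
    p_chance = 1.0
    for count in word_counts:
        p_chance *= count / total_windows
    return math.log2(p_joint / p_chance)
```

With 64 windows, three words that each appear in 16 of them, and 8 windows containing all three, `pmi(8, [16, 16, 16], 64)` comes to 3: the trio turns up eight times as often as independence would predict. A score of 0 would mean the trio occurs exactly as often as chance suggests.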

Working with the TCP metadata has allowed me to explore concepts within the Civil War and Commonwealth period and finalise the work placement by writing a collection brief. As a linguistics student, it has been a real challenge to identify texts based on their literary genre. This pushed me out of my comfort zone. Being able to use my semantic skills to pull apart the meaning behind the conceptual findings has helped too. I am very grateful to have been given the opportunity to use my existing skills and gain new ones!

Featured image: Hull City Skyline. From an original photograph by John Bannon. Used under license CC 3.0.

What does EEBO represent? Part I: sixteenth-century English

Ahead of the 2016 Sixteenth Century Conference, Linguistic DNA Research Associate Iona Hine reflected on the limits of what probing EEBO can teach us about sixteenth-century English. This is the first of two posts addressing the common theme “What does EEBO represent?”

The 55 000 transcriptions that form EEBO-TCP are central to LDNA’s endeavour to study concepts and semantic change in early modern English. But do they really represent the “universe of English printed discourse”?

The easy answer is “no”. For several reasons:

As is well documented elsewhere, EEBO is not restricted to English-language texts (cf. e.g. Gadd). Significant bodies of Latin and French documents printed in Britain have been transcribed, and one can browse through a list of other languages identified using ProQuest’s advanced search functionality. To this extent, EEBO represents more than the “universe of English printed discourse”.

But it also represents a limited “universe”. EEBO can only represent what survived to be catalogued. Its full image records represent individual copies. And its transcriptions represent a further subset of the survivals. As the RA currently occupied with reviewing Lost Books (eds. Bruni & Pettegree),* I have a keen awareness of the complex patterns of survival and loss. A prestigious reference work, the must-buy for ambitious libraries, might have a limited print run and yet be almost guaranteed survival, however much it was actively consulted. A popular textbook, priced for individual ownership, would have much higher rates of attrition: dog-eared, out-of-date, disposable. Survival favours certain genres, and there will be gaps in the English EEBO can represent.

The best function of the “universe” tagline is its emphasis on print. We have limited access to the oral cultures of the past, though as Cathy Shrank’s current project and the Corpus of English Dialogues demonstrate, there are constructions of orality within EEBO. Equally, where correspondence was set in print, correspondence forms a part of EEBO-TCP. There is diversity within EEBO, but it is an artefact that relies on the prior act of printing (and bibliography, microfilm, digitisation, transcription, to be sure). It will never represent what was not printed (and this will mean particular underprivileged Englishes are minimally visible).

There is another dimension of representativeness that matters for LDNA. Drawing on techniques from corpus linguistics makes us aware that, in the past, corpora (collections of texts produced in order to control the analysis of language-in-use) were compiled with considerable attention to the sampling and weighting of different text types. Those using them could be confident about what was in there (journalism? speech? novels?). Do we need that kind of familiarity to work confidently with EEBO-TCP? The question is great enough to warrant a separate post!

The points raised so far have focused on the whole of EEBO. There is an additional challenge when we consider how well EEBO can represent the sixteenth century. Of the ca. 55 000 texts in EEBO-TCP, only 4826 (less than 10 per cent) represent works printed between 1500 and 1599. If we operate with a broader definition, the ‘long sixteenth century’, and impose the limits of the Short Title Catalogue, the period 1470-1640 constitutes less than 25 per cent of EEBO-TCP (12 537 works). And some of those will be in Latin and French!
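The proportions quoted above can be checked with a few lines of Python. The total of 55 000 is the approximate figure given in the post, so the shares are approximate too.

```python
# Figures quoted in the post; the EEBO-TCP total is approximate ("ca. 55 000").
TOTAL_TEXTS = 55_000
PRINTED_1500_1599 = 4_826
PRINTED_1470_1640 = 12_537


def share(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100 * part / whole


sixteenth = share(PRINTED_1500_1599, TOTAL_TEXTS)
long_sixteenth = share(PRINTED_1470_1640, TOTAL_TEXTS)
print(f"1500-1599: {sixteenth:.1f} per cent of texts")
print(f"1470-1640: {long_sixteenth:.1f} per cent of texts")
```

On these figures the sixteenth century proper accounts for just under 9 per cent of documents and the long sixteenth century for just under 23 per cent. These are shares of documents, not of words.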

Of course, some sixteenth-century items may be long texts, and the bulging document count of the 1640s is down to the transcription of several thousand short pamphlets and tracts, so that the true weighting of long-sixteenth-century-TCP may be more than the document figures indicate. Yet the statistics are sufficient to suggest we proceed with caution. While one could legitimately posit that the universe of English discourse was itself smaller in the sixteenth century, given the presence of Latin as scholarly lingua franca, it is equally the case that the evidence has had longer to go missing.

As a first post on the theme, this only touches the surface of the discussion about representativeness and limits. Other observations jostle for attention. (For example, diachronic analysis of EEBO material is often dependent on metadata that privileges the printing date, though that may be quite different from the date of composition. A sample investigation of translate’s associations immediately uncovered a fourteenth-century bible preface printed in the 1550s, exposed by the recurrence of Middle English forms “shulen” and “hadden”.) Articulating and exploring what EEBO represents is a task of some complexity. Thank goodness we’ve another 20 months to achieve it!

* Read the full Linguistic DNA review here. The e-edition of Bruni & Pettegree’s volume became open access in 2018.

Experimenting with the imperfect: ECCO & OCR

When the Linguistic DNA project was first conceived, we aimed to incorporate more than 200 000 items from Eighteenth Century Collections Online (ECCO). Comparing findings for one portion of ECCO that has been digitised in different ways, this 2016 blogpost details why that ambition proved impractical. The public database uses ECCO-TCP as its main eighteenth-century source.