
Showcasing Linguistic DNA

On Saturday 11 March (2017), some of the LDNA team took part in a Showcase as part of the University of Sheffield’s Festival of the Arts and Humanities. The event took place at Sheffield’s Millennium Galleries, allowing members of the public to discover different aspects of humanities research presented through exhibitions, activities and short presentations. Visitors found information about literature or archaeological findings, and could try out different instruments or take an implicit bias test brought in by the Philosophy Department. We asked Sheffield postgraduates Nadia and Winnie to reflect on their experience preparing for and staffing a stall as part of their MA work placements.

Winnie writes:

I prepared a handout using data from Ways of Being in a Digital Age (WOBDA). The process—zooming from abstract trios extracted from a dataset to see the patterns they made in a small extract of text—was fascinating. At first I was worried about how well the concept would translate to a non-specialist audience, but then I realised that involves negative preconceptions about what a non-specialist audience is: somehow less interested or capable of critical engagement than a specialist one. I therefore decided not to “aim” anything “at” anyone, but instead tried to summarise trios in a way that made the most sense to me as a newcomer to Linguistic DNA’s methods. I chose a single pair (internet + craving), made up a colour-coded table of the items that formed trios with it, and then put this alongside highlighted examples of trios in a journal abstract.

Snapshot of a table showing associations with 'internet' and 'craving', with example text from a social sciences journal.

Illustrating patterns of association with “internet” + “craving” (from Winnie’s handout).

This turned out to be really useful because people were interested in the project from all kinds of angles, some of which changed how I thought about what LDNA does. Fiddling with data on the placement meant I’d got sidetracked in a sense into thinking about WOBDA as a technical exercise, but the Showcase helped me see the bigger picture. Visitors were intrigued by Linguistic DNA as a name; one person was interested in whether the project was making any claims about genetic hard-wiring. Another, an IT professional, was interested in the double helix visualisation on the website, and said it would make him think about his own designs. I particularly remember a conversation with an artist who was interested in researching discourses around disability. We talked about how to query corpora, which tools were available and easy to use, the advantages and disadvantages of the BNC versus the web as corpus, how the age of the BNC might affect the language it contained, and the difference between collocations and discourse concepts as shown in WOBDA. She was also interested in word clouds; the idea of extracting implicit relationships in language and making them visible seemed to be something that appealed strongly to both adults and children who stopped at the stall.

Nadia writes:

Beyoncé and crew salute military-style. Photo by Asterio Tecson.

Beyoncé Knowles performing in Central Park, July 2011. Image copyright (c) Asterio Tecson; used under a Creative Commons license 2.0.

To prepare for this event, I mostly focused on the YouTube data. I prepared an informative and colourful poster with prominent examples, including images of Beyoncé (left) and the video game World of Tanks, to attract visitors and to suggest that we conduct contemporary research. I also searched our data for some prominently occurring words.


Individual associations at the LDNA stand, courtesy of @ShefEnglish on Twitter.

Because we do not yet have representative results for the Militarization 2.0 work, I often pointed to the Linguistic DNA research as the mother project of the YouTube project. The examples proved very useful since they complemented the information given on the posters. The audience was provided with representative examples from the Linguistic DNA project for them to look at and take away. Moreover, people had the chance to play with word cards and group them together according to their own individual word associations (right).

I observed that many people grouped together words with a similar meaning (such as ‘succeed’ and ‘win’), whereas others clustered together words according to very personal associations. An 8-year-old girl was fascinated by the cards, pairing ‘victory’ and ‘win’; we looked at how these words appear in our given examples and the advantages of having a computer that counts words as lemmas. One visitor told us about his aphasia and how it changed and affected his use of language, which made me realise that next to the Linguistic DNA we are researching, every person has his or her own, very personal linguistic DNA. Another visitor was inspired by the YouTube project and connected language use to social issues, such as the omnipresence of on- and offline violence, providing food for thought for all participants of the conversation.

From my point of view, the event was a success. Many visitors seized the opportunity to have a chat with us, which led to various stimulating encounters and conversations. It was intriguing to see that numerous people were willing to share personal stories and views on language and its importance. The public seemed to engage and identify with our project on many different levels, which confirms how important this kind of research is—not only for the academic community but also for the public.

Also participating in the Showcase were LDNA Research Associates Seth Mehl (below left), who delivered a bitesize talk asking “What can computers teach us about meaning in early English books?”, and Iona Hine (right, during her bitesize talk about “Luther’s Language”).

(Photos courtesy of @DHIShef and D. Clark.)

Looking back, looking forward: Linguistic DNA in 2016 and 2017

As we move into 2017, we’ve been looking back at achievements in 2016, and ahead to what we aim to achieve in the coming year.

2016 was an outwardly busy year as we travelled to Bruges, Essen, Krakow, Lausanne, Leeds, Brighton, Murcia, Nottingham, Paris, Saarbrucken, and Utrecht, sharing more of our thinking and early data with different audiences. Closer to “home”, we benefitted from the exchange of ideas with LDNA-hosted panels at Sheffield DH Congress and our second methodological workshop in Sussex. In 2017, we will be focusing back on our interface development and some more in-depth research, though we intend to be present at DH, SHEL, ICAME and SHARP, in order to continue some fruitful conversations.

On the blog, we have been reflecting on representativeness and the nature of EEBO-TCP. We’ve also documented our decision not to use ECCO’s OCR data to analyse eighteenth century print. You can expect to hear about the alternative 18th century datasets we’re choosing to work with later in 2017.

During the Autumn, the LDNA researchers collaborated on two articles about the project, its theory and praxis, both (hopefully) to be published this year following peer review. Generating examples from each research theme based on our early data and tying these together effectively was an enjoyable challenge, and we have already used the draft of one piece as part of our briefing materials for upcoming MA placements at The Digital Humanities Institute | Sheffield (formerly known as HRI Digital).

In the past six months, the Sheffield team have captured funding for two additional applications of the Linguistic DNA “concept modelling” tools:

The ESRC project Ways of Being in a Digital Age combines our quantitative insights with a qualitative literature survey of academic publications. Scheduled to inform the ESRC’s next programme of digital society funding, this impact-full study has compelled us toward rapid prototype development. The interface being put together to serve ‘WoBDA’ colleagues will also form the kernel of the subsequent LDNA workbench.
From next month, we are involved in another funded impact-related project, collaborating with the University of Leeds to explore the conceptual structure of millions of YouTube video comments on the theme of militarisation, as part of a larger project funded by the Swedish Research Council. This is a six-month commitment, bringing in a further research associate to theorise what’s involved in applying our measures to some very different data.

We also have three significant applications in place for other pots of funding, including Horizon 2020 collaborations, attesting confidence about our nascent processes and the multifarious opportunities for their application and impact.

Meanwhile, Glasgow has been using the present word co-occurrence data to develop its methodology for investigating processor data from the perspective of key Historical Thesaurus categories. We have continued to develop analysis of Thesaurus categories, looking for those which show abnormal instances of growth or decline; a provisional methodology for establishing statistical ‘baselines’ has been plotted out which is now being implemented and refined. Further possibilities are being tested, such as amalgamating data across whole layers of the HT hierarchy rather than by individual category, and the effects of separating out part of speech within categories or layers.

On “Lost Books” (ed. Bruni & Pettegree)

Review: Lost Books: Reconstructing the Print World of Pre-Industrial Europe. Ed. Flavia Bruni and Andrew Pettegree. Library of the Written Word 46 / The Handpress World 34. Leiden & Boston: Brill, 2016. 523 pages.

We solicited this book for review because we have been keenly aware that we cannot take what has been transcribed and preserved through the digitisation processes of Early English Books Online and the Text Creation Partnership as an accurate indication of all the material that was printed in the early modern period. Setting aside the idiosyncrasies of selectivity in the composition of EEBO-TCP, which have been documented elsewhere, there is a prior ‘selectivity’ about what survived to be catalogued.

The volume collects together the proceedings of the Lost Books conference held at the University of St Andrews in June 2014, and divisions within the volume loosely reflect those of the original call.

Pettegree’s introduction, “The Legion of the Lost”, is a full-length essay discussing not only how books become lost but how one can know about what has been lost. It is accessible and engaging and would be a worthy reading assignment for undergraduates or masters students studying book history. As observed in a prior blogpost, “While the chapter … performs the function of uniting what follows, and does at times point to specific contents in the coming chapters, there is nothing of the clunkiness that one sometimes observes in the introduction of an edited collection.”

The two essays that follow both approach the challenge of assessing the loss of incunabula, i.e. print materials from before 1500. Falk Eisermann begins with a comparison of the listings in the Gesamtkatalog der Wiegendrucke with the Incunabula Short Title Catalogue. He probes possible methods for distinguishing items that were printed (and lost) from items never printed, giving examples from archival sources that defy expectation: “lost editions by unknown printers (sometimes located in incunabulistic ghost towns), containing texts not preserved anywhere else, even representing works of hitherto unrecorded authors” (43). The book historians’ task, one may imagine, is an uphill struggle; optimistically, there is fresh work to be done as no one has yet analysed the customary discussion of other printed works in paratext “with regard to dark matter” (50). Jonathan Green and Frank McIntyre (Chapter 3) aim to quantify the losses, offering an open discussion of the pitfalls of particular statistical approaches to this question. They recommend modelling the counts of surviving copies as a negative binomial distribution, accommodating correlation in loss and survival. For (and this is significant to LDNA) “books are not preserved or destroyed independently of each other” (59). Small items are more likely to survive if bound together; volumes in a library often share a common destiny. In addition, taste is a cultural construct with ideas of fashion and significance affecting more than one owner’s decision to dispose of or conserve. Taking into account variations of format, Green and McIntyre suggest that as much as 30 per cent of Quarto editions may have been lost entirely, compared with 60 per cent of Broadsides and 15 per cent of Folios.
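
To see why allowing for correlated loss matters, here is a minimal sketch in Python. It is not Green and McIntyre's model or data: the mean of 1.5 surviving copies per edition and the dispersion value of 0.8 are invented for illustration, and the function names are our own. It compares the share of editions expected to survive in zero copies under a Poisson model (copies lost independently) and under a negative binomial with the same mean.

```python
import math

def poisson_p_zero(mean):
    """Share of editions with no surviving copy if copies are lost independently."""
    return math.exp(-mean)

def negbin_p_zero(mean, dispersion):
    """Share of editions with no surviving copy under a negative binomial.

    Parameterised by the mean and a dispersion r, where
    variance = mean + mean**2 / r. Smaller r means survival and loss
    are more strongly clustered.
    """
    return (dispersion / (dispersion + mean)) ** dispersion

mean_copies = 1.5  # hypothetical average number of surviving copies per edition
dispersion = 0.8   # hypothetical; not a value taken from the chapter

lost_poisson = poisson_p_zero(mean_copies)
lost_negbin = negbin_p_zero(mean_copies, dispersion)

print(f"Estimated share entirely lost (Poisson): {lost_poisson:.2f}")
print(f"Estimated share entirely lost (negative binomial): {lost_negbin:.2f}")
```

With the same average survival, the clustered model implies a larger share of editions lost without trace, which is the practical consequence of books not being "preserved or destroyed independently of each other".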

Part 2 is composed of national case-studies covering vocal scores from Renaissance Spain (Chapter 4, showing a markedly persistent repertoire conserved by copying when required); evidence of book ownership and circulation in pre-Reformation Scandinavia (Chapter 5, conducted with the help of inventories); the meticulous reconstruction of a lost Polish original on the basis of later editions (Chapter 6, touching also on the circulation of fortune-telling books throughout early modern Europe); a study of the Stationers’ Company Register (Chapter 7); a sheet-count-based model for calculating loss of seventeenth-century materials based on records for the Southern Netherlands, using the metadata-rich STCV, which also positions title-page engraving and roman typeface as features positively correlated with survival (Chapter 8); the identification of patterns of loss using book advertisements from the Dutch Republic (Chapter 9, exposing partly the proliferation of multiple localised editions); and a report weaving together a census of seventeenth-century Sicilian printing activity with a legal dispute over the library of Francesco Branciforti, attesting strong local attachment to this private collection (Chapter 10).

In Part 3, Christine Bénévent and Malcolm Walsby revisit the publication history of Guillaume Budé’s apophthegms (Chapter 11), combining careful study of the layout to demonstrate Gazeau’s compositor pretended to a new edition by replacing the first quire, with a call not to dismiss the “intellectual value” of later editions, noting the Paris copy of De L’Institution du Prince had the highest survival rate and was owned “by the most influential and powerful in early modern Europe” (including Edward VI, 252). Michele Camaioni aims to reconstruct a censored (but popular) mystical text using its censorship record (Chapter 12). Three further chapters draw on data from the RICI project, a study of Italian religious orders’ book ownership based on a Vatican-led census: Rosa Marisa Borraccini documents Girolamo de Palermo’s “unknown best-seller”, a devotional work running to “plausibly . . . more than one hundred” editions (Chapter 13); Roberto Rusconi probes weaknesses in the cataloguing, involving misspelt transcriptions, inadequate shorthand (opera omnia, etc.) and perhaps the deliberate disguising of works by disapproved authors (Chapter 14); and Giovanni Granata attempts to merge statistical extrapolation of lost works with study of specific lost editions based on the bibliographic records produced by the census (Chapter 15).

Part 4 is dedicated to lost libraries. Anna Giulia Cavagna observes the motives of Alfonso del Carretto, an exiled monarch whose self-catalogued collection prioritised texts pertaining (mostly through paratext such as dedications) to people whose powerful patronage he wished to secure, revealing books as “vectors of social relations” (357, Chapter 16). Martine Julia van Ittersum pursues the preservation and loss of Hugo Grotius’ personal collections, observing that preservation required “neglect, though not too much of it” (384) and that the preservation of printed materials was correlated with loss of manuscript (Chapter 17). Federico Cesi, the target of Maria Teresa Biagetti’s study (Chapter 18), was the founder of the Accademia dei Lincei in Rome; his now dispersed collection included works of botany, zoology, alchemy, and medical texts, its components known through correspondence and post mortem inventory. Sir Hans Sloane’s collections, including printed books “estimated at about 45,000 volumes”, formed the kernel of what is now the British Library; Alison Walker explains the difficulties of tracing Sloane’s books, which when duplicated by other collections were often dispersed through sale or gifting, or migrated at the creation of new specialist institutions such as the Natural History Museum. By reconstructing the collection, Walker argues, one may attain a “reflection . . . of the intellectual environment of the day” and of “Sloane himself as a scientist and physician” (412, Chapter 19). The last chapter in Part 4 outlines the hopes of the AHRC research network ‘Community Libraries: Connecting Readers in the Atlantic World’, using a case study from Wigtown (SW Scotland) to show how archival resources about the creation and use of libraries yield insight into sociability (Chapter 20); we find widows borrowing while patrons gain more from the bureaucracy and facilitation than the library’s holdings.

The last section (Part 5), entitled “War and Peace”, considers the woes that have befallen historic collections in more recent times. Jan Alessandrini discusses Hamburg’s collections, protection measures during the Second World War, the seizure of private Jewish libraries, and the political challenges of reconstruction (with some prospect of help from Russian digitisation, Chapter 21). Tomasz Nastulczyk acknowledges that “Swedish pillaging paradoxically helped to preserve” books from the Polish-Lithuanian Commonwealth that might otherwise have been lost (462, Chapter 22). Co-editor Flavia Bruni writes of the successful preservation of Italian archives and libraries aided by “a clear and centralised policy” in WW2, arguing that “international agreements” are also essential if cultural heritage is to be preserved (484, Chapter 23). The closing chapter is devoted to broadsheet ordinances, lost (or perhaps missing) as a result of the collapse of Cologne city archives in 2009; happily, microfilm means all is not lost, and Saskia Limbach also successfully traces invoices and other evidence of print activity through a range of archival sources (Chapter 24).

It will be evident from this account that the case studies are drawn from across Europe, with three chapters directly addressing British material. Of these, Alexandra Hill’s intersects most closely with the period Linguistic DNA has focused on so far, with the Register containing “with some exceptions [e.g. government publications and school books], . . . all the books authorised to be printed during the Elizabethan, Jacobean and early Caroline periods” (144–5). Comparing this information with the English Short Title Catalogue, Hill shows that for the 1590s, the survival rate of fiction and ballads is significantly lower than other genres of publication; in addition, within a relatively well-preserved domain such as religious literature, subcategories may fare disproportionately badly as is the case for prayer books, destroyed—Hill hypothesises—by continual use. These kinds of absences need to be borne in mind as we proceed to analyse the survivors. Of course, given the cultural traffic of early modern Europe, much of what is learned from non-British collections is also relevant for thinking critically about how texts survived, how others were lost, and how Linguistic DNA should correspondingly limit the claims built on the print discourse of EEBO-TCP.

As of summer 2018, Lost Books is open access and freely available online for all to read.

LDNA at Digital Humanities Congress 2016, Sheffield

LDNA organised two panels at the 2016 Digital Humanities Congress (DHC; Sheffield, 8th-10th September). Both focused on text analytics, with the first adopting the theme ‘Between numbers and words’, and the second ‘Identifying complex meanings in historical texts’. Fraser reports on both panels in the full post.

Language, visualisation and methodology: our second workshop

Monday 5 September saw the Linguistic DNA team camping out at the University of Sussex for our second methodological workshop. This year the theme was “Visualisation and Language Change”, and we’ve harnessed the powers of Storify to put together a short account of a long and enjoyable day’s work. See more (on Storify).

What does EEBO represent? Part II: Corpus linguistics and representativeness

What exactly does EEBO represent? Is it representative?

Often, the question of whether a corpus or data set is representative is answered first by describing what the corpus does and does not contain. What does EEBO contain? As Iona Hine has explained here, EEBO contains Early Modern English, but it is much larger than that in some ways, and also much more limited than that. EEBO contains many languages other than English, which were printed in the British Isles (and beyond) between 1476 and 1700. But EEBO is also limited: it contains only print, whereas Early Modern English was also hand-written and spoken, across a large number of varieties.

Given that EEBO contains Early Modern print, does EEBO represent Early Modern print? In order to address this question meaningfully, it’s crucial first to define representativeness.

In corpus linguistics, as in other data sciences and in statistics, representativeness is a relationship that holds between a sample and a population. A sample represents a larger population if the sample was obtained rigorously and systematically in relation to a well-defined population. If the sample is not representative in this way, it is an arbitrary sample or a convenience sample – i.e. it was not obtained rigorously and systematically in relation to a well-defined population. Representativeness allows us to examine the sample and then draw conclusions about the population. This is a fundamental element of inferential statistics, which is used in data science from epidemiology to corpus linguistics.

Was EEBO sampled systematically and rigorously in relation to a well-defined population? Not at all. EEBO was sampled arbitrarily, by convenience – first, including only texts that have (arbitrarily) survived; then including texts that were (arbitrarily) available for scanning and transcription; and, finally, including those texts that were (arbitrarily) of interest to scholars involved with EEBO at the time. Could we, perhaps, argue that EEBO represents Early Modern print that survived until the 21st century, was available for scanning and transcription, and (in many cases) was of interest to scholars involved with the project at the time? I think we would have to concede that EEBO wasn’t sampled systematically and rigorously in relation to that definition, and that the arbitrary elements of that population render it ill-defined.

So, what does EEBO represent? Nothing at all.

It’s difficult, therefore, to test research questions using inferential statistics. For example, we might be interested in asking: Do preferences for the near-synonyms civil, public, and civic change over time in Early Modern print? We can pursue such a question in a straightforward way, looking at frequencies of each word over time, in context, to see if there are changes in use, with each word rising or falling in frequency. In fact, we can quite reliably discern what happens to these preferences within EEBO. But our question, as stated, was about Early Modern print. It is the quantitative step from the sample (EEBO) to the population (Early Modern print) that is problematic. Suppose that we do find a shifting preference for each of these words over time. Because EEBO doesn’t represent the population of Early Modern print in any clear way, we can’t rely on statistics to conclude that this is in fact a correlation between preferences and time rather than an artefact of the arbitrariness of the sampling. The observation might be due to any number of textual or sociolinguistic variables that were left undefined in our arbitrary sample – including variation in topics, or genres, or authorial style, or even authors’ gender, age, education, or geographic profile.

It is as though we were testing children’s medication on an arbitrary group of people who happened to be walking past the hospital on a given day. That’s clearly a problem. We want to be sure that children’s medication was tested on children – but not simply children, because we also want to be sure that it isn’t tested on children arbitrarily sampled, for example, from an elite after-school athletics programme for 9-year-olds that happens to be adjacent to the hospital. We want the medication to be tested on a systematic cross-section of children, or on a group of children that we know is composed of more and less healthy kids across a defined age range, so that we can draw conclusions about all children, based on our sample. If we use a statistical analysis of EEBO (an arbitrary sample) to draw conclusions about Early Modern print (a population), it’s as though we’re using an arbitrary sample of available kids to prove that a medication is safe for the population of all kids. (Linguistics is a lot safer than epidemiology.)

If one were interested in reliably representing extant Early Modern print, one might design a representative sample in various ways. It would be possible to systematically identify genres or topics or even text lengths, and ensure that all were sampled. If we took on such a project, we might want to ensure sampling all genders, education levels, and so on (indeed, historical English corpora such as the Corpus of English Dialogues, or ARCHER, are systematically sampled in clear ways). We would need to take decisions about proportionality – if we’re interested in comparing the writing of men and women, for example, we might want large, equal samples of each group. But if we wanted proportional representation across the entire population of writers, we might include a majority of men, with a small proportion of women – reflecting the bias in Early Modern publishing. Or, we might go further and attempt to represent not the bias in Early Modern publication, but instead the bias in Early Modern reception, attempting to represent how many readers actually read women’s works compared to men’s works (though such metadata isn’t readily available, and obtaining it would be a project in itself). Each of these decisions might be appropriate for different purposes.
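
The difference between the two proportionality decisions can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the counts of 950 texts by men and 50 by women are invented and are not estimates of Early Modern publishing, and the function names are our own.

```python
def proportional_allocation(strata_sizes, sample_size):
    """Sample each group in proportion to its share of the population.

    Rounding can shift the total by a text or two for awkward numbers.
    """
    total = sum(strata_sizes.values())
    return {
        name: round(sample_size * size / total)
        for name, size in strata_sizes.items()
    }

def equal_allocation(strata_sizes, sample_size):
    """Sample the same number from each group, capped at the group's size."""
    per_group = sample_size // len(strata_sizes)
    return {name: min(per_group, size) for name, size in strata_sizes.items()}

# Hypothetical population of texts by author gender (illustrative only).
population = {"men": 950, "women": 50}

proportional = proportional_allocation(population, sample_size=100)
equal = equal_allocation(population, sample_size=100)

print("Proportional:", proportional)
print("Equal:", equal)
```

The proportional design mirrors the bias in what was published, leaving very few texts by women to analyse; the equal design supports comparison between the groups but no longer reflects the population as a whole.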

So, what are we to do? LDNA hasn’t thrown stats out the window, nor have we thrown EEBO out the window. But we are careful to remember that our statistics are describing EEBO rather than indicating conclusions about a broader population. And we haven’t stopped there – we will draw conclusions about Early Modern print, but not via statistics, and not simply via the sample that is EEBO. Instead, we will draw such conclusions as close readers, linguists, philologists, and historians. We will use qualitative tools and historical, social, cultural, political, and economic insights about Early Modern history, in systematic and rigorous ways. Our intention is to read texts and contexts, and to evaluate those contexts in relation to our own knowledge about history, society, and culture. In other words, we are taking a principled interpretive leap from EEBO to Early Modern print. That leap is necessary, because there’s no inherent representative connection between the two.

Under the surface: SHARP, LDNA and sundry sources

This blog post excerpts material Iona wrote reflecting back on her contribution to the SHARP conference in Paris in July 2016, building on the work of her PhD thesis and incorporating material and processes that have formed part of the Linguistic DNA project. The full post can be found on Iona’s personal blog.

In preparation for the paper, I dedicated time to manually extract, compile and refine measurements for some of the early outputs from the LDNA processor. To fit in with the pledges of my abstract, I targeted the associations of valour and valiant in subsets of EEBO-TCP.

During my PhD, I used EEBO-TCP to provide context for my work with early modern bibles. Valour entered the equation as I examined trends in the translation of a Hebrew collocation gibbor chayil. In the King James Version (publ. 1611) most gibbor chayil men are “mighty . . . of valour”. The repetition of this phrase across the translation means that English bible readers could form associations between the group of characters referred to, in a similar manner to those who encounter the Hebrew narrative directly. For this to happen in translation shows that the translators recognised and (sometimes) prioritised the transmission of this connection; in this respect “mighty of valour” is a partial example of a larger trend in favour of a more technical approach to translation, a move likely influenced by the increasing use of precise cross-referencing in bible reading (facilitated by the introduction of verse numbers throughout the Bible, an innovation of the 1550s). Yet the phrase is intrinsically interesting because before that “valour” was not part of the English biblical lexicon.

Collating instances of gibbor chayil demonstrates that the lexically related “valiant” was used in earlier translations, but in a piecemeal manner (illustrated by the changing distribution of black square bullets in the diagram below).

This diagram, extracted from my SHARP presentation, is one of a series colour-coded to highlight consistency within individual versions with a focus on the characterisation of Boaz. The black square bullets are added to highlight where a form of ‘valiant’ (or for KJ ‘valour’) was used.

By exploring the words valiant and valour with the LDNA tools, I was able to corroborate the impression I had formed during my earlier quantitative and qualitative analysis which was conducted via a standard EEBO-TCP interface.

The PhD bit

Searching for hits in the population for the first century of English print (to 1570) and comparing them with those for the next half century (a collection of documents three times the size), I had observed that the frequency of both valiant and valour increased markedly above expectation.

Comparison of word frequency (hits) and distribution (records, hits per record) in EEBO-TCP for 1473-1570 (P1) and 1571-1620 (P2) expressed in ratios.
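
The reasoning behind "above expectation" can be sketched as follows. The hit counts are placeholders, not the figures from the thesis; the only number taken from the text is the roughly threefold difference in collection size, and the function name is our own.

```python
def growth_vs_expectation(hits_p1, hits_p2, size_ratio):
    """Compare growth in hits with growth in collection size.

    Returns the raw ratio of hits (P2 / P1) and that ratio divided by the
    size ratio. A second value above 1 means the word grew faster than
    the collection did.
    """
    hit_ratio = hits_p2 / hits_p1
    return hit_ratio, hit_ratio / size_ratio

# Placeholder counts for one word in each period (illustrative only).
hits_p1 = 200    # hypothetical hits, 1473-1570
hits_p2 = 1500   # hypothetical hits, 1571-1620
size_ratio = 3.0  # P2 is about three times the size of P1

hit_ratio, relative = growth_vs_expectation(hits_p1, hits_p2, size_ratio)
print(f"Hits grew {hit_ratio:.1f}x; {relative:.1f}x what collection size alone predicts")
```

A word whose use was stable would show a hit ratio close to the size ratio, so the second figure would sit near 1.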

Scrutinising the data by decade exposed some significant textual influences. To quote from my thesis:

87 per cent of occurrences of “valiant” in the corpus for 1520-1529 (316 of a total 363) appear in a two-volume translation of the French chronicles of Froissart, while two other translated works account for a further 9 per cent; just 4 per cent of hits occur in ‘indigenous’ texts.

For “valour”,

a jump in the decade 1570-1579 is significantly related to the publication in 1579 of a translation from Italian: 403 of the decade’s 501 hits appear in a one-volume translation of The historie of Guicciardin conteining the vvarres of Italie and other partes (London, 1579). Once such scrutiny is imposed, it becomes evident that translation had a significant role in the increased currency of these two Latinate terms. It is also evident that the words normally appear in certain genres: conduct books concerned with warfare and chivalric behaviour; and chronicles of past history. This contributes to the recognisable sense of valour as “The quality of mind which enables a person to face danger with boldness or firmness; courage or bravery, esp. as shown in warfare or conflict; valiancy, prowess.” [OED s.v. “valour|valor, n.”, §1c.] This sense, cultivated through translation in the course of the sixteenth century, fits the context in which King James’ translators employ the word.

The LDNA bit

The subsets of EEBO-TCP sent through the LDNA processor earlier in the year were intentionally compatible with the periodisation of my thesis, providing windows onto English discourse that could be cross-referenced with the publication of particular bibles. The subsets thus incorporate all transcribed material from EEBO (TCP update 2015) known to have been printed during the following spans:

1520-1539 (cf. Coverdale Bible 1535, Matthew Bible 1537, Great Bible 1539);
1550-1559 (Geneva Bible 1560, Bishops Bible 1568); and
1610-1611 (Douai Old Testament 1609-10, King James Version 1611).

Taking the first and last of these, measuring PMI in windows of discourse around the word “valour”, we find marked change in the prominent associations. Our approach yields plentiful data, and we are still thinking through the challenges of visualisation. In the slide shown, I have coloured associated terms according to the innermost window in which the co-occurring lemma rises to prominence. Thus red terms occur frequently in the narrowest window around valour (+/-1 word), orange terms in the expanded window (+/-10 words) that might approximate the surrounding sentence, green for +/-50 words (which now forms the default window size in our public interface) and blue for the wide discursive window of +/-100 words. (Many lemmas appear in more than one window, and the list shown for the later period does not reach to some relevant low frequency items such as “prowess”.)
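To make the idea of PMI within a window more concrete, here is a minimal sketch. It is not the LDNA processor: it assumes a corpus already reduced to a flat list of lemmas, and it estimates PMI as the log of how much more likely a lemma is inside the window around the node word than in the corpus overall. The function name and the toy token list are hypothetical.

```python
import math
from collections import Counter


def window_pmi(tokens, node, span):
    """Score each lemma found within +/- span tokens of node.

    PMI is estimated here as log2(P(lemma | near node) / P(lemma)).
    """
    total = len(tokens)
    freq = Counter(tokens)
    near = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        start, stop = max(0, i - span), min(total, i + span + 1)
        for j in range(start, stop):
            if j != i:  # skip the node occurrence itself
                near[tokens[j]] += 1
    slots = sum(near.values())  # all positions that fell inside a window
    return {
        lemma: math.log2((count / slots) / (freq[lemma] / total))
        for lemma, count in near.items()
    }


# Toy input, for illustration only; the window sizes mirror those in the slide.
toy = "mighty man of valour and prowess in war but of little valour in trade".split()
for span in (1, 10, 50, 100):
    scores = window_pmi(toy, "valour", span)
    print(span, sorted(scores, key=scores.get, reverse=True)[:3])
```

Running the same function with progressively wider spans shows which lemmas only become prominent once the window expands, which is the distinction the colour coding is meant to capture.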

What should be visible is a distinction between the use of “valour” as a synonym of value or worth (prominent in the 1520-1539 subset), and the association with conduct in conflict (dominant in the 1610-1611 dataset). Both senses were part of the Latin root “valeo” and, had King James’ translators ventured it, both could have been played upon to make even more “mighty men of valour” in 1611. (One of the exceptions comes at 2 Kings 15:20, where Menachem taxes all gibbor chayil men, “mighty men of wealth” in the KJV.)

Inevitably, the observations I could draw from this investigation are not part of the bottom-up process that LDNA strives to achieve. But the exercise has helped me to think through some different ways we will want to be able to interrogate our data and to study the effects of some different baselines for our expectation calculations. And it demonstrates, I think, the valour of conducting semantic enquiries through discursive windows.

_____

Notes

Thesis quotations are from: I. C. Hine, “Englishing the Bible in early modern Europe: The case of Ruth”, PhD thesis (University of Sheffield, 2014), p. 163. These numbers reflect searches conducted through the Chadwyck EEBO interface using its variant spelling option.

The datasets employed in my thesis are not quite identical to those used by the project: LDNA uses a slightly expanded version of the EEBO-TCP collection (last updated early 2015) with its spelling regularised and tokens lemmatised locally using MorphAdorner.

What does EEBO represent? Part I: sixteenth-century English

Ahead of the 2016 Sixteenth Century Conference, Linguistic DNA Research Associate Iona Hine reflected on the limits of what probing EEBO can teach us about sixteenth-century English. This is the first of two posts addressing the common theme “What does EEBO represent?”

The 55 000 transcriptions that form EEBO-TCP are central to LDNA’s endeavour to study concepts and semantic change in early modern English. But do they really represent the “universe of English printed discourse”?

The easy answer is “no”. For several reasons:

As is well documented elsewhere, EEBO is not restricted to English-language texts (cf. e.g. Gadd). Significant bodies of Latin and French documents printed in Britain have been transcribed, and one can browse through a list of other languages identified using ProQuest’s advanced search functionality. To this extent, EEBO represents more than the “universe of English printed discourse”.

But it also represents a limited “universe”. EEBO can only represent what survived to be catalogued. Its full image records represent individual copies. And its transcriptions represent a further subset of the survivals. As the RA currently occupied with reviewing Lost Books (eds. Bruni & Pettegree),* I have a keen awareness of the complex patterns of survival and loss. A prestigious reference work, the must-buy for ambitious libraries, might have a limited print run and yet be almost guaranteed survival–however much it was actively consulted. A popular textbook, priced for individual ownership, would have much higher rates of attrition: dog-eared, out-of-date, disposable. Survival favours genres, and there will be gaps in the English EEBO can represent.

The best function of the “universe” tagline is its emphasis on print. We have limited access to the oral cultures of the past, though as Cathy Shrank’s current project and the Corpus of English Dialogues demonstrate, there are constructions of orality within EEBO. Equally, where correspondence was set in print, correspondence forms a part of EEBO-TCP. There is diversity within EEBO, but it is an artefact that relies on the prior act of printing (and bibliography, microfilm, digitisation, transcription, to be sure). It will never represent what was not printed (and this will mean particular underprivileged Englishes are minimally visible).

There is another dimension of representativeness that matters for LDNA. Drawing on techniques from corpus linguistics makes us aware that, in the past, corpora (collections of texts produced in order to control the analysis of language-in-use) were compiled with considerable attention to the sampling and weighting of different text types. Those using them could be confident about what was in there (journalism? speech? novels?). Do we need that kind of familiarity to work confidently with EEBO-TCP? The question is great enough to warrant a separate post!

The points raised so far have focused on the whole of EEBO. There is an additional challenge when we consider how well EEBO can represent the sixteenth century. Of the ca. 55 000 texts in EEBO-TCP, only 4826 (less than 10 per cent) represent works printed between 1500 and 1599. If we operate with a broader definition, the ‘long sixteenth century’, and impose the limits of the Short Title Catalogue, the period 1470-1640 constitutes less than 25 per cent of EEBO-TCP (12 537 works). And some of those will be in Latin and French!
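The proportions quoted above are easy to check. The short sketch below takes the approximate total of 55 000 texts at face value, so the results are indicative rather than exact; the function name is my own.

```python
TOTAL_TEXTS = 55_000  # approximate size of EEBO-TCP, as quoted above


def share(count, total=TOTAL_TEXTS):
    """Return count as a percentage of total."""
    return 100 * count / total


print(round(share(4826), 1))   # 1500-1599: 8.8 per cent
print(round(share(12537), 1))  # 1470-1640: 22.8 per cent
```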

Of course, some sixteenth-century items may be long texts–and the bulging document count of the 1640s is down to the transcription of several thousand short pamphlets and tracts–so that the true weighting of long-sixteenth-century-TCP may be more than the document figures indicate. Yet the statistics are sufficient to suggest we proceed with caution. While one could legitimately posit that the universe of English discourse was itself smaller in the sixteenth century–given the presence of Latin as scholarly lingua franca–it is equally the case that the evidence has had longer to go missing.

As a first post on the theme, this only touches the surface of the discussion about representativeness and limits. Other observations jostle for attention. (For example, diachronic analysis of EEBO material is often dependent on metadata that privileges the printing date, though that may be quite different from the date of composition. A sample investigation of translate’s associations immediately uncovered a fourteenth-century bible preface printed in the 1550s, exposed by the recurrence of Middle English forms “shulen” and “hadden”.) Articulating and exploring what EEBO represents is a task of some complexity. Thank goodness we’ve another 20 months to achieve it!

* Read the full Linguistic DNA review here. The e-edition of Bruni & Pettegree’s volume became open access in 2018.

Digital Humanities 2016, Kraków

Conference reflections jointly written with Justyna Robinson

Four members of the LDNA team—Marc Alexander, Justyna Robinson, Brian Aitken, and Fraser Dallachy—attended this year’s Digital Humanities (DH) conference in Kraków, Poland. With over 800 attendees, the conference is an excellent opportunity to exchange ideas, learn of new areas of potential interest, and network with academics from around the world. The team presented a version of the project’s poster at the event (attached to this post), giving an overview of the project, the technical steps which have been taken so far, and introducing the research themes.

Digital methods of textual analysis are an important subject for the DH attendees, and there were several papers outlining approaches and results from such research. One of the most relevant of these for us was the paper by Glenn Roe et al. on identification of re-used text in Eighteenth Century Collections Online (ECCO). After eliminating re-printings of texts, this project used a specially developed tool which found repeated passages, indicating where an author had re-used their own or another’s words. The results are available and searchable on their website. In the same session, a team led by Monica Berti at Leipzig described a method of identifying and labelling fragments of text quoted from ancient Greek authors. These projects represent something like a parallel research track to ours, tracing the history of ideas through replication of passages rather than through more abstract word clusters. Early English Books Online (EEBO) also received some attention, with Daniel James Powell giving an overview of its history and importance to digital research on historical texts.

Discussion with other attendees at the poster session was especially productive, and resulted in several strong leads for the team to follow up. A subject which was mentioned to us repeatedly was that of topic modelling. Multiple panels were dedicated to the use of these methods to extract information about the contents of texts, an approach which LDNA has considered employing. The team at Saarland studying the Royal Society Corpus (with whom LDNA is already in contact) use topic modelling to study the development of scientific concepts and terminology. Their results were encouraging, allowing them to identify word groupings which represent scientific disciplines such as physiology, mechanical engineering, and metallurgy. Following these topics through time showed that the number of topics increases whilst their vocabulary becomes more specialised. Although LDNA has reservations about how useful topic modelling is for our purposes, the work being conducted at Saarland refines and implements its methodology in a way which we would seek to learn from if we do choose to pursue it further.

At the poster session

Visualising big data is of central interest to the LDNA project, especially in the context of the upcoming LDNA Visualisation Workshop. With this view in mind, we paid particular attention to projects that presented new and interesting ways of seeing large data. A number of presentations focused on network visualisations. These often link metadata, e.g. around social networks of royal societies or academies as based on letter correspondence. An interesting visualisation of unstructured linguistic data was presented by the EPFL team. Vincent Buntinx, Cyril Bornet, and Frédéric Kaplan visualised lexical usage in 200 years of newspapers on a circle with the radial dimension representing the number of years a word has been in use, and the circumferential dimension showing a period of use of words. [1]

Stylometrics, with its interest in being able to identify and measure aspects of language which contribute to the impression of authorial style, produced some interesting papers. One of the common themes for stylometrics and other DH strands of research is the way concepts are operationalised. The varied approaches to concepts taken by DH researchers were noticeable, for example, whether each noun can be considered to be a concept, or a concept can be defined as “a functional thing”. This suggests that the work on concept identification undertaken by the LDNA team will be of interest to the wider DH community. Also amongst the stylometric papers was a look at historical language change by Maciej Eder and Rafał Górski which used bootstrap consensus network analysis on part-of-speech (POS) tagged texts to contrast syntax and sentence structure between time periods. The paper used multidimensional scaling (MDS) to reduce POS tagged texts to a single value which could then be plotted against time, allowing them to show that a gradual change in the MDS results can be discerned between the earliest and latest texts. The paper highlighted both how useful a visualisation can be for identifying a change and how difficult it can be to quantify exactly what the visualisation shows.

On a different but very important note, a strong theme of the conference was that of diversity, with a thread of panels discussing the different ways in which this subject is applicable to the digital humanities. From a personal point of view, I think LDNA has a strong awareness of both the scope and the limitations of our interests and approaches (although we can never afford to be complacent). We’ve considered what our textual resources represent, and the RAs are soon to explore this subject from different angles in future blog posts. EEBO and other text collections are more expansive, inclusive, and diverse than prior research has been able to access, and this feels like a part of an enormously positive movement in academia to open up more and more data for new kinds of study. As extensive as our resources are, however, they still have limitations reflecting the (mostly Western, mostly white, mostly male, mostly middle-to-upper class) societal groups who were able to read, write, and print the words which ended up in these collections. The resources open to academia are continually growing, and hopefully this expanding diversity will open up ever more of the world’s knowledge to ever more of its population. Whilst the discussions at this conference have made clear that there is a long way to go in fully embracing diversity in the digital humanities, there are indications that the situation is improving, and it is incumbent upon us all to ensure that this continues.

For another view of the conference, Brian Aitken, Digital Humanities Research Officer at Glasgow, has written about his own experience on his blog.

———

1. Studying Linguistic Changes on 200 Years of Newspapers, Vincent Buntinx, Cyril Bornet, Frédéric Kaplan (EPFL (École polytechnique fédérale de Lausanne), Switzerland)

Text Analytics at Sheffield DH Congress

Earlier in the year (2016), we issued a special call for papers, inviting others to join LDNA panel sessions at the Sheffield Digital Humanities Congress. We were delighted by the responses, and further delighted that the full DHC programme includes plenty of other material relevant to our text analytics interests–and a noticeable body of book-historical input too.

As a special privilege for those who follow the LDNA blog, here are two bonus abstracts outlining our conception of each LDNA panel:

TA 1: Between numbers and words

Session 4, Friday 9 September
ft. Hine, Shute, Siirtola et al.

Digitisation of texts facilitates kinds of statistical analysis that were previously difficult and perhaps impossible for humans to carry out. This series of papers explores the interface between statistics and close reading, teasing out how these modes of textual analysis can be applied jointly to explore and analyse the material, lexical and semantic form of constituent texts. We discuss the use of quantitative analysis to reassess hypotheses about the work of compositors in fifteenth-century printing. We scrutinise a blueprint for moving between statistical data and words-in-context within collections too big for human reading (with special attention to concept formation). Lastly, we demonstrate how one newly-enhanced visualisation tool assists exploratory analysis to generate insights about genre and social variables in digital text collections including early modern correspondence and international Englishes.

TA 2: Identifying complex meanings in historical texts

Session 7, Friday 9 September
ft. Mehl, Recchia, Makela, et al.

With recent advances in computational tools and techniques, researchers are moving closer to the goal of identifying and describing complex meanings—semantic, discursive, social, and otherwise—in historical texts. This session approaches that goal from multiple angles. We discuss semantic meaning in terms of distributional semantic techniques, which connect the study of meaning in the humanities with the quantitative study of language in computational linguistics. We discuss discursive meaning via topic modelling techniques, and also explore the theoretical space between distributional semantics and topic modelling. Finally, we discuss social and historical meanings by looking at possibilities for analysing extra-linguistic contexts alongside linguistic data, within carefully annotated, structured data sets.

If that’s whetted your appetite, you will find full abstracts for each paper–and for every paper in the Congress–on the main DHC site.

Last registration date is 7 September.