Author Archives: Iona Hine

Talk About Change

In a time when events seem ever and ever out of our control, writing is resistance.
–Our Mel.

In April (2018), Linguistic DNA began collaborating with local social entrepreneurs Our Mel to do some collective thinking about the power of language. This work is funded by the University of Sheffield’s Festival of the Mind and our work together will culminate in a spoken-word performance in the Festival’s Spiegeltent (pictured) this September.

The collaboration also involves 500 Reformations: exploring stories of change, from 1517 to 2018, a University of Sheffield public engagement project headed up by Linguistic DNA researcher Iona Hine.

Together, our goal is “TALK ABOUT CHANGE”.

More specifically Talk About Change is pursuing conversations about the history and power of language, particularly as experienced by people of colour. The first sessions will incorporate a provocation based on historical research, working through themes including diversity, feminisms, race, and resilience. Talking, sharing, debating, we hope participants will join us and engage in acts of creative resistance—in thought, speech, and writing.

What are we actually doing?

Throughout July and August, novelist and creative writer Désirée Reynolds will be leading a series of workshops, hosted by Our Mel, to discuss words and themes including race, feminisms, and diversity. The July workshops are themed and will each include input from a University of Sheffield researcher. The August workshops continue to explore related ideas, developing creative writing under the common heading “writing is resistance”.

Those who choose may publish their writing in an anthology, and we will also present a collective spoken-word performance (optional!) on Sunday 23 September as part of the Festival of the Mind programme.

Who can participate?

Our Mel issue a collective invite to come along and engage in conversation about “words that affect us every day”. What have they meant, how are they used, and what do they mean to us?

People of all ethnicities are welcome, and participants are encouraged to embrace their heritage. Participation is limited to over 18s.

Visit Our Mel’s website for more information about the workshops.



Logo for Melanin Fest

Rooted in Yorkshire and based in Sheffield, OUR MEL is a social enterprise dedicated to exploring cultural identity, Black history and what it means to be a person of colour in Britain today. Inspired by two local lasses (Annalisa Toccara & Gabriela Thompson-Menanteaux) on a journey of self-love, Our Mel was born in November 2016 over a pack of caramel biscuits and a cup of tea, Yorkshire of course. Since its birth, Our Mel has grown into a community of people on a mission to support, encourage, teach and build the community through music, film, arts and education. In October 2017, we launched Sheffield’s first collaborative Black History Month festival, MelaninFest, and its sister MelaninFest in London. In all, 1300 people attended 43 events in Sheffield and 5 in London. Our Mel has been at the forefront of creating diversity, inclusion and representation in Sheffield since November 2016, working in collaboration with festivals and organisations both nationally and internationally. @our__mel

ANNALISA TOCCARA is a Marketer & PR professional, Community Activist & Creative Director. Based in Sheffield and founder of the social enterprise Our Mel, Annalisa launched Sheffield’s first Black History Month Festival, MelaninFest®, in October 2017, which saw a total of 43 events spread across the month in collaboration with over 40 partners, and also launched a sister festival in London. Since then, Annalisa has hosted a number of community events celebrating Black excellence, Black talent and Womanhood. Through her work with Our Mel and previous social justice endeavours, she has developed a passion for arts and culture, having seen first-hand how creative mediums can help shape and create social cohesion within our community. Annalisa also has a BA (Hons) in Biblical Study and Applied Theology with a Diploma in Leadership and is currently studying for her Chartered Marketer status. She is also the Vice-Chair of the BAMER Hub – Sheffield’s Equality Hub Network. @sparklelikegold

DÉSIRÉE REYNOLDS started her writing career in South London as a freelance journalist for the Jamaica Gleaner and the Village Voice. She has since written film scripts, poetry and short stories. Some of her shorts are published on SABLE E-Mag and in various anthologies. Her first novel, “Seduce”, was published by Peepal Tree Press in 2013 to much acclaim. She continues to work as a journalist, teacher, broadcaster and DJ. Désirée is currently working on a collection of short stories, a novel based on the Haitian revolution and her PhD. — “After spending a lot of time, doing lots of things, I’m finally where I’m supposed to be, doing what I’m supposed to do.”


500 REFORMATIONS collaborates with external partners to explore and tell stories of change, from the cultural to the personal. Based at the University of Sheffield, 500 Reformations draws on research from across the Faculty of Arts and Humanities. Activities are united by the theme of reformation, whether writ large (e.g. churches breaking away from Roman Catholic control in the sixteenth century, ‘the Reformation’) or small (in individual stories of change, development and re-form). @500Reformations

Lost Books

On “Lost Books” (ed. Bruni & Pettegree)

Review: Lost Books: Reconstructing the Print World of Pre-Industrial Europe. Ed. Flavia Bruni and Andrew Pettegree. Library of the Written Word 46 / The Handpress World 34. Leiden & Boston: Brill, 2016. 523 pages.

We solicited this book for review because we have been keenly aware that we cannot take what has been transcribed and preserved through the digitisation processes of Early English Books Online and the Text Creation Partnership as an accurate indication of all the material that was printed in the early modern period. Setting aside the idiosyncrasies of selectivity in the composition of EEBO-TCP, which have been documented elsewhere, there is a prior ‘selectivity’ about what survived to be catalogued.

The volume collects together the proceedings of the Lost Books conference held at the University of St Andrews in June 2014, and divisions within the volume loosely reflect those of the original call.

Pettegree’s introduction, “The Legion of the Lost”, is a full-length essay discussing not only how books become lost but how one can know about what has been lost. It is accessible and engaging and would be a worthy reading assignment for undergraduates or masters students studying book history. As observed in a prior blogpost, “While the chapter … performs the function of uniting what follows, and does at times point to specific contents in the coming chapters, there is nothing of the clunkiness that one sometimes observes in the introduction of an edited collection.”

The two essays that follow both approach the challenge of assessing the loss of incunabula, i.e. print materials from pre-1500. Falk Eisermann begins with a comparison of the listings in the Gesamtkatalog der Wiegendrucke with the Incunabula Short Title Catalogue. He probes possible methods for distinguishing items that were printed (and lost) from items never-printed, giving examples from archival sources that defy expectation: “lost editions by unknown printers (sometimes located in incunabulistic ghost towns), containing texts not preserved anywhere else, even representing works of hitherto unrecorded authors” (43). The book historians’ task, one may imagine, is an uphill struggle; optimistically, there is fresh work to be done as no one has yet analysed the customary discussion of other printed works in paratext “with regard to dark matter” (50). Jonathan Green and Frank McIntyre (Chapter 3) aim to quantify the losses, offering an open discussion of the pitfalls of particular statistical approaches to this question. They recommend modelling the counts of surviving copies as a negative binomial distribution, accommodating correlation in loss and survival. For—and this is significant to LDNA—“books are not preserved or destroyed independently of each other” (59). Small items are more likely to survive if bound together; volumes in a library often share a common destiny. In addition, taste is a cultural construct with ideas of fashion and significance affecting more than one owner’s decision to dispose of or conserve. Taking into account variations of format, Green and McIntyre suggest that as much as 30 per cent of Quarto editions may have been lost entirely, compared with 60 per cent of Broadsides and 15 per cent of Folios.
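
To make the modelling step more concrete, here is a minimal sketch of how a negative binomial model of surviving-copy counts can yield an estimate of editions lost entirely. The function names and any parameter values passed to them are illustrative assumptions, not code or figures from Green and McIntyre's chapter; in practice the mean and dispersion would have to be fitted to the counts of editions that do survive.

```python
def prob_zero_survivors(mean_copies: float, dispersion: float) -> float:
    """P(X = 0) for a negative binomial with the given mean and dispersion (size)."""
    # In the mean/size parameterisation, P(0) = (r / (r + mu)) ** r.
    return (dispersion / (dispersion + mean_copies)) ** dispersion


def estimate_lost_editions(observed_editions: int,
                           mean_copies: float,
                           dispersion: float) -> float:
    """Estimate editions with no surviving copy from those with at least one."""
    p0 = prob_zero_survivors(mean_copies, dispersion)
    # observed = total * (1 - p0), hence lost = total * p0 = observed * p0 / (1 - p0).
    return observed_editions * p0 / (1.0 - p0)


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to show the calculation.
    lost = estimate_lost_editions(observed_editions=1000, mean_copies=2.0, dispersion=0.5)
    print(f"Estimated lost editions: {lost:.0f}")
```

A small dispersion value spreads the counts more unevenly than a Poisson model would, which is one way of allowing for the fact that books are not lost independently of each other.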

Part 2 is composed of national case-studies covering vocal scores from Renaissance Spain (Chapter 4, showing a markedly persistent repertoire conserved by copying when required); evidence of book ownership and circulation in pre-Reformation Scandinavia (Chapter 5, conducted with the help of inventories); the meticulous reconstruction of a lost Polish original on the basis of later editions (Chapter 6, touching also on the circulation of fortune-telling books throughout early modern Europe); a study of the Stationers’ Company Register (Chapter 7); a sheet-count-based model for calculating loss of seventeenth-century materials based on records for the Southern Netherlands—using the metadata-rich STCV, which also positions title-page engraving and roman typeface as features positively correlated with survival (Chapter 8); the identification of patterns of loss using book advertisements from the Dutch Republic (Chapter 9—exposing partly the proliferation of multiple localised editions); and a report weaving together a census of seventeenth-century Sicilian printing activity with a legal dispute over the library of Francesco Branciforti, attesting strong local attachment to this private collection (Chapter 10).

In Part 3, Christine Bénévent and Malcolm Walsby revisit the publication history of Guillaume Budé’s apophthegms (Chapter 11), combining careful study of the layout to demonstrate Gazeau’s compositor pretended to a new edition by replacing the first quire, with a call not to dismiss the “intellectual value” of later editions, noting the Paris copy of De L’Institution du Prince had the highest survival rate and was owned “by the most influential and powerful in early modern Europe” (including Edward VI, 252). Michele Camaioni aims to reconstruct a censored (but popular) mystical text using its censorship record (Chapter 12). Three further chapters draw on data from the RICI project, a study of Italian religious orders’ book ownership based on a Vatican-led census: Rosa Marisa Borraccini documents Girolamo de Palermo’s “unknown best-seller”, a devotional work running to “plausibly . . . more than one hundred” editions (Chapter 13); Roberto Rusconi probes weaknesses in the cataloguing, involving misspelt transcriptions, inadequate shorthand (opera omnia, etc.) and perhaps the deliberate disguising of works by disapproved authors (Chapter 14); and Giovanni Granata attempts to merge statistical extrapolation of lost works with study of specific lost editions based on the bibliographic records produced by the census (Chapter 15).

Part 4 is dedicated to lost libraries. Anna Giulia Cavagna observes the motives of Alfonso del Carretto, an exiled monarch whose self-catalogued collection prioritised texts pertaining (mostly through paratext such as dedications) to people whose powerful patronage he wished to secure, revealing books as “vectors of social relations” (357, Chapter 16). Martine Julia van Ittersum pursues the preservation and loss of Hugo Grotius’ personal collections, observing that preservation required “neglect, though not too much of it” (384) and that the preservation of printed materials was correlated with loss of manuscript (Chapter 17). Federico Cesi, the target of Maria Teresa Biagetti’s study (Chapter 18), was the founder of the Accademia dei Lincei in Rome; his now dispersed collection included works of botany, zoology, alchemy, and medical texts, its components known through correspondence and post mortem inventory. Sir Hans Sloane’s collections, including printed books “estimated at about 45,000 volumes”, formed the kernel of what is now the British Library; Alison Walker explains the difficulties of tracing Sloane’s books, which when duplicated by other collections were often dispersed through sale or gifting, or migrated at the creation of new specialist institutions such as the Natural History Museum. By reconstructing the collection, Walker argues, one may attain a “reflection . . . of the intellectual environment of the day” and of “Sloane himself as a scientist and physician” (412, Chapter 19). The last chapter in Part 4 outlines the hopes of the AHRC research network ‘Community Libraries: Connecting Readers in the Atlantic World’, using a case study from Wigtown (SW Scotland) to show how archival resources about the creation and use of libraries yield insight into sociability (Chapter 20); we find widows borrowing while patrons gain more from the bureaucracy and facilitation than the library’s holdings.

The last section (Part 5), entitled “War and Peace”, considers the woes that have befallen historic collections in more recent times. Jan Alessandrini discusses Hamburg’s collections, protection measures during the Second World War, the seizure of private Jewish libraries, and the political challenges of reconstruction (with some prospect of help from Russian digitisation, Chapter 21). Tomasz Nastulczyk acknowledges that “Swedish pillaging paradoxically helped to preserve” books from the Polish-Lithuanian Commonwealth that might otherwise have been lost (462, Chapter 22). Co-editor Flavia Bruni writes of the successful preservation of Italian archives and libraries aided by “a clear and centralised policy” in WW2, arguing that “international agreements” are also essential if cultural heritage is to be preserved (484, Chapter 23). The closing chapter is devoted to broadsheet ordinances, lost—or perhaps missing—as a result of the collapse of Cologne city archives in 2009; happily, microfilm means all is not lost, and Saskia Limbach also successfully traces invoices and other evidence of print activity through a range of archival sources (Chapter 24).

It will be evident from this account that the case studies are drawn from across Europe, with three chapters directly addressing British material. Of these, Alexandra Hill’s intersects most closely with the period Linguistic DNA has focused on so far, with the Register containing “with some exceptions [e.g. government publications and school books], . . . all the books authorised to be printed during the Elizabethan, Jacobean and early Caroline periods” (144–5). Comparing this information with the English Short Title Catalogue, Hill shows that for the 1590s, the survival rate of fiction and ballads is significantly lower than other genres of publication; in addition, within a relatively well-preserved domain such as religious literature, subcategories may fare disproportionately badly as is the case for prayer books, destroyed—Hill hypothesises—by continual use. These kinds of absences need to be borne in mind as we proceed to analyse the survivors. Of course, given the cultural traffic of early modern Europe, much of what is learned from non-British collections is also relevant for thinking critically about how texts survived, how others were lost, and how Linguistic DNA should correspondingly limit the claims built on the print discourse of EEBO-TCP.

As of summer 2018, Lost Books is open access and freely available online for all to read.

What does EEBO represent? Part I: sixteenth-century English

Ahead of the 2016 Sixteenth Century Conference, Linguistic DNA Research Associate Iona Hine reflected on the limits of what probing EEBO can teach us about sixteenth-century English. This is the first of two posts addressing the common theme “What does EEBO represent?”

The 55 000 transcriptions that form EEBO-TCP are central to LDNA’s endeavour to study concepts and semantic change in early modern English. But do they really represent the “universe of English printed discourse”?

The easy answer is “no”. For several reasons:

As is well documented elsewhere, EEBO is not restricted to English-language texts (cf. e.g. Gadd).  Significant bodies of Latin and French documents printed in Britain have been transcribed, and one can browse through a list of other languages identified using ProQuest’s advanced search functionality. To this extent, EEBO represents more than the “universe of English printed discourse”.

But it also represents a limited “universe”. EEBO can only represent what survived to be catalogued. Its full image records represent individual copies. And its transcriptions represent a further subset of the survivals. As the RA currently occupied with reviewing Lost Books (eds. Bruni & Pettegree),* I have a keen awareness of the complex patterns of survival and loss. A prestigious reference work, the must-buy for ambitious libraries, might have a limited print run and yet be almost guaranteed survival–however much it was actively consulted. A popular textbook, priced for individual ownership, would have much higher rates of attrition: dog-eared, out-of-date, disposable. Survival favours genres, and there will be gaps in the English EEBO can represent.

The best function of the “universe” tagline is its emphasis on print. We have limited access to the oral cultures of the past, though as Cathy Shrank’s current project and the Corpus of English Dialogues demonstrate, there are constructions of orality within EEBO. Equally, where correspondence was set in print, correspondence forms a part of EEBO-TCP. There is diversity within EEBO, but it is an artefact that relies on the prior act of printing (and bibliography, microfilm, digitisation, transcription, to be sure). It will never represent what was not printed (and this will mean particular underprivileged Englishes are minimally visible).

There is another dimension of representativeness that matters for LDNA. Drawing on techniques from corpus linguistics makes us aware that in the past corpora, collections of texts produced in order to control the analysis of language-in-use, were compiled with considerable attention to the sampling and weighting of different text types. Those using them could be confident about what was in there (journalism? speech? novels?). Do we need that kind of familiarity to work confidently with EEBO-TCP? The question is great enough to warrant a separate post!

The points raised so far have focused on the whole of EEBO. There is an additional challenge when we consider how well EEBO can represent the sixteenth century. Of the ca. 55 000 texts in EEBO-TCP, only 4826 (less than 10 per cent) represent works printed between 1500 and 1599. If we operate with a broader definition, the ‘long sixteenth century’ and impose the limits of the Short Title Catalogue, the period 1470-1640 constitutes less than 25 per cent of EEBO-TCP (12 537 works). And some of those will be in Latin and French!
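
The proportions quoted above can be checked with a few lines of arithmetic. This is only a sketch: it takes the round figure of 55 000 texts as the total, which is an approximation stated in the post rather than an exact count.

```python
EEBO_TCP_TOTAL = 55_000  # approximate total given in the post ("ca. 55 000")

# Document counts quoted in the post for each period.
counts = {
    "1500-1599": 4_826,
    "1470-1640": 12_537,
}

# Share of each period as a percentage of all EEBO-TCP texts.
shares = {period: 100 * n / EEBO_TCP_TOTAL for period, n in counts.items()}

for period, share in shares.items():
    print(f"{period}: {share:.1f}% of EEBO-TCP texts")
```

Both results sit below the 10 and 25 per cent ceilings mentioned in the text. They measure documents, not words, so they say nothing about the relative length of the texts.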

Of course, some sixteenth century items may be long texts–and the bulging document count of the 1640s is down to the transcription of several thousand short pamphlets and tracts–so that the true weighting of long-sixteenth-century-TCP may be more than the document figures indicate. Yet the statistics are sufficient to suggest we proceed with caution. While one could legitimately posit that the universe of English discourse was itself smaller in the sixteenth century–given the presence of Latin as scholarly lingua franca–it is equally the case that the evidence has had longer to go missing.

As a first post on the theme, this only touches the surface of the discussion about representativeness and limits. Other observations jostle for attention. (For example, diachronic analysis of EEBO material is often dependent on metadata that privileges the printing date, though that may be quite different from the date of composition. A sample investigation of translate’s associations immediately uncovered a fourteenth-century bible preface printed in the 1550s, exposed by the recurrence of Middle English forms “shulen” and “hadden”.) Articulating and exploring what EEBO represents is a task of some complexity. Thank goodness we’ve another 20 months to achieve it!

* Read the full Linguistic DNA review here. The e-edition of Bruni & Pettegree’s volume became open access in 2018.

Chart showing frequency of stem "transl-" in ECCO OCR as % of TCP.

Experimenting with the imperfect: ECCO & OCR

When the Linguistic DNA project was first conceived, we aimed to incorporate more than 200 000 items from Eighteenth Century Collections Online (ECCO). Comparing findings for one portion of ECCO that has been digitised in different ways, this 2016 blogpost details why that ambition proved impractical. The public database uses ECCO-TCP as its main eighteenth-century source.

Dr Kris Heylen at the Humanities Research Institute, Sheffield

Learning with Leuven: Kris Heylen’s visit to the HRI

In 2016, Dr Kris Heylen (KU Leuven) spent a week in Sheffield as an HRI Visiting Fellow, demonstrating techniques for studying change in “lexical concepts” and encouraging the Linguistic DNA team to articulate the distinctive features of the “discursive concept”.

Earlier this month, the Linguistic DNA project hosted Dr Kris Heylen of KU Leuven as a visiting fellow (funded by the HRI Visiting European Fellow scheme). Kris is a member of the Quantitative Lexicology and Variational Linguistics (QLVL) research group at KU Leuven, which has conducted unique research into the significance of how words cooccur across different ‘windows’ of text (reported by Seth in an earlier blogpost). Within his role, Kris has had a particular focus on the value of visualisation as a means to explore cooccurrence data and it was this expertise from which the Linguistic DNA project wished to learn.

Kris and his colleagues have worked extensively on how concepts are expressed in language, with case studies in both Dutch and English, drawing on data from the 1990s and 2000s. This approach is broadly sympathetic to our work in Linguistic DNA, though we take an interest in a higher level of conceptual manifestation (“discursive concepts”), whereas the Leuven team are interested in so-called “lexical concepts”.

In an open lecture on Tracking Conceptual Change, Kris gave two examples of how the Leuven techniques (under the umbrella of “distributional semantics”) can be applied to show variation in language use, according to context (e.g. types of newspaper) and over time. A first case study explored the notion of a ‘person with an immigration background’ looking at how this was expressed in high and low brow Dutch-language newspapers in the period from 1999 to 2005. The investigation began with the word allochtoon, and identified (through vector analysis) migrant as the nearest synonym in use. Querying the newspaper data across time exposed the seasonality of media discourse about immigration (high in spring and autumn, low during parliamentary breaks or holidays). It was also possible to document a decrease in ‘market share’ of allochtoon compared with migrant, and—using hierarchical cluster analysis—to show how each term was distributed across different areas of discourse (comparing discussion of legal and labour-market issues, for example). A second comparison examined adjectives of ‘positive evaluation’, using the Corpus of Historical American English (COHA, 1860-present). Organising each year’s data as a scatter plot in semantic space, the path of an adjective could be traced in relation to others—moving closer to or apart from similar words. The path of terrific from ‘frightening’ to ‘great’ provided a vivid example of change through the 1950s and 1960s.
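
As a rough illustration of the “nearest synonym” step, the sketch below compares words by the cosine similarity of their co-occurrence vectors. The counts are invented toy data and the function names are hypothetical; this is not the Leuven team's code, and real distributional models use far larger vectors with weighting applied to the raw counts.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Invented co-occurrence counts: each position stands for one context word.
vectors = {
    "allochtoon": [8, 6, 1, 0],
    "migrant": [7, 5, 2, 1],
    "fiets": [0, 1, 9, 7],  # "bicycle", included as an unrelated word
}


def nearest(target, vectors):
    """Return the other word whose vector is most similar to the target's."""
    others = (word for word in vectors if word != target)
    return max(others, key=lambda word: cosine(vectors[target], vectors[word]))


print(nearest("allochtoon", vectors))
```

Words that share contexts end up with vectors pointing in similar directions, so the highest cosine score picks out the closest distributional neighbour.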

During his visit, Kris explored some of the first outputs from the Linguistic DNA processor, material printed in the British Isles (or in English) in two years, 1649 and 1699, transcribed for the Text Creation Partnership, and further processed with the MorphAdorner tool developed by Martin Mueller and Philip Burns at Northwestern. Having run this through additional processes developed at Leuven, Kris led a workshop for Sheffield postgraduate and early career researchers and members of the LDNA team in which we learned different techniques for visualising the distribution of heretics and schismatics in the seventeenth century.

The lecture audience and workshop participants were drawn from fields including English Literature, History, Computer Science, East Asian Studies, and the School of Languages and Cultures. Prompted partly by the distribution of the Linguistic DNA team (located in Sussex and Glasgow as well as Sheffield), both lecture and workshop were livestreamed over the internet, extending our audiences to Birmingham, Bradford, and Cambridge. We’re exceedingly grateful for the technical support that made this possible.

Time was also set aside to discuss the potential for future collaboration with Kris and others at Leuven, including participation of the QLVL team in LDNA’s next methodological workshop (University of Sussex, September 2016) and other opportunities to build on our complementary fields of expertise.


Word-cloud for this blog post (generated with Wordle)

Liest thou, or hast a Rewme? Getting the best from VARD and EEBO

This post from August 2015 continues the comparison of VARD and MorphAdorner, tools for tackling spelling variation in early modern English. (See earlier posts here and here.) As of 2018, data on our public interface was prepared with an updated version of MorphAdorner and some additional curation from Martin Mueller at Northwestern.

This week, we’ve replaced the default VARD set-up with a version designed to optimise the tool for EEBO. In essence, this includes a lengthier set of rules to guide the changing of letters, and lists of words and variants that are more suited to the early modern corpus.

It is important to bear in mind that the best use of VARD involves someone ‘training’ it, supervising and to a large extent determining the correct substitutions. But because Linguistic DNA is tackling the whole EEBO-TCP corpus, and the mass of documents within it is far from homogeneous, it is difficult to optimise it effectively.

Doth VARD recognise the second-person singular?

A first effort with the EEBO set-up was to review the understanding formed about how VARD works in relation to verb conjugations for the second and third persons singular. A custom text was written to test the output (using the 50% threshold for auto-normalisation as previously):

If he lieth thou liest. When she believeth thou leavest. 
If thou believest not, he leaveth. Where hast thou been? 
When hadst thou gone? Where hath he walked? 
Where goest thou? Where goeth he?
What doth he? What doeth he? What dost thou? 
What doest thou? What ist? What arte doing?

Most of the forms were modernised just as described in the previous post. However, some of the output gave cause for concern. In the first sentence, “liest” became “least”. Further on “goest” became “goosed”, “doest” was accepted as a non-variant, while both “hast” and “dost” were highlighted as unresolved variants. This output can be explained by looking at the variant and word lists and the statistical measures VARD uses.

VARD’s use of variants and word frequencies

Scrutinising the word and variant lists within the EEBO set-up showed that although the variant list recorded “doest” as an instance of “dost”, “doest” and not “dost” appeared in the word list, overriding that variant. Similarly, “ha’st” appears in the variant list as a form of “hast”, but “hast” is not in the word list. It is not difficult to add items to the word list, but the discrepancies in the list contents are surprising. In fact, it might be more appropriate for VARD to record “doest” as a variant of “do”, and “ha’st” of “have”.

For “liest”, the correct variant and word entries are present so that “liest” can be amended to “lie”, giving a known variant [KV] recall score of 100% (indicating this is not a known variant form of any other word). However, the default parameters (regardless of the F-score) favour “least” because that amendment strongly satisfies the other three criteria: letter replacement [LR] (the rules), phonetic matching [PM], and edit distance [ED]. Until human judgment intervenes with the weighting, “least” has the better statistical case. (Much the same applies to “goest” and “goosed”.)
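
The effect of the weighting can be sketched as a weighted average over the four measures. Everything numeric here is invented for illustration, and the formula is an assumption: the post does not give VARD's actual scoring function, only the names of the measures (KV, LR, PM, ED).

```python
def combined_score(scores, weights):
    """Weighted average of the per-measure scores for one candidate."""
    total_weight = sum(weights.values())
    return sum(weights[m] * scores[m] for m in weights) / total_weight


# Hypothetical per-measure scores for two candidate replacements of "liest".
candidates = {
    "lie": {"KV": 1.0, "LR": 0.0, "PM": 0.0, "ED": 0.3},
    "least": {"KV": 0.0, "LR": 1.0, "PM": 1.0, "ED": 0.6},
}

equal_weights = {"KV": 1, "LR": 1, "PM": 1, "ED": 1}
trained_weights = {"KV": 4, "LR": 1, "PM": 1, "ED": 1}  # known variants favoured


def best(candidates, weights):
    """Pick the candidate with the highest combined score."""
    return max(candidates, key=lambda c: combined_score(candidates[c], weights))


print(best(candidates, equal_weights))    # three measures outvote the known variant
print(best(candidates, trained_weights))  # extra weight on KV reverses the result
```

With equal weights a candidate that wins on three measures beats one that wins only on the known-variant list, which mirrors the behaviour described above; shifting weight towards KV is one way to picture what training does.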

In VARD’s defence, one need only intervene with any of the “-st” verb endings in the text once (before triggering the auto-normalisation process) for the weighting to shift in favour of “liest”. VARD learns well.

Rewme: space, cold, or dominion?

One of the ‘authentic’ EEBO extracts we’ve been testing with is taken from a medical text, A rich store-house or treasury for the diseased, 1596 (TCP A13300). As mentioned in a previous post, when VARD’s automated normalisation is employed with the default 50% threshold, references to “Rewme” become “Room”. Looking again at what is happening beneath the surface, the first surprise is that there is an entry for “rewme” in the variant list, specifying it as a known variant of “room”. This is unsatisfying with regard to EEBO-TCP: a search of the corpus shows that the word form “rewme” appears in 89 texts. Viewing each instance through Anupam Basu’s keyword-in-context interface shows that in 84 texts, “rewme” is used with the meaning “rheum”. Of the other five texts, one is Middle English biblical exegesis (attributed to John Purvey); committed to print as late as 1550, the text repeatedly uses “rewme” with the sense “realm” or “kingdom” (both earthly and divine). The remaining four were printed after 1650 and are either political or historical in intent, similarly using “rewme” as a spelling of “realm”. Nowhere in EEBO-TCP does “rewme” appear with the sense “room”. However, removing it from the known variants (by setting its value to zero) and adding new variant entries for realm and rheum does not result in the desired auto-normalisation: the fact that both realm and rheum are candidates means their KV recall score is halved (50%). At the same time, the preset frequencies strengthen room’s position (309) compared with realm (80) and rheum (50). In fact, the word list accompanying the EEBO set-up seems still to be based on the BNC corpus—featuring robotic (OED 1928) and pulsar (OED 1968) with the same preset frequency as rheum.

So what does this mean for Linguistic DNA?

Again, it is possible to intervene with instances like rewme, whether through the training interface or by manipulating the frequencies. But it is evident that the scale of intervention required is considerable, and it is not obvious that telling VARD that rewme is rheum about 90% of the time that it occurs in EEBO-TCP and realm 10% of the time will have any impact in helping the auto-normalisation process to know when and where to distribute the forms in practice.

The frustrating thing is that the distribution is predictable: in a political text, it is normally “realm”; and in a medical text, it is “rheum”. But VARD seems to have no mechanism to recognise or respond to the contexts of discourse that would so quickly become visible with topic modelling. (Consider the clustering of the four humours in early modern medicine, for example.) I have a feeling this would be where SAMUELS and the Historical Thesaurus come in… if only SAMUELS didn’t rely on VARD’s prior intervention!
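
A minimal sketch of what context-sensitivity could look like follows. It is not a VARD feature: the cue-word lists and the function name are invented for illustration, and a topic model would learn such associations from the corpus instead of relying on hand-picked words.

```python
# Hypothetical cue words, in modern spelling; real early modern text would
# need variant spellings handled too.
MEDICAL_CUES = {"humour", "humours", "phlegm", "cough", "physician", "medicine", "choler"}
POLITICAL_CUES = {"king", "crown", "parliament", "prince", "kingdom", "subjects"}


def resolve_rewme(context_words):
    """Guess the modern form of "rewme" from the words around it."""
    words = {w.lower() for w in context_words}
    medical = len(words & MEDICAL_CUES)
    political = len(words & POLITICAL_CUES)
    if medical > political:
        return "rheum"
    if political > medical:
        return "realm"
    return None  # undecided: leave the form for human review


print(resolve_rewme("a medicine for the cough and rewme".split()))
print(resolve_rewme("the king of this rewme".split()))
```

Returning nothing when the cues are tied or absent keeps the ambiguous cases visible instead of forcing a guess.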

Wordcloud image created with Wordle.


Output from MorphAdorner for "Vrine doeth"

Illustrating the tools: first insights on VARD & MorphAdorner

In 2015, we compared two tools developed to address spelling variation in early modern English: VARD and MorphAdorner. This post documents some of that work, outlining how the design and intent of the two tools affects their impact.

The Sheffield RAs are hard at work on our audit of Early English Books Online, figuring out how best to clean up the TCP data for Linguistic DNA’s research goals. In the last post, Seth documented our intention to try out two of the tools that have been developed for analysing the data: VARD and MorphAdorner. What we wrote there was based on our reading. So what have we found in practice?

VARD: Spell-checking early modern English

Give VARD an early modern text and it will do two things: spell-check and modernise.

When a word matches a word form in current English, VARD will move on to the next word.

Standard historic verb forms such as “leaveth” and “believest” will be modernised (“leaves”, “believe”). Some archaisms survive this process, e.g. “hath”. The second person singular pronoun “thou” is also allowed to stand, though this can result in some ungrammatical combinations; e.g. a VARD-ed text yielding “thou believe”.

For non-standard or unrecognised spellings, VARD identifies and calculates the probabilities of a range of possible solutions. It implements the ‘best’ change automatically (at the user’s instigation) only if its confidence exceeds the chosen parameters. For our samples, we have found that setting the F-Score at 2 (a slight increase in recall) achieves optimal results with the default threshold for auto-normalisation (50%). For the 1659 pamphlet, The nature of Kauhi or Coffe, and the Berry of which it is made, Described by an Arabian Phisitian, this setting automatically resolves “Coffe” to “Coffee” (56.89% confidence) and “Phisitian” to “Physician” (76%).
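The threshold rule can be sketched in a few lines of Python. This is a simplified illustration of the behaviour described above, not VARD’s own code: the function name and data layout are hypothetical, and only the two confidence figures are taken from our sample.

```python
def auto_normalise(word, candidates, threshold=50.0):
    """Return the top-scoring candidate if its confidence exceeds the threshold.

    `candidates` maps each suggested modern form to a confidence percentage.
    If nothing clears the threshold, the original word is returned unchanged.
    """
    if not candidates:
        return word
    best, confidence = max(candidates.items(), key=lambda item: item[1])
    return best if confidence > threshold else word


# Confidence figures reported for the coffee pamphlet.
print(auto_normalise("Coffe", {"Coffee": 56.89}))
print(auto_normalise("Phisitian", {"Physician": 76.0}))

# With a stricter (hypothetical) threshold of 60%, "Coffe" is left alone.
print(auto_normalise("Coffe", {"Coffee": 56.89}, threshold=60.0))
```

Raising the threshold trades recall for precision: fewer forms are changed, but those that are changed carry higher confidence.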

VARD’s interventions are not always so effective. Continuing with the coffee text (EEBO-TCP A37215) we observe that VARD also amends the Arabic month name Ab to Obe, a noun referring to villages in ancient Laconia (50.12% confidence). This weakness reflects the fact that VARD operates with lexicons from a specified language (in this case modern English) and measures the appropriateness of solutions only within that language.1 Another problematic amendment in the sample is the substitution of “Room” for “Rewme” (actually a variant of the noun ‘rheum’).

In the four texts we have sampled, VARD introduces some important corrections. But it also introduces errors, resolving “Flix” (Flux) as “Flex”, “Pylate” (Pilate) as “Pilot”, and “othe” (oath) as “other”. Each such intervention creates one false positive (an occurrence of the wrongly supplied word) and one false negative (a missing occurrence of the intended word) when the texts as a whole are counted and indexed.
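The arithmetic behind that claim can be checked with a short Python sketch. The three mis-normalisations are the ones listed above; the variable names and the idea of comparing two word counts are our own illustrative assumptions, not part of VARD.

```python
from collections import Counter

# (original form, intended modern word, word the normaliser produced)
tokens = [
    ("Flix", "flux", "flex"),
    ("Pylate", "pilate", "pilot"),
    ("othe", "oath", "other"),
]

intended = Counter(gold for _, gold, _ in tokens)
produced = Counter(out for _, _, out in tokens)

# Words that appear in the index although the text never used them.
false_positives = produced - intended
# Words the text did use that are now missing from the index.
false_negatives = intended - produced

print(sum(false_positives.values()), sum(false_negatives.values()))
```

Counter subtraction keeps only positive counts, so each difference lists exactly the words that are over- or under-counted.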

The false positive/negative dilemma also presents when an early modern word form matches the standard form of a different word in modern English. The verbs “do” and “be” appear frequently in EEBO-TCP with a superfluous terminal “e”. To the uninitiated, an index of VARDed EEBO-TCP might suggest heightened interest in deer-hunting and honey-production in the latter quarter of the sixteenth century.2


MorphAdorner is set up for more extensive linguistic processing than VARD. From the basic input (the EEBO-TCP form), it tags or adorns that input with a minimum of five outputs.

  • A unique xml:id
  • An ‘actual spelling’ (spe)3
  • A regularised spelling (reg)
  • A part-of-speech tag (using the NUPOS tag set)
  • A lemma

Initially, MorphAdorner itemises the XML input creating a reference system that allows changes to be mapped and revisited. This is reflected in the xml:id.

The pros and cons of such output and the actual performance of MorphAdorner under observation are better understood when illustrated with some examples.

Output from MorphAdorner for "Vrine doeth"

Figure 1

As shown in Figure 1, in a medical text (A13300) “Vrine” is regularised to “Urine”, identified as a singular noun (n1) and indexed by the lemma “urine”. The verbal form “doeth” is regularised as “doth” and lemmatised as “do”. Its part of speech tag (vdz) designates “3rd singular present, ‘do'”.
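For readers who think in data structures, the five outputs can be pictured as a simple record. The Python sketch below is our own illustration, not MorphAdorner’s format: the class and function names are hypothetical, the xml_id strings are placeholders, and the spe values are assumed to match the input form. Only the reg, pos, and lemma values follow Figure 1.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class AdornedToken:
    xml_id: str  # unique identifier for the token
    spe: str     # 'actual spelling'
    reg: str     # regularised spelling
    pos: str     # part-of-speech tag (NUPOS)
    lemma: str   # dictionary headword


# Placeholder ids and assumed spe values; reg, pos, lemma as in Figure 1.
tokens = [
    AdornedToken("placeholder-001", "Vrine", "Urine", "n1", "urine"),
    AdornedToken("placeholder-002", "doeth", "doth", "vdz", "do"),
]


def index_by_lemma(adorned):
    """Group token ids under their lemma, as a lemma-based index would."""
    index = defaultdict(list)
    for token in adorned:
        index[token.lemma].append(token.xml_id)
    return dict(index)


print(index_by_lemma(tokens))
```

Keeping the id alongside each field is what allows a change to be traced back to its place in the source text.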

Where the word form has been broken up by the TCP processing—whether because of a line-break in the original text or some other markup reflecting the format of the printed page—MorphAdorner is able to calculate and tokenise the whole word. So at the start of the coffee pamphlet, the first letter is printed with a decorated initial (“B”); to interpret it as part of the word form “Bun” (the name of the coffee plant in Arabic) means joining these portions together as ‘parts’ of a single word that then becomes a candidate for regularisation and lemmatisation (see Figure 2).

MorphAdorner output for "BUN" (A37215)

Figure 2

(MorphAdorner’s solution here is not ideal because it does not recognise the word as ‘foreign’. The lack of any preceding determinative means the word is taken as a proper noun which minimises the disruption of this false positive.)

It is noteworthy that MorphAdorner resolves “it selfe” into “itselfe”, which it then regularises and lemmatises as “itself”. This is not something VARD can achieve. However, the problem of collocations that were written in separate parts but have become compounds in modern English is more profound. Both MorphAdorner and VARD process “straighte way” (A13300) as “straight way”, “anye thinge” (A04975) as “any thing”, and leave “in soemuch” (A13300) untouched; counting only these single instances results in half-a-dozen false positives and three false negatives.

There are other items that cause problems but the most striking outcome of the comparison so far is a set of failures in the regularisation process. Three examples are given below:

  • “sight” is regularised to “sighed” (8 occurrences in A13300)

MorphAdorner output for "sight": "sighed"


  • “hot” is regularised to “hight” (3 occurrences in A37215; 50 occurrences in A13300)

MorphAdorner output for "hot": regularised as "hight"


  • “an” is regularised to “and” (2 occurrences in A37215; 276 occurrences in A13300)

MorphAdorner output for "an": regularised as "and"


In each case the original spelling is an ordinary English word (something VARD would leave untouched) being used normally and the token is lemmatized correctly, but the regularisation entry represents an entirely separate English word. These problems came up in the small sections we sampled. That they reflect a problem with MorphAdorner rather than our application of it is evident from a search for “hot” and “hight” using the “regularized spellings” option via the Early Modern Print interface.

In effect, this indicates that MorphAdorner’s lemmatization may be significantly more reliable than the regularisation. It also means we may need to continue experimenting with VARD and MorphAdorner, and find automated ways—using comparisons—to look for other similarly ‘bugged’ terms.
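One way such a comparison might be automated is sketched below in Python. It is an assumption-laden illustration rather than a tested workflow: the function name is hypothetical, and the tiny word list is a stand-in for whatever modern lexicon (such as the one VARD consults) decides that a form should be left untouched.

```python
def suspicious_regularisations(pairs, modern_words):
    """Flag tokens whose original spelling is already a modern word
    but whose regularised form is a different word."""
    flagged = set()
    for spelling, regularised in pairs:
        original, changed = spelling.lower(), regularised.lower()
        if original in modern_words and original != changed:
            flagged.add((original, changed))
    return sorted(flagged)


# (original spelling, regularised spelling) pairs reported in this post.
pairs = [
    ("sight", "sighed"),
    ("hot", "hight"),
    ("an", "and"),
    ("Vrine", "Urine"),
    ("doeth", "doth"),
]

# Stand-in for a modern English word list (hypothetical and far too small).
modern_words = {"sight", "hot", "an", "urine", "doth"}

print(suspicious_regularisations(pairs, modern_words))
```

Legitimate regularisations such as “Vrine” to “Urine” pass unflagged because the original form is not in the modern list. Only cases where an ordinary word has been swapped for another are reported.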



1 It is possible to instruct VARD that a particular word or passage is from another language, and even to have it VARDed in that language (where a wordlist is available) but this requires human intervention at a close level.
2 A CQPweb query showed “doe” exceeding 6 instances per million words (ipmw) in 1575–1599, with “bee” at 88.21 ipmw in 1586—the latter disproportionately influenced by the 265 instances in Thomas Wilcox’s Right godly and learned exposition, vpon the whole booke of Psalms (≈ 704 ipmw).
3 In fact this already incorporates some replacement rules, so that “vrine” may be given the spe entry “urine” (cf. xml:id A13300_headed-018250); less productively, the rules also change “iuices” to “ivices” (A13300_headed-036790). Logically, creating such permutations increases the possibility that MorphAdorner will detect a direct match in its next step. It may be that the “spe” field is an ‘actual or alternative’ spelling field.

Texts sampled:

The extracts analysed comprised between 200 and 400 tokens from each of the following EEBO-TCP documents:

  • A04975: The pleasaunt playne and pythye pathewaye leadynge to a vertues and honest lyfe no lesse profytable, then delectable. V.L. [Valentine Leigh] London, ?1552. (STC 15113.5)
  • A13300: A rich store-house or treasury for the diseased Wherein, are many approued medicines for diuers and sundry diseases, which haue been long hidden, and not come to light before this time. Now set foorth for the great benefit and comfort of the poorer sort of people that are not of abilitie to go to the physitions. A.T. London, 1596. (STC 23606)
  • A14136: The obedie[n]ce of a Christen man and how Christe[n] rulers ought to governe…., William Tyndale, [Antwerp] 1528. (STC 24446)
  • A37215: The nature of the drink kauhi, or coffe, and the berry of which it is made described by an Arabian phisitian. English & Arabic. [Anṭākī, Dāʼūd ibn ʻUmar, d. 1599.] Oxford, 1659. (Wing D374)

Welcome to the Linguistic DNA blog!

Linguistic DNA cloud (created with Tagul)

The Linguistic DNA blog is a space for those working on the project to reflect on methodology, findings, and other aspects of the project in an informal way.

Fraser, Iona, and Seth (the research associates) will be taking it in turns to share what we have been working on. At present, the website is gradually taking shape thanks to the Sheffield team, while Fraser is hard at work drafting conference papers.

Our chain of Linguistic DNA (image above) has been generated from the initial text of our website, using Tagul.