
Errors, searchability, and experiments with Thomason’s Newsbooks

Back in 2012, HRI Digital ran a project, with the departments of English, History, and Sociological Studies, looking at participatory search design. The project took as its focus a subset of George Thomason’s 17th-century newsbooks, transcribing every issue of Mercurius Politicus plus the full selection of newsbooks published in 1649 (from the images available through ProQuest’s Early English Books Online). Building the interactive interface, the Newsbooks project focused on how researchers interact with (and want to interact with) such historical texts. Thus, for example, search results may feature texts published at the same point in time. A problem not resolved in the original phase was variant spellings, and the humanities investigators held onto concerns about (in)accuracy in the transcriptions.

The tools tried out for Linguistic DNA have provided a fresh mechanism to improve the Newsbooks’ searchability. Sheffield MA student Amy Jackson recently completed a 100-hour work placement investigating how a MorphAdorned version of the Newsbooks could inform questions about the accuracy of transcriptions, and how a statistically-organised representation of the language data (an early output of LDNA’s processor) affects understanding of the content and context of Thomason’s collection.

Amy reports:

My main task during my placement has been to find errors within the newsbooks, both printing and transcription errors, in order to improve the searchability of the newsbooks. I’ve been using methods such as checking hapax legomena (words that only occur once within a text or collection of texts) and Pointwise Mutual Information (PMI).

Note from the editors: PMI measures word associations by comparing observed cooccurrences with what might be expected in a random wordset (based on the same data).
—Expect a blog post on this soon!
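To make the note above concrete, here is a minimal sketch of a PMI calculation. It is not the LDNA processor: we assume each co-occurrence window is a simple list of words, use log base 2, and feed it invented toy windows. The function name pmi_scores is our own.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(windows):
    """Return {(w1, w2): PMI} for every word pair sharing a window.

    PMI = log2( p(w1, w2) / (p(w1) * p(w2)) ), with probabilities
    estimated as the share of windows containing the word(s).
    """
    n = len(windows)
    word_counts = Counter()
    pair_counts = Counter()
    for window in windows:
        words = sorted(set(window))  # count each word once per window
        word_counts.update(words)
        pair_counts.update(combinations(words, 2))
    scores = {}
    for (w1, w2), joint in pair_counts.items():
        p_joint = joint / n
        p_w1 = word_counts[w1] / n
        p_w2 = word_counts[w2] / n
        scores[(w1, w2)] = math.log2(p_joint / (p_w1 * p_w2))
    return scores

# Invented toy windows, for illustration only.
windows = [
    ["loyal", "crown"],
    ["loyal", "crown"],
    ["army", "march"],
    ["army", "crown"],
]
scores = pmi_scores(windows)
```

In this toy data the pair that occurs only once ("army", "march") receives the highest score, because neither word appears much elsewhere. That behaviour is one reason very high PMI values tend to surface rare or unusual items.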

The hurried composition of the newsbooks causes problems for searchability. It seems those printing the newsbooks were less concerned with accuracy than those who were printing books. This can be seen in several examples that I have found while searching through the hapax legomena. For example, on one occasion ‘transmitted’ is printed as ‘trasmitte4’ with a ‘4’ being used as a substitute for a missing ‘d’ (see image above). Elsewhere the number ‘8’ is used as a substitute for a capital ‘S’, printing ‘Sea’ as ‘8ea’. Such printing decisions present a specialised problem for searches because they are unusual. Knowing this characteristic (replacing letters with numbers) means one can look at modifying search rules to improve the ‘success’ in finding relevant information.
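A rough sketch of how such a search rule might look: collect the hapax legomena, then flag any that mix letters with digits and propose a repaired form. The two substitutions come from the examples above; the function names and the sample token list are hypothetical.

```python
import re
from collections import Counter

# Digit-for-letter swaps reported in the post; extend as more are found.
DIGIT_SWAPS = {"4": "d", "8": "S"}

def hapax_legomena(tokens):
    """Return the tokens that occur exactly once, in first-seen order."""
    counts = Counter(tokens)
    return [t for t in counts if counts[t] == 1]

def digit_letter_candidate(token):
    """Suggest a repaired form for a token mixing letters and digits."""
    if not (re.search(r"[A-Za-z]", token) and re.search(r"\d", token)):
        return None  # plain words and pure numbers are left alone
    return "".join(DIGIT_SWAPS.get(ch, ch) for ch in token)

# Hypothetical token stream for illustration.
tokens = ["the", "8ea", "was", "trasmitte4", "the", "1649", "was"]
suspects = {t: digit_letter_candidate(t) for t in hapax_legomena(tokens)}
```

Note that the repaired form of 'trasmitte4' is still 'trasmitted', not 'transmitted': the digit rule only undoes one kind of printing shortcut, and the result would still need ordinary spelling normalisation.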

High PMI values can also be used to find unusual words or word pairs that aren’t errors. While I was searching through the high PMI values I came across the word ‘King-chopper’ – used as an insult to refer to Colonel John ‘Tinker’ Fox who was falsely rumoured to be one of King Charles I’s executioners in 1649. The Man in the Moon, the newsbook in which the reference appears, was printed by John Crouch. Crouch was a Royalist journalist who was arrested and imprisoned for printing The Man in the Moon after the King’s death.

Mid-range PMI values are useful for understanding how language was used in the newsbooks. ‘Loyal’ often co-occurs with words such as ‘crown’, ‘royalist’, ‘sovereign’, ‘majesty’, ‘Charles’, ‘usurp’, and ‘treason’. This implies that the word ‘loyal’ is mainly being used by Royalist newsbooks in 1649 rather than Parliamentarian newsbooks. If I had more time I would look more closely at the differences in the language used by Royalist and Parliamentarian newsbooks.

PMI and hapax legomena have been useful for finding errors within the newsbooks but they have mainly provided an interesting way for me to interact with the texts. The PMI data often encouraged me to research the newsbooks and the people who printed them further and hapax legomena have provided useful insights into how the newsbooks were printed in 1649.

Illustrating the tools: first insights on VARD & MorphAdorner

In 2015, we compared two tools developed to address spelling variation in early modern English: VARD and MorphAdorner. This post documents some of that work, outlining how the design and intent of the two tools affects their impact.

The Sheffield RAs are hard at work on our audit of Early English Books Online, figuring out how best to clean up the TCP data for Linguistic DNA’s research goals. In the last post, Seth documented our intention to try out two of the tools that have been developed for analysing the data: VARD and MorphAdorner. What we wrote there was based on our reading. So what have we found in practice?

VARD: Spell-checking early modern English

Give VARD an early modern text and it will do two things: spell-check and modernise.

When a word matches a word form in current English, VARD will move on to the next word.

Standard historic verb forms such as “leaveth” and “believest” will be modernised (“leaves”, “believe”). Some archaisms survive this process, e.g. “hath”. The second person singular pronoun “thou” is also allowed to stand, though this can result in some ungrammatical combinations; e.g. a VARD-ed text yielding “thou believe”.

For non-standard or unrecognised spellings, VARD identifies and calculates the probabilities of a range of possible solutions. It implements the ‘best’ change automatically (at the user’s instigation) only if its confidence exceeds the chosen parameters. For our samples, we have found that setting the F-Score at 2 (a slight increase in recall) achieves optimal results with the default threshold for auto-normalisation (50%). For the 1659 pamphlet, The nature of Kauhi or Coffe, and the Berry of which it is made, Described by an Arabian Phisitian, this setting automatically resolves “Coffee” (56.89% confidence) and “Physician” (76%).

VARD’s interventions are not always so effective. Continuing with the coffee text (EEBO-TCP A37215), we observe that VARD also amends the Arabic month name Ab to Obe (50.12% confidence), a noun referring to villages in ancient Laconia. This weakness reflects the fact that VARD operates with lexicons from a specified language (in this case modern English) and measures the appropriateness of solutions only within that language.¹ Another problematic amendment in the sample is the substitution of “Room” for “Rewme” (actually a variant of the noun ‘rheum’).
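The effect of the auto-normalisation threshold can be pictured with a toy filter. This is not VARD's code or API: the function auto_normalise is our own illustration, the original forms are taken from the pamphlet title as quoted, and the confidence figures are the ones reported above.

```python
def auto_normalise(suggestions, threshold=50.0):
    """Split (original, candidate, confidence %) triples by a threshold."""
    applied, held = [], []
    for original, candidate, confidence in suggestions:
        target = applied if confidence > threshold else held
        target.append((original, candidate))
    return applied, held

# Confidence figures as reported above for EEBO-TCP A37215.
reported = [
    ("Coffe", "Coffee", 56.89),
    ("Phisitian", "Physician", 76.0),
    ("Ab", "Obe", 50.12),
]
applied_default, held_default = auto_normalise(reported)
applied_strict, held_strict = auto_normalise(reported, threshold=55.0)
```

With the default 50% threshold all three changes go through, including the unwanted "Obe". In this toy model a threshold of 55% would hold back "Obe" while keeping the two good changes, but in VARD itself the confidence scores depend on the other settings, so the trade-off would need testing on real output.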

In the four texts we have sampled, VARD introduces some important corrections. But it also introduces errors, resolving “Flix” (Flux) as “Flex”, “Pylate” (Pilate) as “Pilot”, and “othe” (oath) as “other”. Each such intervention creates one false positive and one false negative when the texts as a whole are counted and indexed.
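The bookkeeping behind that last sentence can be sketched in a few lines. The three cases are those reported above; the function index_errors and its lower-casing of headwords are our own simplifying assumptions.

```python
from collections import Counter

# (form in text, intended word, VARD output) as reported above.
cases = [
    ("Flix", "flux", "flex"),
    ("Pylate", "Pilate", "Pilot"),
    ("othe", "oath", "other"),
]

def index_errors(cases):
    """Count false positives and false negatives per index headword."""
    false_pos, false_neg = Counter(), Counter()
    for _form, intended, output in cases:
        if output.lower() != intended.lower():
            false_pos[output.lower()] += 1    # hit under the wrong word
            false_neg[intended.lower()] += 1  # miss under the right word
    return false_pos, false_neg

false_pos, false_neg = index_errors(cases)
```

Each mis-normalised token is counted twice: once as a spurious hit under the word VARD chose, and once as a missing hit under the word the printer intended.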

The false positive/negative dilemma also presents when an early modern word form matches the standard form of a different word in modern English. The verbs “do” and “be” appear frequently in EEBO-TCP with a superfluous terminal “e”. To the uninitiated, an index of VARDed EEBO-TCP might suggest heightened interest in deer-hunting and honey-production in the latter quarter of the sixteenth century.²

MorphAdorner

MorphAdorner is set up for more extensive linguistic processing than VARD. From the basic input (the EEBO-TCP form), it tags or adorns that input with a minimum of five outputs.

A unique xml:id
An ‘actual spelling’ (spe)³
A regularised spelling (reg)
A part-of-speech tag (using the NUPOS tag set)
A lemma

Initially, MorphAdorner itemises the XML input creating a reference system that allows changes to be mapped and revisited. This is reflected in the xml:id.

The pros and cons of such output and the actual performance of MorphAdorner under observation are better understood when illustrated with some examples.

Output from MorphAdorner for "Vrine doeth"

Figure 1

As shown in Figure 1, in a medical text (A13300) “Vrine” is regularised to “Urine”, identified as a singular noun (n1) and indexed by the lemma “urine”. The verbal form “doeth” is regularised as “doth” and lemmatised as “do”. Its part of speech tag (vdz) designates “3rd singular present, ‘do’”.
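For readers who think in data structures, the five outputs can be pictured as one record per token. This is only a sketch of how we might hold the data once parsed, not MorphAdorner's own format: the class name AdornedToken is ours, and the xml_id and spe values below are placeholders rather than values copied from real output. The reg, pos and lemma values follow Figure 1 as described.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AdornedToken:
    xml_id: str  # unique identifier assigned by MorphAdorner
    spe: str     # 'actual spelling'
    reg: str     # regularised spelling
    pos: str     # NUPOS part-of-speech tag
    lemma: str

# xml_id and spe are placeholders; reg, pos and lemma follow Figure 1.
tokens = [
    AdornedToken("PLACEHOLDER-1", "Vrine", "Urine", "n1", "urine"),
    AdornedToken("PLACEHOLDER-2", "doeth", "doth", "vdz", "do"),
]

def by_lemma(tokens):
    """Group tokens under their lemma, as an index would."""
    index = {}
    for tok in tokens:
        index.setdefault(tok.lemma, []).append(tok)
    return index

lemma_index = by_lemma(tokens)
```

Keeping spelling, regularisation and lemma side by side is what lets us later compare the fields against each other.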

Where the word form has been broken up by the TCP processing—whether because of a line-break in the original text or some other markup reflecting the format of the printed page—MorphAdorner is able to calculate and tokenise the whole word. So at the start of the coffee pamphlet, the first letter is printed with a decorated initial (“B”); to interpret it as part of the word form “Bun” (the name of the coffee plant in Arabic) means joining these portions together as ‘parts’ of a single word that then becomes a candidate for regularisation and lemmatisation (see Figure 2).

Figure 2

(MorphAdorner’s solution here is not ideal because it does not recognise the word as ‘foreign’. The lack of any preceding determinative means the word is taken as a proper noun which minimises the disruption of this false positive.)

It is noteworthy that MorphAdorner resolves “it selfe” into “itselfe”, which it then regularises and lemmatises as “itself”. This is not something VARD can achieve. However, the problem of collocations that were written in separate parts but have become compounds in modern English is more profound. Both MorphAdorner and VARD process “straighte way” (A13300) as “straight way”, “anye thinge” (A04975) as “any thing”, and leave “in soemuch” (A13300) untouched; counting only these single instances results in half-a-dozen false positives and three false negatives.

There are other items that cause problems but the most striking outcome of the comparison so far is a set of failures in the regularisation process. Three examples are given below:

“sight” is regularised to “sighed” (8 occurrences in A13300)

“hot” is regularised to “hight” (3 occurrences in A37215; 50 occurrences in A13300)

“an” is regularised to “and” (2 occurrences in A37215; 276 occurrences in A13300)

In each case the original spelling is an ordinary English word (something VARD would leave untouched) being used normally and the token is lemmatized correctly, but the regularisation entry represents an entirely separate English word. These problems came up in the small sections we sampled. That they reflect a problem with MorphAdorner rather than our application of it is evident from a search for “hot” and “hight” using the “regularized spellings” option via the Early Modern Print interface.

In effect, this indicates that MorphAdorner’s lemmatization may be significantly more reliable than the regularisation. It also means we may need to continue experimenting with VARD and MorphAdorner, and find automated ways—using comparisons—to look for other similarly ‘bugged’ terms.
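One possible shape for such an automated comparison is sketched below: flag every case where the original spelling is already a modern word but the regularised form is a different word. Everything here is hypothetical, including the function name, the tiny lexicon and the hand-made pairs; in practice the lexicon might be a modern word list of the kind VARD relies on.

```python
from collections import Counter

def suspicious_regularisations(pairs, modern_lexicon):
    """Count (spelling, reg) pairs where a modern word was changed
    into a different word."""
    flagged = Counter()
    for spelling, reg in pairs:
        s, r = spelling.lower(), reg.lower()
        if s in modern_lexicon and s != r:
            flagged[(s, r)] += 1
    return flagged

# Hypothetical inputs: a tiny lexicon and hand-made (spelling, reg) pairs.
modern_lexicon = {"sight", "hot", "an", "urine", "and"}
pairs = [("hot", "hight"), ("an", "and"), ("an", "and"),
         ("Vrine", "Urine"), ("and", "and")]
flagged = suspicious_regularisations(pairs, modern_lexicon)
```

Sorting the flagged pairs by frequency would put the most damaging cases at the top of the list for human review. Legitimate regularisations such as “Vrine” to “Urine” are ignored because the original form is not in the modern lexicon.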

Notes

¹ It is possible to instruct VARD that a particular word or passage is from another language, and even to have it VARDed in that language (where a wordlist is available) but this requires human intervention at a close level.
² A CQPweb query showed “doe” exceeding 6 instances per million words (ipmw) in 1575–1599, with “bee” at 88.21 ipmw in 1586, the latter disproportionately influenced by the 265 instances in Thomas Wilcox’s Right godly and learned exposition, vpon the whole booke of Psalms (≈ 704 ipmw).
³ In fact this already incorporates some replacement rules, so that “vrine” may be given the spe entry “urine” (cf. xml:id A13300_headed-018250); less productively, the rules also change “iuices” to “ivices” (A13300_headed-036790). Logically, creating such permutations increases the possibility that MorphAdorner will detect a direct match in its next step. It may be that the “spe” field is an ‘actual or alternative’ spelling field.

Texts sampled:

The extracts analysed comprised between 200 and 400 tokens from each of the following EEBO-TCP documents:

A04975: The pleasaunt playne and pythye pathewaye leadynge to a vertues and honest lyfe no lesse profytable, then delectable. V.L. [Valentine Leigh] London, ?1552. (STC 15113.5)
A13300: A rich store-house or treasury for the diseased Wherein, are many approued medicines for diuers and sundry diseases, which haue been long hidden, and not come to light before this time. Now set foorth for the great benefit and comfort of the poorer sort of people that are not of abilitie to go to the physitions. A.T. London, 1596. (STC 23606)
A14136: The obedie[n]ce of a Christen man and how Christe[n] rulers ought to governe…., William Tyndale, [Antwerp] 1528. (STC 24446)
A37215: The nature of the drink kauhi, or coffe, and the berry of which it is made described by an Arabian phisitian. English & Arabic. [Anṭākī, Dāʼūd ibn ʻUmar, d. 1599.] Oxford, 1659. (Wing D374)

EEBO-TCP and standard spelling

This post from 2015 outlines the challenge posed by non-standard spelling in early modern English with particular attention to Early English Books Online. It introduces two tools developed by others in order to assist searching and other language-based research: VARD and MorphAdorner.

The Linguistic DNA project relies on two very large linguistic data sources for evidence of semantic and conceptual change from c.1500 to c.1800: the Early English Books Online Text Creation Partnership dataset (EEBO-TCP), and Gale Cengage’s Eighteenth Century Collections Online (ECCO).* The team has begun by digging into EEBO-TCP, assessing the data (and its dirtiness), and planning how to process it with all of its imperfections.

Early Modern English orthography is far from standardised, though standardisation increases considerably towards the end of the period in question. One of the goals of the EEBO-TCP project is to faithfully represent Early Modern English artefacts in digital form as both image scans and keyed texts. For Early Modernists, working with orthographic variation in such faithful transcriptions is no surprise. However, many users of EEBO-TCP, particularly public-facing users such as librarians, noted from the beginning that an average searcher might have difficulty with the number of false negatives returned by a search query—i.e. the number of instances of a word that the search interface fails to find due to their non-standard forms.

The orthographic standardisation that is part of a day’s work for Early Modernists is no small feat for computers. On the other hand, counting very large numbers of data points in very large data sets, and doing so very quickly, is exactly what computers are good at. Computers just need to be given clear and complete instructions on what to count (instructions provided by programmers with some help from Early Modernists).

ProQuest addressed the issue of spelling variation in their Chadwyck EEBO-TCP web interface with VosPos (Virtual Orthographic Standardisation and Part Of Speech). VosPos was developed at Northwestern University, based on research by Prof. Martin Mueller and the staff of the Academic Technologies group. Among other things, VosPos identifies a part of speech and lemma for each textual word, and matches each textual word to a standard spelling. Users searching EEBO-TCP for any given word using a standard spelling can thus retrieve all instances of non-standard spellings and standard or non-standard inflectional forms as well.

Querying EEBO-TCP for ‘Linguistic DNA’

Our project aims to analyse the lexis in the entire EEBO dataset in various ways, all of which depend on our ability to identify a word in all of its various spellings and inflections. While the VosPos web interface is extremely useful for online lexical searches, it’s not the tool for the task we’ve set ourselves. So, we began by sorting through a sample of EEBO-TCP XML files, cataloguing some of the known, recurring issues in both spelling and transcription in the dataset—not just the Early Modern substitutability of v for u, for example, but also EEBO-TCP transcription practices such as using the vertical line character (|) to represent line breaks within a word. We quickly came to two conclusions: First, we weren’t going to build a system for automatically standardising the variety of orthographic and transcription practices in the EEBO data. Because second, someone else had already built such a system. Two someones, in fact, and two systems: VARD and MorphAdorner.

VARD (VARiant Detector)

VARD aims to standardise spelling in order to facilitate additional annotation by other means (such as Part-of-Speech (POS) tagging or semantic tagging). It uses spell-checking technology and allows for manual or automatic replacement of non-standard spellings with standard ones. VARD 1 was built on a word bank of over 40,000 known spelling variants for Early Modern English words. VARD 2 adds further features: a lexicon composed of words that occur at least 50 times in the British National Corpus, and a lexicon composed of the Spell Checking Oriented Word List. VARD 2 also includes a rule bank of known Early Modern English letter substitutions, and a phonetic matching system based on Soundex. VARD identifies non-standard spellings and then suggests a standard spelling via the following steps: identifying known variants from the word bank; identifying possible letter replacements from the rule bank; identifying phonetically similar words via the phonetic matching algorithm; and, finally, calculating a normalised Levenshtein distance for the smallest number of letters that can be changed for the textual word to become the standard spelling. VARD learns which method is most effective over time for a given text or set of texts, and additional parameters (such as weighting for recall and precision, respectively) can be manually adjusted. VARD has already been incorporated into the SAMUELS semantic tagger by Alistair Baron at Lancaster University alongside our own team members at Glasgow University, Marc Alexander and Fraser Dallachy.
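The last of those steps, edit distance, is easy to show in miniature. The sketch below is a textbook Levenshtein implementation, not VARD's code, and the normalisation (dividing by the length of the longer string) is our assumption about one reasonable way to scale the result.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions or
    substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def normalised_distance(a, b):
    """Edit distance scaled to 0..1 by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

For example, levenshtein("coffe", "coffee") is 1 (one inserted letter) and levenshtein("phisitian", "physician") is 2 (two substituted letters). Normalising by length stops long words from being penalised simply for having more letters to get wrong.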

MorphAdorner

MorphAdorner, like VosPos, was developed at Northwestern University by a team including Martin Mueller and Philip Burns. MorphAdorner 2.0 was designed to provide light but significant annotation for EEBO-TCP in particular, expanded to an array of other digital texts, towards what Mueller calls a ‘book of English’, a highly searchable corpus covering the full history of the language, from which a variety of information could be extracted efficiently. To that end, MorphAdorner includes two tools for POS-tagging (a trigram tagger and a rule-based tagger), and incorporates POS data in its spelling standardisation. Word banks for spelling standardisation are drawn from the OED and Webster’s data, as well as from EEBO-TCP training data, adding up to several hundred thousand variant forms. Those word banks are supplemented by a rule bank in identifying appropriate alternates. MorphAdorner recommends a standard spelling via the following steps: applying all rules from the rule bank to determine if any result in a standard spelling or a spelling that matches a known variant in the rule bank; calculating edit distance for the resulting spellings for the smallest number of letters that can be changed to turn the textual word into the known variant or standard spelling; calculating a weighted string similarity between the original word and the known variants or standards, based on letter pair similarity, phonetic distance, and edit distance; identifying the POS of the original word and limiting the possible variants by POS; selecting the found spelling with the highest similarity.
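To give a feel for the similarity step, here is a sketch of one common definition of letter pair similarity: the Dice coefficient over adjacent character pairs. Whether MorphAdorner computes it exactly this way is an assumption on our part, and the sketch leaves out the phonetic and edit distance components and the POS filter. The function names and candidate list are hypothetical.

```python
from collections import Counter

def letter_pairs(word):
    """Multiset of adjacent character pairs, e.g. 'doe' -> do, oe."""
    w = word.lower()
    return Counter(w[i:i + 2] for i in range(len(w) - 1))

def letter_pair_similarity(a, b):
    """Dice coefficient over letter pairs: 2 * shared / total pairs."""
    pairs_a, pairs_b = letter_pairs(a), letter_pairs(b)
    total = sum(pairs_a.values()) + sum(pairs_b.values())
    if total == 0:
        return 1.0 if a.lower() == b.lower() else 0.0
    shared = sum((pairs_a & pairs_b).values())
    return 2 * shared / total

def best_candidate(word, candidates):
    """Pick the candidate with the highest similarity to word."""
    return max(candidates, key=lambda c: letter_pair_similarity(word, c))
```

With the hypothetical candidates ["self", "shelf", "sell"], best_candidate("selfe", ...) picks "self", which shares three of its letter pairs with the original form.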

Some of the transcription issues in the EEBO-TCP data are solved within the MorphAdorner pipeline before the spelling standardisation process begins, partly by using Abbot, another system designed at Northwestern University (by Mueller and Burns), which converts dissimilar XML files into a common form. Abbot can therefore be used to automatically convert some of the EEBO-TCP XML transcription norms into a form that is more easily readable by MorphAdorner. Logically, this clean-up should improve things for VARD too.

So, what’s the best tool for our job?

There was considerable discussion of both VARD and MorphAdorner at last month’s Early Modern Digital Agendas institute at the Folger Institute in Washington, DC. On Twitter, @EMDigital reported that each was built with its own set of assumptions; that Folger’s Early Modern Manuscripts Online is now considering which one to use; and that the Visualising English Print project used VARD for standardisation but may switch to MorphAdorner in the future. Each tool has already been used in a variety of ways, some quite unexpected: VARD has been used to orthographically standardise classical Portuguese, and MorphAdorner has been used to standardise variation in contemporary medical vocabulary.

What will work best for us? Given the absence of documented comparisons for the two tools, we’ve realised we need to investigate what we can do with each.

The team is now working through the following stages:

Pre-process a sample of EEBO-TCP transcriptions so that words are more readily identifiable for further processing. (This should strike out those vertical lines.)
Take the pre-processed sample and process it using VARD and MorphAdorner, respectively. This will require optimising parameters in VARD (f-score balances and confidence threshold).
Assess the resulting two annotated samples (the first ‘VARDed’ and the second ‘MorphAdorned’) in order to identify the strengths of each tool, and what benefits each might provide for the project.
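The first stage can be as simple as a single substitution. The sketch below assumes the vertical line only needs removing when it sits directly between two word characters; the function name and the sample string are invented, and real TCP files would need the XML handled properly rather than as plain text.

```python
import re

def join_broken_words(text):
    """Remove '|' when it sits between two word characters."""
    return re.sub(r"(?<=\w)\|(?=\w)", "", text)

# Invented sample string for illustration.
sample = "the dis|eased are com|forted | and healed"
cleaned = join_broken_words(sample)
```

A vertical line standing on its own between spaces is deliberately left alone here, since we cannot tell from the character itself whether it marks a broken word.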

We anticipate presenting the results of this process at the project’s methodological workshop at the University of Sussex in September, and will post updates on the blog as well.

Further Reading:

Basu, Anupam. 2014. Morphadorner v2.0: From access to analysis. Spenser Review 44.1.8. http://www.english.cam.ac.uk/spenseronline/review/volume-44/441/digital-projects/morphadorner-v20-from-access-to-analysis. Accessed July, 2015.

Gadd, Ian. 2009. The Use and Misuse of Early English Books Online. Literature Compass 6 (3). 680-92.

Humanities Digital Workshop at Washington University in St. Louis. [nd]. Early Modern Print: Text Mining Early Printed English. http://earlyprint.wustl.edu/. Accessed July, 2015.

Mueller, Martin. 2015. Scalable Reading. [Blog]. https://scalablereading.northwestern.edu/. Accessed July, 2015.

* Initially, we planned to include the full body of Gale Cengage’s Eighteenth Century Collections Online (ECCO) in our analysis. A later post explains why most of ECCO was not reliable for our purposes. Our interface incorporates the small portion of ECCO that has been transcribed through the Text Creation Partnership (ECCO-TCP).