Tag Archives: Work-in-progress

Word-cloud for this blog post (generated with Wordle)

Liest thou, or hast a Rewme? Getting the best from VARD and EEBO

This post from August 2015 continues the comparison of VARD and MorphAdorner, tools for tackling spelling variation in early modern English. (See earlier posts here and here.) As of 2018, data on our public interface was prepared with an updated version of MorphAdorner and some additional curation from Martin Mueller at Northwestern.


This week, we’ve replaced the default VARD set-up with a version designed to optimise the tool for EEBO. In essence, this includes a lengthier set of rules to guide the changing of letters, and lists of words and variants that are more suited to the early modern corpus.

It is important to bear in mind that the best use of VARD involves someone ‘training’ it, supervising and to a large extent determining the correct substitutions. But because Linguistic DNA is tackling the whole EEBO-TCP corpus, and the mass of documents within it is far from homogeneous, it is difficult to optimise VARD effectively.

Doth VARD recognise the second-person singular?

Our first step with the EEBO set-up was to revisit our understanding of how VARD handles verb conjugations for the second and third person singular. A custom text was written to test the output (using the 50% threshold for auto-normalisation as previously):

If he lieth thou liest. When she believeth thou leavest. 
If thou believest not, he leaveth. Where hast thou been? 
When hadst thou gone? Where hath he walked? 
Where goest thou? Where goeth he?
What doth he? What doeth he? What dost thou? 
What doest thou? What ist? What arte doing?

Most of the forms were modernised just as described in the previous post. However, some of the output gave cause for concern. In the first sentence, “liest” became “least”. Further on “goest” became “goosed”, “doest” was accepted as a non-variant, while both “hast” and “dost” were highlighted as unresolved variants. This output can be explained by looking at the variant and word lists and the statistical measures VARD uses.

VARD’s use of variants and word frequencies

Scrutinising the word and variant lists within the EEBO set-up showed that although the variant list recorded “doest” as an instance of “dost”, “doest” and not “dost” appeared in the word list, overriding that variant. Similarly, “ha’st” appears in the variant list as a form of “hast”, but “hast” is not in the word list. It is not difficult to add items to the word list, but the discrepancies in the list contents are surprising. In fact, it might be more appropriate for VARD to record “doest” as a variant of “do”, and “ha’st” of “have”.

For “liest”, the correct variant and word entries are present so that “liest” can be amended to “lie”, giving a known variant [KV] recall score of 100% (indicating this is not a known variant form of any other word). However, the default parameters (regardless of the F-score) favour “least” because that amendment strongly satisfies the other three criteria: letter replacement [LR] (the rules), phonetic matching [PM], and edit distance [ED]. Until human judgment intervenes with the weighting, “least” has the better statistical case. (Much the same applies to “goest” and “goosed”.)
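The tension between the four criteria can be illustrated with a toy scoring function. This is only a sketch: it assumes the method scores are combined as a weighted average, and both the weights and the per-candidate scores are invented for illustration. None of this is VARD's actual formula or data.

```python
# Toy illustration only: the weights and scores are invented, and the
# weighted average is an assumed stand-in for VARD's real calculation.
METHODS = ("KV", "LR", "PM", "ED")


def confidence(scores, weights):
    """Weighted average of per-method scores (each between 0 and 1)."""
    total = sum(weights[m] for m in METHODS)
    return sum(scores[m] * weights[m] for m in METHODS) / total


def best_candidate(candidates, weights):
    """Return the candidate with the highest combined confidence."""
    return max(candidates, key=lambda c: confidence(candidates[c], weights))


# Hypothetical scores for the variant "liest".
candidates = {
    "lie": {"KV": 1.0, "LR": 0.0, "PM": 0.2, "ED": 0.4},
    "least": {"KV": 0.0, "LR": 0.8, "PM": 0.9, "ED": 0.8},
}

# Before training: every method counts equally.
equal_weights = {"KV": 1.0, "LR": 1.0, "PM": 1.0, "ED": 1.0}
# After a human confirms a known-variant match, KV is weighted more heavily.
trained_weights = {"KV": 4.0, "LR": 1.0, "PM": 1.0, "ED": 1.0}

print(best_candidate(candidates, equal_weights))
print(best_candidate(candidates, trained_weights))
```

With equal weights the three non-KV methods outvote the known-variant entry; once the hypothetical KV weight is raised, the known variant wins.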

In VARD’s defence, one need only intervene with any of the “-st” verb endings in the text once (before triggering the auto-normalisation process) for the weighting to shift in favour of “liest”. VARD learns well.

Rewme: space, cold, or dominion?

One of the ‘authentic’ EEBO extracts we’ve been testing with is taken from a medical text, A rich store-house or treasury for the diseased, 1596 (TCP A13300). As mentioned in a previous post, employing VARD’s automated normalisation with the default 50% threshold, references to “Rewme” become “Room”. Looking again at what is happening beneath the surface, the first surprise is that there is an entry for “rewme” in the variant list, specifying it as a known variant of “room”.

This is unsatisfying with regard to EEBO-TCP: a search of the corpus shows that the word form “rewme” appears in 89 texts. Viewing each instance through Anupam Basu’s keyword-in-context interface shows that in 84 texts, “rewme” is used with the meaning “rheum”. Of the other five texts, one is Middle English biblical exegesis (attributed to John Purvey); committed to print as late as 1550, the text repeatedly uses “rewme” with the sense “realm” or “kingdom” (both earthly and divine). The remaining four were printed after 1650 and are either political or historical in intent, similarly using “rewme” as a spelling of “realm”. Nowhere in EEBO-TCP does “rewme” appear with the sense “room”.

However, removing it from the known variants (by setting its value to zero) and adding new variant entries for realm and rheum does not result in the desired auto-normalisation: the fact that both realm and rheum are candidates means their KV recall score is halved (50%). At the same time, the preset frequencies strengthen room’s position (309) compared with realm (80) and rheum (50). In fact, the word list accompanying the EEBO set-up seems still to be based on the BNC, featuring robotic (OED 1928) and pulsar (OED 1968) with the same preset frequency as rheum.
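The halving of the KV recall score can be pictured with a small sketch. It assumes that a variant's recall is shared equally among all the words it is listed under, which matches the figures reported here (100% for one target, 50% for two) but is not taken from VARD's documentation. The function name and the variant list are hypothetical; the frequencies are the preset values quoted above.

```python
# Assumed model: KV recall is split equally among every word a variant
# is listed under. This mirrors the figures in the post but is not
# VARD's documented formula.
def kv_recall(variant, variant_list):
    """Return {target word: recall share} for a known variant."""
    targets = variant_list.get(variant, set())
    return {word: 1 / len(targets) for word in targets}


# Hypothetical variant list after the manual edits described above.
variant_list = {
    "liest": {"lie"},
    "rewme": {"realm", "rheum"},
}

# Preset word-list frequencies quoted in the post.
frequencies = {"room": 309, "realm": 80, "rheum": 50}

print(kv_recall("liest", variant_list))
print(kv_recall("rewme", variant_list))
print(max(frequencies, key=frequencies.get))
```

On this reading, neither "realm" nor "rheum" can reach the recall that a single-target variant enjoys, while the frequency list continues to favour "room".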

So what does this mean for Linguistic DNA?

Again, it is possible to intervene with instances like rewme, whether through the training interface or by manipulating the frequencies. But it is evident that the scale of intervention required is considerable, and it is not obvious that telling VARD that rewme is rheum about 90% of the time that it occurs in EEBO-TCP and realm 10% of the time will have any impact in helping the auto-normalisation process to know when and where to distribute the forms in practice.

The frustrating thing is that the distribution is predictable: in a political text, it is normally “realm”; and in a medical text, it is “rheum”. But VARD seems to have no mechanism to recognise or respond to the contexts of discourse that would so quickly become visible with topic modelling. (Consider the clustering of the four humours in early modern medicine, for example.) I have a feeling this would be where SAMUELS and the Historical Thesaurus come in… if only SAMUELS didn’t rely on VARD’s prior intervention!
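To make the point concrete, here is a minimal sketch of what a context-sensitive step might look like. It is not a VARD feature: the cue-word lists are hand-written and hypothetical, the test sentences are invented rather than quoted from EEBO-TCP, and a real system would derive its cues from topic modelling or a thesaurus.

```python
import re

# Hypothetical cue words for each sense; not drawn from any real resource.
CUES = {
    "rheum": {"humour", "humours", "phlegm", "cough", "medicine", "stomach"},
    "realm": {"king", "kingdom", "crown", "parliament", "subjects", "prince"},
}


def resolve_rewme(context, default="rheum"):
    """Pick a normalisation for "rewme" from the words around it."""
    words = set(re.findall(r"[a-z]+", context.lower()))
    scores = {sense: len(words & cues) for sense, cues in CUES.items()}
    best = max(scores, key=scores.get)
    # Fall back to the default when no cue word is found or senses tie.
    tied = list(scores.values()).count(scores[best]) > 1
    if scores[best] == 0 or tied:
        return default
    return best


# Invented example contexts.
print(resolve_rewme("A medicine for the rewme and the cough"))
print(resolve_rewme("The king and parliament of this rewme"))
```

The default of "rheum" reflects the corpus search above, where 84 of the 89 texts use the form in that sense.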


Wordcloud image created with Wordle.

 


Illustrating the tools: first insights on VARD & MorphAdorner

In 2015, we compared two tools developed to address spelling variation in early modern English: VARD and MorphAdorner. This post documents some of that work, outlining how the design and intent of the two tools affects their impact.


The Sheffield RAs are hard at work on our audit of Early English Books Online, figuring out how best to clean up the TCP data for Linguistic DNA’s research goals. In the last post, Seth documented our intention to try out two of the tools that have evolved as a method of analysing the data: VARD and MorphAdorner. What we wrote there was based on our reading. So what have we found in practice?

VARD: Spell-checking early modern English

Give VARD an early modern text and it will do two things: spell-check and modernise.

When a word matches a word form in current English, VARD will move on to the next word.

Standard historic verb forms such as “leaveth” and “believest” will be modernised (“leaves”, “believe”). Some archaisms survive this process, e.g. “hath”. The second person singular pronoun “thou” is also allowed to stand, though this can result in some ungrammatical combinations; e.g. a VARD-ed text yielding “thou believe”.

For non-standard or unrecognised spellings, VARD identifies and calculates the probabilities of a range of possible solutions. It implements the ‘best’ change automatically (at the user’s instigation) only if its confidence exceeds the chosen parameters. For our samples, we have found that setting the F-Score at 2 (a slight increase in recall) achieves optimal results with the default threshold for auto-normalisation (50%). For the 1659 pamphlet, The nature of Kauhi or Coffe, and the Berry of which it is made, Described by an Arabian Phisitian, this setting automatically resolves “Coffee” (56.89% confidence) and “Physician” (76%).

VARD’s interventions are not always so effective. Continuing with the coffee text (EEBO-TCP A37215) we observe that VARD also amends the Arabic month name Ab to Obe (50.12%), a noun referring to villages in ancient Laconia. This weakness reflects the fact that VARD operates with lexicons from a specified language (in this case modern English) and measures the appropriateness of solutions only within that language.1 Another problematic amendment in the sample is the substitution of “Room” for “Rewme” (actually a variant of the noun ‘rheum’).
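The thresholding behaviour described above can be sketched in a few lines. The function is hypothetical rather than part of VARD; whether VARD treats the threshold as strict or inclusive is an assumption here, and the input spellings “Coffe” and “Phisitian” are assumed from the pamphlet's title. The confidence figures are those reported in this post.

```python
def auto_normalise(token, candidates, threshold=50.0):
    """Return the top candidate if its confidence beats the threshold.

    `candidates` maps each suggested replacement to a confidence
    percentage; the token is left untouched otherwise.
    """
    if not candidates:
        return token
    best, score = max(candidates.items(), key=lambda item: item[1])
    return best if score > threshold else token


# Top suggestion and confidence for three tokens from A37215.
suggestions = {
    "Coffe": {"Coffee": 56.89},
    "Phisitian": {"Physician": 76.0},
    "Ab": {"Obe": 50.12},
}

for token, candidates in suggestions.items():
    print(token, "->", auto_normalise(token, candidates))
```

Raising the threshold to 55 in this sketch would leave “Ab” alone while still resolving the other two, though any such setting trades missed corrections against unwanted ones.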

In the four texts we have sampled, VARD introduces some important corrections. But it also introduces errors, resolving “Flix” (Flux) as “Flex”, “Pylate” (Pilate) as “Pilot”, and “othe” (oath) as “other”. Each such intervention creates one false positive and one false negative when the texts as a whole are counted and indexed.
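The double cost of each mistaken substitution can be shown by comparing two word counts. This is a minimal sketch with a hypothetical helper function; the word lists contain only the three errors named above plus one correctly handled word.

```python
from collections import Counter


def index_errors(intended, normalised):
    """Compare index counts for intended and normalised word forms."""
    want, got = Counter(intended), Counter(normalised)
    false_positives = got - want   # forms indexed that should not be there
    false_negatives = want - got   # forms that are missing from the index
    return false_positives, false_negatives


intended = ["flux", "pilate", "oath", "the"]
normalised = ["flex", "pilot", "other", "the"]

fp, fn = index_errors(intended, normalised)
print(sorted(fp))
print(sorted(fn))
```

Three wrong substitutions yield three spurious index entries and three missing ones, so the error count in the index is twice the number of mistakes.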

The false positive/negative dilemma also presents when an early modern word form matches the standard form of a different word in modern English. The verbs “do” and “be” appear frequently in EEBO-TCP with a superfluous terminal “e”. To the uninitiated, an index of VARDed EEBO-TCP might suggest heightened interest in deer-hunting and honey-production in the latter quarter of the sixteenth century.2

MorphAdorner

MorphAdorner is set up for more extensive linguistic processing than VARD. From the basic input (the EEBO-TCP form), it tags or adorns that input with a minimum of five outputs.

  • A unique xml:id
  • An ‘actual spelling’ (spe)3
  • A regularised spelling (reg)
  • A part-of-speech tag (using the NUPOS tag set)
  • A lemma

Initially, MorphAdorner itemises the XML input creating a reference system that allows changes to be mapped and revisited. This is reflected in the xml:id.

The pros and cons of such output and the actual performance of MorphAdorner under observation are better understood when illustrated with some examples.

Output from MorphAdorner for "Vrine doeth"

Figure 1

As shown in Figure 1, in a medical text (A13300) “Vrine” is regularised to “Urine”, identified as a singular noun (n1) and indexed by the lemma “urine”. The verbal form “doeth” is regularised as “doth” and lemmatised as “do”. Its part of speech tag (vdz) designates “3rd singular present, ‘do'”.
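For readers who think in data structures, the adorned output for these two tokens might be modelled as below. This is a sketch, not MorphAdorner's own format: the class, field names, and identifiers are hypothetical, and the `spe` values are left empty because they are not quoted in the text. The `reg`, `pos`, and `lemma` values are the ones described for Figure 1.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdornedToken:
    xml_id: str          # unique reference for the token (placeholder here)
    tok: str             # word form as it appears in EEBO-TCP
    spe: Optional[str]   # 'actual spelling' field (not quoted in the post)
    reg: str             # regularised spelling
    pos: str             # NUPOS part-of-speech tag
    lemma: str           # dictionary headword


tokens = [
    AdornedToken("example-id-1", "Vrine", None, "Urine", "n1", "urine"),
    AdornedToken("example-id-2", "doeth", None, "doth", "vdz", "do"),
]


def index_by_lemma(tokens):
    """Group token identifiers under their lemma."""
    index = defaultdict(list)
    for token in tokens:
        index[token.lemma].append(token.xml_id)
    return dict(index)


print(index_by_lemma(tokens))
```

Keeping the identifier alongside every layer is what allows a change at one level (say, a corrected regularisation) to be traced back to the original token.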

Where the word form has been broken up by the TCP processing—whether because of a line-break in the original text or some other markup reflecting the format of the printed page—MorphAdorner is able to calculate and tokenise the whole word. So at the start of the coffee pamphlet, the first letter is printed with a decorated initial (“B”); to interpret it as part of the word form “Bun” (the name of the coffee plant in Arabic) means joining these portions together as ‘parts’ of a single word that then becomes a candidate for regularisation and lemmatisation (see Figure 2).

MorphAdorner output for "BUN" (A37215)

Figure 2

(MorphAdorner’s solution here is not ideal because it does not recognise the word as ‘foreign’. The lack of any preceding determinative means the word is taken as a proper noun which minimises the disruption of this false positive.)

It is noteworthy that MorphAdorner resolves “it selfe” into “itselfe” and thus regularised and lemmatised “itself”. This is not something VARD can achieve. However, the problem of collocations that were written in separate parts but have become compounds in modern English is more profound. Both MorphAdorner and VARD process “straighte way” (A13300) as “straight way”, “anye thinge” (A04975) as “any thing”, and leave “in soemuch” (A13300) untouched; counting only these single instances results in half-a-dozen false positives and three false negatives.

There are other items that cause problems but the most striking outcome of the comparison so far is a set of failures in the regularisation process. Three examples are given below:

  • “sight” is regularised to “sighed” (8 occurrences in A13300)

MorphAdorner output for "sight": "sighed"

 

  • “hot” is regularised to “hight” (3 occurrences in A37215; 50 occurrences in A13300)

MorphAdorner output for "hot": regularised as "hight"

 

  • “an” is regularised to “and” (2 occurrences in A37215; 276 occurrences in A13300)

MorphAdorner output for "an": regularised as "and"

 

In each case the original spelling is an ordinary English word (something VARD would leave untouched) being used normally, and the token is lemmatised correctly, but the regularisation entry represents an entirely separate English word. These problems came up in the small sections we sampled. That they reflect a problem with MorphAdorner rather than our application of it is evident from a search for “hot” and “hight” using the “regularized spellings” option via the Early Modern Print interface.

In effect, this indicates that MorphAdorner’s lemmatisation may be significantly more reliable than the regularisation. It also means we may need to continue experimenting with VARD and MorphAdorner, and find automated ways (using comparisons) to look for other similarly ‘bugged’ terms.
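One possible automated comparison is sketched below: flag any token whose original spelling is already a standard modern word but whose regularised form differs. This approximates asking where VARD would leave a word untouched but MorphAdorner changes it. The function and the tiny word list are hypothetical stand-ins; the token pairs are the examples discussed in this post.

```python
def suspicious_regularisations(pairs, modern_words):
    """Return (original, regularised) pairs that deserve a second look."""
    flagged = []
    for original, regularised in pairs:
        already_standard = original.lower() in modern_words
        changed = original.lower() != regularised.lower()
        if already_standard and changed:
            flagged.append((original, regularised))
    return flagged


# Hypothetical stand-in for a modern English word list.
MODERN_WORDS = {"sight", "sighed", "hot", "an", "and", "urine", "do"}

pairs = [
    ("sight", "sighed"),
    ("hot", "hight"),
    ("an", "and"),
    ("Vrine", "Urine"),
    ("doeth", "doth"),
]

for original, regularised in suspicious_regularisations(pairs, MODERN_WORDS):
    print(original, "->", regularised)
```

A check like this would also flag legitimate changes such as “doe” to “do” or “bee” to “be” if those forms were in the word list, so its output is best treated as a shortlist for human review rather than a list of errors.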

 


Notes

1 It is possible to instruct VARD that a particular word or passage is from another language, and even to have it VARDed in that language (where a wordlist is available) but this requires human intervention at a close level.
2 A CQPweb query showed “doe” exceeding 6 instances per million words (ipmw) in 1575–1599, with “bee” at 88.21 ipmw in 1586—the latter disproportionately influenced by the 265 instances in Thomas Wilcox’s Right godly and learned exposition, vpon the whole booke of Psalms (≈ 704 ipmw).
3 In fact this already incorporates some replacement rules, so that “vrine” may be given the spe entry “urine” (cf. xml:id A13300_headed-018250); less productively, the rules also change “iuices” to “ivices” (A13300_headed-036790). Logically, creating such permutations increases the possibility that MorphAdorner will detect a direct match in its next step. It may be that the “spe” field is an ‘actual or alternative’ spelling field.


Texts sampled:

The extracts analysed comprised between 200 and 400 tokens from each of the following EEBO-TCP documents:

  • A04975: The pleasaunt playne and pythye pathewaye leadynge to a vertues and honest lyfe no lesse profytable, then delectable. V.L. [Valentine Leigh] London, ?1552. (STC 15113.5)
  • A13300: A rich store-house or treasury for the diseased Wherein, are many approued medicines for diuers and sundry diseases, which haue been long hidden, and not come to light before this time. Now set foorth for the great benefit and comfort of the poorer sort of people that are not of abilitie to go to the physitions. A.T. London, 1596. (STC 23606)
  • A14136: The obedie[n]ce of a Christen man and how Christe[n] rulers ought to governe…., William Tyndale, [Antwerp] 1528. (STC 24446)
  • A37215: The nature of the drink kauhi, or coffe, and the berry of which it is made described by an Arabian phisitian. English & Arabic. [Anṭākī, Dāʼūd ibn ʻUmar, d. 1599.] Oxford, 1659. (Wing D374)

Welcome to the Linguistic DNA blog!

Linguistic DNA cloud (created with Tagul)

The Linguistic DNA blog is a space for those working on the project to reflect on methodology, findings, and other aspects of the project in an informal way.

Fraser, Iona, and Seth (the research associates) will be taking it in turns to share what we have been working on.  At present, the website is gradually taking shape thanks to the Sheffield team, while Fraser is hard at work drafting conference papers.

Our chain of Linguistic DNA (image above) has been generated from the initial text of our website, using Tagul.