Category Archives: Blog Archive

Anthology cover

Talk About Change: LDNA at Festival of the Mind

Last weekend, Linguistic DNA & friends took over the Spiegeltent in Sheffield city centre, as part of the University’s Festival of the Mind. Spiegeltents are a Belgian invention: tents decorated internally with mirrors, creating the perfect space to share myriad reflections.

Over the course of two hours, we hosted a performance of new writing that emerged from collaboration with Our Mel (a Sheffield-based social enterprise dedicated to exploring cultural identity) and novelist Désirée Reynolds. Each of the pieces performed has also been published as part of a limited edition anthology: “Talk About Change: Writing as Resistance”.

The Researchers’ Introduction outlines a little more of the process that culminated in some extraordinary writing (excerpted from the print anthology):


Talk About Change: Writing as Resistance

Funded by the University of Sheffield’s Festival of the Mind, our collaborative workshops used examples of early modern word use (from the Linguistic DNA project and related research) as a starting point to think about language use today. How can the past speak to the present? How might the present speak to the past?

As reflected in the structure of this anthology, the workshops explored four central themes: diversity, feminism, immigration and race. These were selected by Annalisa and Désirée, who also provided the extra focus on “writing as resistance”. In each case, the Linguistic DNA researchers sought to introduce historic material that might prompt conversation about the themes—and perhaps even fuel the resistance. Some input built on prior research (especially for the feminism and immigration sessions, which drew on Iona’s thesis and engaged also with the 500 Reformations project). As often, it was a basic excursion into early modern material, with a beginners’ introduction to linguistics and the study of meaning (courtesy of Seth).

The most inventive work happened when we brought this material into the open sessions.

Together with all who attended the workshops, we compared the role of diversity in historic texts to its position in modern culture: what once characterised a multiplicity of opinion is now used paradoxically of something individual. We considered aspects of feminist debate before the word feminism existed, exploring how the power of virtue changed as men (mostly) discussed the role of women in sixteenth-century England. Using texts about strangers, we examined parallels between the way people wrote (and complained) about early modern outsiders and modern discourse about immigrants. We reflected on the roots of race, its links to kinship, descent, and community, and the relationship between structures of language and structures of power.

In each session, novelist and creative-writing facilitator Désirée Reynolds recommended other writings to bring out different dimensions of the themes. Wide reading was encouraged, and what you will find in the pages that follow reflects the careful crafting of a range of experience and inspiration drawing on at least five centuries of language use.

It is Writing as Resistance.

It comes from Talking About Change.


If you would like a copy of the anthology (free!), you can register interest (first come, first served) by filling out a short Google form.

(You can also read some words from the Editor, over on the 500 Reformations website.)

Linguistic DNA at SRS 2018: Abstracts

Knowledge, truth and expertise: experiments with Early English Books Online

Wondering what Linguistic DNA is bringing to the Society for Renaissance Studies? Here are the abstracts for two panels of papers, and information about our hands-on demonstration session (drop in).

United by a common interest in data-driven approaches to meaning and a focus on the transcribed portions of Early English Books Online (EEBO-TCP), this interdisciplinary panel brings together new research from the Linguistic DNA project and the Cambridge Concept Lab. 


What is EEBO anyway? Contextual study of a universe in print
Iona Hine and Susan Fitzmaurice (University of Sheffield)

Since 2015, the Linguistic DNA team has been developing methods for mapping meaning and change-in-meaning in Early Modern English. Our work begins with the hypothesis that meanings are not equivalent to words, and can be invoked in many different ways. For example, when Early Modern writers discuss processes of democracy, there is no guarantee they will also employ a keyword such as democracy. We adopt a data-driven approach, using measures of frequency and proximity to track associations between words in texts over time. Strong patterns of co-occurrence between words allow us to build groups of words that collectively represent meanings-in-context (textual and historical). We term these groups “discursive concepts”.
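
For readers wanting a concrete sense of what “measures of frequency and proximity” can look like in practice, the toy sketch below counts how often lemmas co-occur within a fixed window and lists the strongest associates of a seed word. It is an illustrative simplification, not the LDNA processor; the function names and window size are our own assumptions.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(docs, window=50):
    """Count how often pairs of lemmas occur within `window` tokens of one another.

    `docs` is an iterable of token lists (one list per text). This is a toy
    proximity measure for illustration only, not the project's processor.
    """
    counts = defaultdict(Counter)
    for tokens in docs:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if v != w:
                    counts[w][v] += 1
                    counts[v][w] += 1
    return counts

def associates(counts, seed, k=20):
    """Return the k lemmas most frequently found near `seed`."""
    return counts[seed].most_common(k)
```

Grouping words whose associates overlap heavily is one (much simplified) way to imagine how “discursive concepts” might be seeded from such counts.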

The task of modelling discursive concepts in textual data has been absorbing and challenging, both theoretically and practically. Our main dataset, transcriptions of texts from Early English Books Online (EEBO-TCP), contains more than 50 000 texts. These include 9000 single-page broadsheets and 162 volumes that span more than 1000 pages. There are 127 items printed pre-1500, and nearly 7000 from the 1690s. The process of analysis therefore requires us to think carefully about how best to control and report on this variation in data distribution.

One particular question that has arisen affects all who attempt to use EEBO: what is in it? To what extent is its material from pre-1500 similar in kind (genre, immediacy, etc.) to that of the messy 1550s (as the English throne shifted speedily between Edward VI and his siblings), the 1610s (era of Shakespeare and the King James Version), or the 1640s (when Civil War raged)? This paper is a sustained reflection on attempts to find out “What’s in EEBO?”


In the beginning was the word?
EEBO-TCP and another universe of meaning

Seth Mehl (University of Sheffield)

When a new idea is conceived, how does it find expression in language? Between 1450 and 1750, the English lexicon expanded dramatically, and literary scholars, philologists, linguists, and historians have sought to document and demonstrate the paths taken by key social and cultural vocabulary, charting the history of what would become key social and cultural ideas, discourses, and concepts. In such cases, the topic and language for investigation have been intuited on the basis of extended qualitative reading, and the objects of investigation tend to be individual words. With the advent of a searchable database of early modern texts, such intuitions can be tested at scale, and the initial object of inquiry can shift from individual words to relationships between sets of words.

What happens when we invert the traditional process, taking the thousands of texts digitised in EEBO-TCP and applying computational techniques to model language change independent of human intuition? Can such techniques indicate meaningful relationships between key words that human researchers had not intuited or observed? To what extent do observations founded on over 1 billion words of early modern English correspond to and diverge from what scholarly readers have already inferred? Is it possible to identify discourses around key ideas even when the apparently related key words are absent? Combining insights from the Keywords Project with tools developed by the Linguistic DNA project, this paper will explore how concept modelling can be applied to re-examine meaning in early modern texts.


Beyond Power Steering:
re-constituting structures of knowledge in 17th-century texts

John Regan (University of Cambridge)

One of the axioms of the Cambridge Concept Lab is that digital means of enquiry should provide qualitatively new kinds of knowledge, if we are to realise their full value. This is to say that computation should not merely provide ‘power steering for the humanities’, but allow one to discover something different in kind about how knowledge was structured in the past.

Making good on this axiom necessitates judgements on the part of the user of digital technology about how to design one’s modes of address to (for example) natural language data sets such as Early English Books Online-TCP, so that one is not merely adding ‘power steering’ to existing, familiar types of enquiry. It also necessitates making decisions about when to come to rest at results (that is, when to cease enquiry); judgements of where digital data can be said to be producing discrete and unfamiliar forms of knowledge.

This paper will present tentative first signs of what the Cambridge Concept Lab believe are historically-discrete conceptual structures, based on data from the early seventeenth-century portion of EEBO-TCP. Two such structures will be described, one entitled ‘Mutual Dependence’, the other ‘Self-Consistency’. As will be shown, familiar forms of knowledge that are held and expressed in sentences and paragraphs, organised by grammar and understood by readers largely as explicit sense, may be contrasted with this evidence of qualitatively different conceptual structures in the textual record. While this paper does not set out to debunk existing theories of the structuration of knowledge and its transmission in the seventeenth century as have become established through centuries of close reading, it does seek to enrich our understanding of these traditions by attending to conceptual, and not exclusively semantic, thematic or rhetorical, structures.

It appears uncontroversial to assert that concepts are determining with regard to features of language use such as explicit and implicit semantic fields, theme, word order, and syntactic relations at the level of the sentence. Nevertheless, recognising that concepts have lexical and semantic extension is not the same as accepting that the two are identical in kind. This paper’s claims about conceptual structure will be based upon evidence from the early decades of seventeenth-century data from EEBO-TCP.


Our afternoon panel is a little depleted (by ill-health) but features Jose M. Cree (Sheffield) on Neologisms and the English reformation, Lucas van der Deijl (Amsterdam) on The collaborative Dutch translations of Descartes by Jan Hendrik Glazemaker (1620-1682), and a little extra time for discussion.


DROP-IN SESSION

All SRS delegates are very welcome to drop in to our demo workshop, where we will be providing a 10-15-minute introduction to our tools (3:30pm, repeated at 4:30pm) and the opportunity for hands-on experimentation. This is in the Hicks Building, Floor G, room 29. (About 2 minutes’ walk from Jessop West, across the main road and a little uphill. Directions.)

Snapshot from campus map, featuring the Hicks Building.

Talk About Change

In a time when events seem ever and ever out of our control, writing is resistance.
–Our Mel.

In April (2018), Linguistic DNA began collaborating with local social entrepreneurs Our Mel to do some collective thinking about the power of language. This work is funded by the University of Sheffield’s Festival of the Mind and our work together will culminate in a spoken-word performance in the Festival’s Spiegeltent (pictured) this September.

The collaboration also involves 500 Reformations: exploring stories of change, from 1517 to 2018, a University of Sheffield public engagement project headed up by Linguistic DNA researcher Iona Hine.

Together, our goal is “TALK ABOUT CHANGE”.

More specifically, Talk About Change is pursuing conversations about the history and power of language, particularly as experienced by people of colour. The first sessions will incorporate a provocation based on historical research, working through themes including diversity, feminisms, race, and resilience. Talking, sharing, debating, we hope participants will join us and engage in acts of creative resistance—in thought, speech, and writing.

What are we actually doing?

Throughout July and August, novelist and creative writer Désirée Reynolds will be leading a series of workshops, hosted by Our Mel, to discuss words and themes including race, feminisms, and diversity. The July workshops are themed and will each include input from a University of Sheffield researcher. The August workshops continue to explore related ideas, developing creative writing under the common heading “writing is resistance”.

Those who choose may publish their writing in an anthology, and we will also present a collective spoken-word performance (optional!) on Sunday 23 September as part of the Festival of the Mind programme.

Who can participate?

Our Mel issue a collective invite to come along and engage in conversation about “words that affect us every day”. What have they meant, how are they used, and what do they mean to us?

People of all ethnicities are welcome, and embracing one’s heritage is encouraged. Participation is limited to those aged 18 and over.

Visit Our Mel’s website (ourmel.org.uk) for more information about the workshops.


ABOUT THE COLLABORATORS


OUR MEL

Logo for Melanin Fest

Rooted in Yorkshire and based in Sheffield, OUR MEL is a social enterprise dedicated to exploring cultural identity, Black history and what it means to be a person of colour in Britain today. Inspired by two local lasses (Annalisa Toccara & Gabriela Thompson-Menanteaux) on a journey of self-love, Our Mel was born in November 2016 over a pack of caramel biscuits and a cup of tea, Yorkshire of course. Since its birth, Our Mel has grown into a community of people on a mission to support, encourage, teach and build the community through music, film, arts and education. In October 2017, we launched Sheffield’s first collaborative Black History Month festival, MelaninFest, and its sister MelaninFest in London. 1300 people attended 43 events in Sheffield and 5 in London. Our Mel has been at the forefront of creating diversity, inclusion and representation in Sheffield since November 2016, working in collaboration with festivals and organisations both nationally and internationally. ourmel.org.uk  @our__mel


ANNALISA TOCCARA is a Marketer & PR professional, Community Activist & Creative Director. Based in Sheffield and founder of the social enterprise Our Mel, Annalisa launched Sheffield’s first Black History Month Festival, MelaninFest®, in October 2017, which saw a total of 43 events spread across the month in collaboration with over 40 partners; she also launched a sister festival in London. Since then, Annalisa has hosted a number of community events celebrating Black excellence, Black talent and Womanhood. Through her work with Our Mel and previous social justice endeavours, she has developed a passion for arts and culture, having seen first-hand how creative mediums can help shape and create social cohesion within our community. Annalisa also has a BA (Hons) in Biblical Study and Applied Theology with a Diploma in Leadership and is currently studying for her Chartered Marketer status. She is also the Vice-Chair of the BAMER Hub – Sheffield’s Equality Hub Network. ourmel.org.uk  @sparklelikegold


DÉSIRÉE REYNOLDS started her writing career in South London as a freelance journalist for the Jamaica Gleaner and the Village Voice. She has since written film scripts, poetry and short stories. Some of her shorts are published in SABLE E-Mag and various anthologies. Her first novel, “Seduce”, was published by Peepal Tree Press in 2013, to much acclaim. She continues to work as a journalist, teacher, broadcaster and DJ. Désirée is currently working on a collection of short stories, a novel based on the Haitian revolution and her PhD. — “After spending a lot of time, doing lots of things, I’m finally where I’m supposed to be, doing what I’m supposed to do.”
desireereynolds.co.uk  peepaltreepress.com/authors/desiree-reynolds
youtu.be/qkNrQ-HMwLs  peepaltreepress.com/books/closure
@desreereynolds


500 REFORMATIONS

500 REFORMATIONS collaborates with external partners to explore and tell stories of change, from the cultural to the personal. Based at the University of Sheffield, 500 Reformations draws on research from across the Faculty of Arts and Humanities. Activities are united by the theme of reformation, whether writ big (as e.g. churches breaking away from Roman Catholic control in the sixteenth century, ‘the Reformation’) or small (in individual stories of change, development and re-form). 500reformations.group.shef.ac.uk @500Reformations

Translation, Gender, Sexuality: a report from Genealogies of Knowledge 2017

In December 2017, Sheffield MA student Nathaniel Dziura attended part of the Genealogies of Knowledge conference in Manchester. While the LDNA team were exchanging conceptual insights with other data-driven scholars, Nathaniel participated in sessions connected to a different field of interest. He writes:

As a member of the LGBTQ+ community, I am keen to contribute to research on how social factors impact language use, particularly gender and sexuality. As a second-generation Polish immigrant, raised with influence from both Polish and English culture, I am also very interested in the effect cultural background can have on the production of linguistic features.

Next year, I hope to start a PhD focused on this interplay between social and linguistic elements. Schumann (1978) suggested that the degree of ‘acculturation’ influences use of non-standard variants in second language learners. In other words, if the speaker is more immersed in the culture of their second language, they will be more likely to acquire native speaker-like linguistic variation. However, previous studies have not considered how other social factors such as sexuality might affect which features are acquired. This is despite previous studies having shown certain linguistic features to be cross-culturally associated with LGBTQ+ membership. These features include fronted-/s/ (Levon, 2006; Pharao et al., 2014) – colloquially stereotyped as the ‘gay lisp’ – and creaky-voice (Zimman, 2013: 3) – speaking with a low elongated ‘creak’, like a stereotypical ‘valley girl’. LGBTQ+ people do not inherently use these features, but they can play an important part in interaction (Barrett, 2017: 9).

I want to help fill this gap in the research by investigating how sexuality might affect the linguistic variants acquired in English by second language speakers (specifically, Polish migrants to England). I will examine whether the use of these features differs depending on two variables: the level of integration into British culture, and the level of involvement with the LGBTQ+ community.

This was the project I had in mind as I headed to Manchester for the conference. I was rewarded by an excellent thematic session on ‘Translation, Gender, Sexuality’.

I found Przemysław Uściński and Agnieszka Pantuchowicz’s presentations to be pertinent and insightful. Uściński’s talk focused on the pitfalls of approaching Queer Theory in Poland from a ‘Western perspective’. The political environments in Poland and England have differed historically, and continue to do so. Uściński argues that ‘LGBT emancipation’ has not yet occurred in Poland. Critical theorisations of gender are intentionally scarce in Polish academic discourse. The reception of Queer Theory in academia has been comparatively belated, and has sometimes discredited the LGBTQ+ movement. British society has its share of problems with LGBTQ+-phobia. Yet Poland has seen much far-right and religious rejection of the LGBTQ+ community. These groups have dismissed LGBTQ+ identities as ‘Western secular propaganda’ and ‘gender ideology’. So, English translations of concepts within Queer Theory, which are gradually being introduced to Polish academic works, reflect English notions and societal progress. Even when concepts from Queer Theory enter Polish, there is no possibility for their dissemination within Polish society. Queer Theory tends to be viewed as a ‘foreign’ and subversive concept: a theoretical importation into Polish from English, and not one congruous with Polish culture.

In another paper, Pauline Henry-Tierney noted that misinterpretations in translation of Beauvoir’s ‘Mauvaise Foi’ have slowed academic progress on the subject. Taking this into account, perhaps misinterpretations of Queer Theory as a ‘foreign’ concept to Poland are hindering the normalisation of LGBTQ+ concepts and perpetuating their perception as something radical and provocative.

This thematic session highlighted that introducing concepts into a language through translation can be a step towards spreading those ideas within another culture. However, this alone might not be enough to achieve society’s understanding and acceptance of those concepts. The translation of Queer Theory between cultures was not an issue I had previously considered. This thematic session reinforced that the political and social environments in Polish and English culture exhibit stark differences. This is significant within the framework of acculturation: LGBTQ+ community membership is arguably more accepted in British culture, and consequently so are associated non-standard language features. So one might predict that LGBTQ+ Polish migrants to England who become more British-acculturated are more likely to produce non-standard features associated with LGBTQ+-community membership than those who are less British-acculturated.

Overall, I was able to interact with academics from areas such as translation studies and politics with whom I would not otherwise be able to network. I am very grateful to the Linguistic DNA team for inviting me to attend the conference. The insights it has given me will be useful in my academic pursuits!


Featured image:
Jaap Verheul (Utrecht) presents an example from ShiCo research at the Genealogies of Knowledge conference, 8 December. Photo (c) I.C. Hine.


References:

Barrett, R. (2017) From Drag Queens to Leathermen: Language, Gender, and Gay Male Subcultures (Studies in Language Gender and Sexuality) Oxford: Oxford University Press

Henry-Tierney, P. (2017) ‘Translating in ‘Bad Faith’? Articulations of Beauvoir’s ‘Mauvaise Foi’ in English’, Genealogies of Knowledge I: Translating Political and Scientific Thought across Time and Space, Manchester: University of Manchester

Levon, E. (2006) ‘Hearing “Gay”: Prosody, Interpretation, and the Affective Judgments of Men’s Speech’ American Speech 81 (1): 56–78

Pantuchowicz, A. (2017) ‘Translation and the Failure of Gender Mainstreaming in Poland’ Genealogies of Knowledge I: Translating Political and Scientific Thought across Time and Space, Manchester: University of Manchester

Pharao, N., M. Maegaard, J. S. Møller & T. Kristiansen (2014) ‘Indexical meanings of [s] among Copenhagen youth: Social perception of a phonetic variant in different prosodic contexts’ Language in Society 43, 1–31

Schumann, J. H. (1986) ‘Research on the acculturation model for second language acquisition’ Journal of Multilingual and Multicultural Development 7: 379–392

Uściński, P. (2017) ‘Thinking Sexuality/Translating Politics: Queerness in(to) Polish’ Genealogies of Knowledge I: Translating Political and Scientific Thought across Time and Space, Manchester: University of Manchester

Zimman, L. (2013) ‘Hegemonic masculinity and the variability of gay-sounding speech: The perceived sexuality of transgender men’ Journal of Language & Sexuality 2 (1): 1-39


Seth and Iona present a joint paper with LDNA data at Genealogies of Knowledge. Photos (c) Jaap Verheul.

Quantity and quality: lessons from an MA work placement

Sheffield MA student Nadia Filippi reflects on her experience after 100 hours with the Linguistic DNA team at DHI | Sheffield:


As part of my MA studies in English Language and Linguistics, I had the opportunity to undertake a work placement of 100 hours at the University of Sheffield’s Digital Humanities Institute. The placement offered a good overview of the typical tasks and responsibilities of a researcher and was an excellent choice for me because I am interested in doing research and I am considering going on to PhD research.

When registering for the placement module, I only had basic knowledge of corpus linguistics. I was accustomed to qualitative research but wanted to discover quantitative methodologies and the possibilities that quantitative research can offer. Starting my placement, I was at a stage in my studies in which I was still looking for definite answers to all my questions about research. Moreover, I respected everything to do with numbers, but the idea of actually ‘doing statistics’ made me nervous. I consciously chose a placement to force myself out of my qualitative comfort zone.

My concerns resolved themselves during the placement. I had to familiarise myself with and use statistical software packages like SPSS, and lost my initial fear. I began to understand how statistics could be used effectively to address questions and find information that qualitative research could not deliver in a timely manner; for example, finding out which words frequently co-occur in a large dataset. Furthermore, I came to understand that doing research does not exclusively mean narrowly focusing on finding a clear answer to an initial research question. It is often more about refining the question, developing another one and accepting that there can be more than one right answer to it.

The power of the Digital Humanities Institute lies in quantitative analysis, engaging with statistical distribution, auditing datasets and computational methods. Yet, there is still qualitative work to do. For instance, I audited and reported on qualities of the YouTube dataset, wrote summaries of previous research and searched for suitable approaches or tools (e.g. a Part-Of-Speech tagger suited to social media data), by consulting published research from similar projects.

A YouTube Convert

It turned out that the placement as a whole, the experiences I had and the tasks I was given shaped my other studies. At the beginning of my placement, the Linguistic DNA team had just started providing support for the Militarization 2.0 project, in collaboration with the University of Leeds. I was immediately drawn in by this study of YouTube gaming discussion and it ultimately gave me an idea for my MA dissertation.

I had the chance to look through some of the 6.7m YouTube comments gathered by Nick Robinson and his team at the University of Leeds, and think through how they might be analysed for concept modelling.

Screenshot showing comments on Battlefield 1 official trailer, via YouTube (15 May 2017). https://www.youtube.com/watch?v=c7nRTF2SowQ

In exploring the comments, I had to consider the characteristics of commenters’ language and reflect on the research questions. Gaming language, for example, is filled with specialist abbreviations such as “CoD:ww2”, which stands for the game Call of Duty: WWII. Information about nationalities (“the Germans”) and militarised language (“disabled”, “destroyed”) may also be key to answering questions about how users’ remarks connect with video content. Close reading of excerpts helps to inform how the Sheffield team respond to the main interests of the parent project Militarization 2.0: if and how social media is militarized, and what effect that has on society and individual citizens.

By attending meetings, I gained insights into the process and decision-making in a big research project. This included, for example (see the sketch after this list):

  • preparing big data (should we standardise the spelling of the comments or not?)
  • practical obstacles, such as YouTube’s technical limitations (which prevent us from retrieving all the answers to a specific comment)
  • deciding which variables to include (time, author, number of likes)
  • time and scope (how can the resources available be matched to the aims and desired outcomes of a project?)
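
To make the first and third points more tangible, here is a minimal sketch of the kind of preprocessing such comment data might go through: lightly normalising spelling and keeping a few chosen variables. The field names and the tiny abbreviation table are invented for illustration; this is not the project’s actual pipeline.

```python
import re

# A tiny, invented normalisation table; a real project would need a much
# larger, carefully audited resource for gaming slang and abbreviations.
SPELLING_MAP = {"u": "you", "gr8": "great", "ww2": "world war 2"}

def normalise(text):
    """Lower-case a comment and expand a few known non-standard spellings."""
    tokens = re.findall(r"[a-z0-9':]+", text.lower())
    return [SPELLING_MAP.get(tok, tok) for tok in tokens]

def prepare(comment):
    """Keep only the variables chosen for analysis (time, author, likes, text)."""
    return {
        "time": comment["published_at"],
        "author": comment["author"],
        "likes": comment["like_count"],
        "tokens": normalise(comment["text"]),
    }
```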

Knowing the kinds of challenges that such a project can face was helpful in planning my dissertation, which I will be writing over the summer. Prompted by the DHI’s YouTube work, my research will discuss the kind of language generated by exposure to military video game trailers and investigate if there is a difference between the language produced online and offline. In undertaking this research, I will work with my own corpus of YouTube comments as well as with focus groups. The qualitative aspect of my dissertation will allow me to explicitly address and discuss the violence in these game trailers within my focus groups.

Overall, the work placement has been one of the most valuable and enjoyable modules of my MA. I developed many new skills, academically as well as personally. I am more confident about quantitative approaches and numbers, as well as the importance of humanities research as a whole.


Top image shows Sheffield MA student Nadia Filippi at the Linguistic DNA and Militarization 2.0 stand at the 2017 Festival of Arts & Humanities Showcase, Sheffield. The showcase was “a fantastic opportunity to open a dialogue about humanities research and its impact with the public”.

Looking back, looking forward: Linguistic DNA in 2016 and 2017

As we move into 2017, we’ve been looking back at achievements in 2016, and ahead to what we aim to achieve in the coming year.

2016 was an outwardly busy year as we travelled to Bruges, Essen, Krakow, Lausanne, Leeds, Brighton, Murcia, Nottingham, Paris, Saarbrücken, and Utrecht, sharing more of our thinking and early data with different audiences. Closer to “home”, we benefitted from the exchange of ideas with LDNA-hosted panels at Sheffield DH Congress and our second methodological workshop in Sussex. In 2017, we will be focusing back on our interface development and some more in-depth research, though we intend to be present at DH, SHEL, ICAME and SHARP, in order to continue some fruitful conversations.

On the blog, we have been reflecting on representativeness and the nature of EEBO-TCP. We’ve also documented our decision not to use ECCO’s OCR data to analyse eighteenth-century print. You can expect to hear about the alternative eighteenth-century datasets we’re choosing to work with later in 2017.

During the Autumn, the LDNA researchers collaborated on two articles about the project, its theory and praxis, both (hopefully) to be published this year following peer review. Generating examples from each research theme based on our early data and tying these together effectively was an enjoyable challenge, and we have already used the draft of one piece as part of our briefing materials for upcoming MA placements at The Digital Humanities Institute | Sheffield (formerly known as HRI Digital).

In the past six months, the Sheffield team have secured funding for two additional applications of the Linguistic DNA “concept modelling” tools:

  • The ESRC project Ways of Being in a Digital Age combines our quantitative insights with a qualitative literature survey of academic publications. Scheduled to inform the ESRC’s next programme of digital society funding, this impactful study has compelled us toward rapid prototype development. The interface being put together to serve ‘WoBDA’ colleagues will also form the kernel of the subsequent LDNA workbench.
  • From next month, we are involved in another funded impact-related project, collaborating with the University of Leeds to explore the conceptual structure of millions of YouTube video comments on the theme of militarisation, as part of a larger project funded by the Swedish Research Council. This is a six-month commitment, bringing in a further research associate to theorise what’s involved in applying our measures to some very different data.

We also have three significant applications in place for other pots of funding, including Horizon 2020 collaborations, attesting to confidence in our nascent processes and the multifarious opportunities for their application and impact.

Meanwhile, Glasgow has been using the present word co-occurrence data to develop its methodology for investigating processor data from the perspective of key Historical Thesaurus categories. We have continued to develop analysis of Thesaurus categories, looking for those which show abnormal instances of growth or decline; a provisional methodology for establishing statistical ‘baselines’ has been plotted out and is now being implemented and refined. Further possibilities are being tested, such as amalgamating data across whole layers of the HT hierarchy rather than by individual category, and the effects of separating out part of speech within categories or layers.

Lost Books

On “Lost Books” (ed. Bruni & Pettegree)

Review: Lost Books: Reconstructing the Print World of Pre-Industrial Europe. Ed. Flavia Bruni and Andrew Pettegree. Library of the Written Word 46 / The Handpress World 34. Leiden & Boston: Brill, 2016. 523 pages.


We solicited this book for review because we have been keenly aware that we cannot take what has been transcribed and preserved through the digitisation processes of Early English Books Online and the Text Creation Partnership as an accurate indication of all the material that was printed in the early modern period. Setting aside the idiosyncrasies of selectivity in the composition of EEBO-TCP, which have been documented elsewhere, there is a prior ‘selectivity’ about what survived to be catalogued.

The volume collects together the proceedings of the Lost Books conference held at the University of St Andrews in June 2014, and divisions within the volume loosely reflect those of the original call.

Pettegree’s introduction, “The Legion of the Lost” is a full-length essay discussing not only how books become lost but how one can know about what has been lost. It is accessible and engaging and would be a worthy reading assignment for undergraduates or masters students studying book history. As observed in a prior blogpost, “While the chapter … performs the function of uniting what follows, and does at times point to specific contents in the coming chapters, there is nothing of the clunkiness that one sometimes observes in the introduction of an edited collection.”

The two essays that follow both approach the challenge of assessing the loss of incunabula, i.e. print materials from pre-1500. Falk Eisermann begins with a comparison of the listings in the Gesamtkatalog der Wiegendrucke with the Incunabula Short Title Catalogue. He probes possible methods for distinguishing items that were printed (and lost) from items never printed, giving examples from archival sources that defy expectation: “lost editions by unknown printers (sometimes located in incunabulistic ghost towns), containing texts not preserved anywhere else, even representing works of hitherto unrecorded authors” (43). The book historians’ task, one may imagine, is an uphill struggle; optimistically, there is fresh work to be done as no one has yet analysed the customary discussion of other printed works in paratext “with regard to dark matter” (50). Jonathan Green and Frank McIntyre (Chapter 3) aim to quantify the losses, offering an open discussion of the pitfalls of particular statistical approaches to this question. They recommend modelling the counts of surviving copies as a negative binomial distribution, accommodating correlation in loss and survival. For—and this is significant to LDNA—“books are not preserved or destroyed independently of each other” (59). Small items are more likely to survive if bound together; volumes in a library often share a common destiny. In addition, taste is a cultural construct with ideas of fashion and significance affecting more than one owner’s decision to dispose of or conserve. Taking into account variations of format, Green and McIntyre suggest that as much as 30 per cent of quarto editions may have been lost entirely, compared with 60 per cent of broadsides and 15 per cent of folios.
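
As a rough illustration of the logic (not of Green and McIntyre’s actual model or figures), the snippet below shows how a fitted negative binomial distribution implies a share of editions with zero surviving copies, i.e. editions lost entirely. The parameter values are invented.

```python
from scipy.stats import nbinom

# Hypothetical fitted parameters for the number of surviving copies per edition:
# r (shape parameter) and p (success probability). These are made-up values.
r, p = 0.5, 0.3

# Probability that an edition has zero surviving copies, which under this
# (invented) fit is the estimated share of editions lost entirely.
estimated_loss_rate = nbinom.pmf(0, r, p)
print(f"Implied share of wholly lost editions: {estimated_loss_rate:.0%}")
```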

Part 2 is composed of national case studies covering vocal scores from Renaissance Spain (Chapter 4, showing a markedly persistent repertoire conserved by copying when required); evidence of book ownership and circulation in pre-Reformation Scandinavia (Chapter 5, conducted with the help of inventories); the meticulous reconstruction of a lost Polish original on the basis of later editions (Chapter 6, touching also on the circulation of fortune-telling books throughout early modern Europe); a study of the Stationers’ Company Register (Chapter 7); a sheet-count-based model for calculating loss of seventeenth-century materials based on records for the Southern Netherlands—using the metadata-rich STCV, which also positions title-page engraving and roman typeface as features positively correlated with survival (Chapter 8); the identification of patterns of loss using book advertisements from the Dutch Republic (Chapter 9—exposing partly the proliferation of multiple localised editions); and a report weaving together a census of seventeenth-century Sicilian printing activity with a legal dispute over the library of Francesco Branciforti, attesting strong local attachment to this private collection (Chapter 10).

In Part 3, Christine Bénévent and Malcolm Walsby revisit the publication history of Guillaume Budé’s apophthegms (Chapter 11), combining careful study of the layout to demonstrate that Gazeau’s compositor pretended to a new edition by replacing the first quire, with a call not to dismiss the “intellectual value” of later editions, noting that the Paris copy of De L’Institution du Prince had the highest survival rate and was owned “by the most influential and powerful in early modern Europe” (including Edward VI, 252). Michele Camaioni aims to reconstruct a censored (but popular) mystical text using its censorship record (Chapter 12). Three further chapters draw on data from the RICI project, a study of Italian religious orders’ book ownership based on a Vatican-led census: Rosa Marisa Borraccini documents Girolamo de Palermo’s “unknown best-seller”, a devotional work running to “plausibly . . . more than one hundred” editions (Chapter 13); Roberto Rusconi probes weaknesses in the cataloguing, involving misspelt transcriptions, inadequate shorthand (opera omnia, etc.) and perhaps the deliberate disguising of works by disapproved authors (Chapter 14); and Giovanni Granata attempts to merge statistical extrapolation of lost works with study of specific lost editions based on the bibliographic records produced by the census (Chapter 15).

Part 4 is dedicated to lost libraries. Anna Giulia Cavagna observes the motives of Alfonso del Carretto, an exiled monarch whose self-catalogued collection prioritised texts pertaining (mostly through paratext such as dedications) to people whose powerful patronage he wished to secure, revealing books as “vectors of social relations” (357, Chapter 16). Martine Julia van Ittersum pursues the preservation and loss of Hugo Grotius’ personal collections, observing that preservation required “neglect, though not too much of it” (384) and that the preservation of printed materials was correlated with loss of manuscript (Chapter 17). Federico Cesi, the target of Maria Teresa Biagetti’s study (Chapter 18), was the founder of the Accademia dei Lincei in Rome; his now dispersed collection included works of botany, zoology, alchemy, and medical texts, its components known through correspondence and post mortem inventory. Sir Hans Sloane’s collections, including printed books “estimated at about 45,000 volumes”, formed the kernel of what is now the British Library; Alison Walker explains the difficulties of tracing Sloane’s books, which when duplicated by other collections were often dispersed through sale or gifting, or migrated at the creation of new specialist institutions such as the Natural History Museum. By reconstructing the collection, Walker argues, one may attain a “reflection . . . of the intellectual environment of the day” and of “Sloane himself as a scientist and physician” (412, Chapter 19). The last chapter in Part 4 outlines the hopes of the AHRC research network ‘Community Libraries: Connecting Readers in the Atlantic World’, using a case study from Wigtown (NW Scotland) to show how archival resources about the creation and use of libraries yield insight into sociability (Chapter 20); we find widows borrowing while patrons gain more from the bureaucracy and facilitation than the library’s holdings.

The last section (Part 5), entitled “War and Peace”, considers the woes that have befallen historic collections in more recent times. Jan Alessandrini discusses Hamburg’s collections, protection measures during the Second World War, the seizure of private Jewish libraries, and the political challenges of reconstruction (with some prospect of help from Russian digitisation, Chapter 21). Tomasz Nastulczyk acknowledges that “Swedish pillaging paradoxically helped to preserve” books from the Polish-Lithuanian Commonwealth that might otherwise have been lost (462, Chapter 22). Co-editor Flavia Bruni writes of the successful preservation of Italian archives and libraries aided by “a clear and centralised policy” in WW2, arguing that “international agreements” are also essential if cultural heritage is to be preserved (484, Chapter 23). The closing chapter is devoted to broadsheet ordinances, lost—or perhaps missing—as a result of the collapse of Cologne city archives in 2009; happily, microfilm means all is not lost, and Saskia Limbach also successfully traces invoices and other evidence of print activity through a range of archival sources (Chapter 24).

It will be evident from this account that the case studies are drawn from across Europe, with three chapters directly addressing British material. Of these, Alexandra Hill’s intersects most closely with the period Linguistic DNA has focused on so far, with the Register containing “with some exceptions [e.g. government publications and school books], . . . all the books authorised to be printed during the Elizabethan, Jacobean and early Caroline periods” (144–5). Comparing this information with the English Short Title Catalogue, Hill shows that for the 1590s, the survival rate of fiction and ballads is significantly lower than that of other genres of publication; in addition, within a relatively well-preserved domain such as religious literature, subcategories may fare disproportionately badly, as is the case for prayer books, destroyed—Hill hypothesises—by continual use. These kinds of absences need to be borne in mind as we proceed to analyse the survivors. Of course, given the cultural traffic of early modern Europe, much of what is learned from non-British collections is also relevant for thinking critically about how texts survived, how others were lost, and how Linguistic DNA should correspondingly limit the claims built on the print discourse of EEBO-TCP.


As of summer 2018, Lost Books is now open access, and freely available online for all to read.

The Edge

LDNA at Digital Humanities Congress 2016, Sheffield

LDNA organised two panels at the 2016 Digital Humanities Congress (DHC; Sheffield, 8th-10th September). Both focused on text analytics, with the first adopting the theme ‘Between numbers and words’, and the second ‘Identifying complex meanings in historical texts’. Fraser reports:



What does EEBO represent? Part II: Corpus linguistics and representativeness

What exactly does EEBO represent? Is it representative?

Often, the question of whether a corpus or data set is representative is answered first by describing what the corpus does and does not contain. What does EEBO contain? As Iona Hine has explained here, EEBO contains Early Modern English, but it is much larger than that in some ways, and also much more limited than that. EEBO contains many languages other than English, which were printed in the British Isles (and beyond) between 1476 and 1700. But EEBO is also limited: it contains only print, whereas Early Modern English was also hand-written and spoken, across a large number of varieties.

Given that EEBO contains Early Modern print, does EEBO represent Early Modern print? In order to address this question meaningfully, it’s crucial first to define representativeness.

In corpus linguistics, as in other data sciences and in statistics, representativeness is a relationship that holds between a sample and a population. A sample represents a larger population if the sample was obtained rigorously and systematically in relation to a well-defined population. If the sample is not representative in this way, it is an arbitrary sample or a convenience sample – i.e. it was not obtained rigorously and systematically in relation to a well-defined population. Representativeness allows us to examine the sample and then draw conclusions about the population. This is a fundamental element of inferential statistics, which is used in data science from epidemiology to corpus linguistics.

Was EEBO sampled systematically and rigorously in relation to a well-defined population? Not at all. EEBO was sampled arbitrarily, by convenience – first, including only texts that have (arbitrarily) survived; then including texts that were (arbitrarily) available for scanning and transcription; and, finally, including those texts that were (arbitrarily) of interest to scholars involved with EEBO at the time. Could we, perhaps, argue that EEBO represents Early Modern print that survived until the 21st century, was available for scanning and transcription, and (in many cases) was of interest to scholars involved with the project at the time? I think we would have to concede that EEBO wasn’t sampled systematically and rigorously in relation to that definition, and that the arbitrary elements of that population render it ill-defined.

So, what does EEBO represent? Nothing at all.

It’s difficult, therefore, to test research questions using inferential statistics. For example, we might be interested in asking: Do preferences for the near-synonyms civil, public, and civic change over time in Early Modern print? We can pursue such a question in a straightforward way, looking at frequencies of each word over time, in context, to see if there are changes in use, with each word rising or falling in frequency. In fact, we can quite reliably discern what happens to these preferences within EEBO. But our question, as stated, was about Early Modern print. It is the quantitative step from the sample (EEBO) to the population (Early Modern print) that is problematic. Suppose that we do find a shifting preference for each of these words over time. Because EEBO doesn’t represent the population of Early Modern print in any clear way, we can’t rely on statistics to conclude that this is in fact a correlation between preferences and time – or whether it is, instead, an artefact of the arbitrariness of the sampling. The observation might be due to any number of textual or sociolinguistic variables that were left undefined in our arbitrary sample – including variation in topics, or genres, or authorial style, or even authors’ gender, age, education, or geographic profile.
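
A minimal sketch of the “straightforward” step described above, counting relative frequencies of civil, public and civic per decade within the sample, might look like the following. The (year, tokens) corpus structure is assumed for illustration; it is not an actual EEBO-TCP interface.

```python
from collections import Counter

def decade_frequencies(corpus, words=("civil", "public", "civic")):
    """Relative frequency (per million tokens) of each word, by decade.

    `corpus` is an iterable of (year, tokens) pairs; purely illustrative.
    """
    totals = Counter()
    hits = {w: Counter() for w in words}
    for year, tokens in corpus:
        decade = (year // 10) * 10
        totals[decade] += len(tokens)
        counts = Counter(tokens)
        for w in words:
            hits[w][decade] += counts[w]
    return {
        w: {d: hits[w][d] * 1_000_000 / totals[d] for d in sorted(totals)}
        for w in words
    }
```

Such figures describe the sample itself; the argument in this post is precisely that extending them to Early Modern print requires more than the arithmetic.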

It is as though we were testing children’s medication on an arbitrary group of people who happened to be walking past the hospital on a given day. That’s clearly a problem. We want to be sure that children’s medication was tested on children – but not simply children, because we also want to be sure that it isn’t tested on children arbitrarily sampled, for example, from an elite after-school athletics programme for 9-year-olds that happens to be adjacent to the hospital. We want the medication to be tested on a systematic cross-section of children, or on a group of children that we know is composed of more and less healthy kids across a defined age range, so that we can draw conclusions about all children, based on our sample. If we use a statistical analysis of EEBO (an arbitrary sample) to draw conclusions about Early Modern print (a population), it’s as though we’re using an arbitrary sample of available kids to prove that a medication is safe for the population of all kids. (Linguistics is a lot safer than epidemiology.)

If one were interested in reliably representing extant Early Modern print, one might design a representative sample in various ways. It would be possible to systematically identify genres or topics or even text lengths, and ensure that all were sampled. If we took on such a project, we might want to ensure sampling all genders, education levels, and so on (indeed, historical English corpora such as the Corpus of English Dialogues, or ARCHER, are systematically sampled in clear ways). We would need to take decisions about proportionality – if we’re interested in comparing the writing of men and women, for example, we might want large, equal samples of each group. But if we wanted proportional representation across the entire population of writers, we might include a majority of men, with a small proportion of women – reflecting the bias in Early Modern publishing. Or, we might go further and attempt to represent not the bias in Early Modern publication, but instead the bias in Early Modern reception, attempting to represent how many readers actually read women’s works compared to men’s works (though such metadata isn’t readily available, and obtaining it would be a project in itself). Each of these decisions might be appropriate for different purposes.
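
As a sketch of what “sampled systematically” could mean in practice, the function below draws either a fixed number or a proportional number of texts from each stratum (genre, author gender, and so on). The metadata fields and record structure are hypothetical.

```python
import random

def stratified_sample(records, stratum_key, per_stratum=None, fraction=None):
    """Draw texts stratum by stratum: equal-sized samples (per_stratum)
    or proportional ones (fraction). `records` and `stratum_key` are
    hypothetical metadata structures, for illustration only."""
    strata = {}
    for rec in records:
        strata.setdefault(stratum_key(rec), []).append(rec)
    sample = []
    for group in strata.values():
        n = per_stratum if per_stratum is not None else round(len(group) * fraction)
        sample.extend(random.sample(group, min(n, len(group))))
    return sample

# Equal samples of men's and women's writing:
#   stratified_sample(texts, lambda t: t["author_gender"], per_stratum=200)
# Proportional representation of the same population:
#   stratified_sample(texts, lambda t: t["author_gender"], fraction=0.05)
```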

So, what are we to do? LDNA hasn’t thrown stats out the window, nor have we thrown EEBO out the window. But we are careful to remember that our statistics are describing EEBO rather than indicating conclusions about a broader population. And we haven’t stopped there – we will draw conclusions about Early Modern print, but not via statistics, and not simply via the sample that is EEBO. Instead, we will draw such conclusions as close readers, linguists, philologists, and historians. We will use qualitative tools and historical, social, cultural, political, and economic insights about Early Modern history, in systematic and rigorous ways. Our intention is to read texts and contexts, and to evaluate those contexts in relation to our own knowledge about history, society, and culture. In other words, we are taking a principled interpretive leap from EEBO to Early Modern print. That leap is necessary, because there’s no inherent representative connection between the two.

Under the surface: SHARP, LDNA and sundry sources

This blog post excerpts material Iona wrote reflecting back on her contribution to the SHARP conference in Paris in July 2016, building on the work of her PhD thesis and incorporating material and processes that have formed part of the Linguistic DNA project. The full post can be found on Iona’s personal blog.


In preparation for the paper, I dedicated time to manually extract, compile and refine measurements for some of the early outputs from the LDNA processor. To fit in with the pledges of my abstract, I targeted the associations of valour and valiant in subsets of EEBO-TCP.

During my PhD, I used EEBO-TCP to provide context for my work with early modern bibles. Valour entered the equation as I examined trends in the translation of a Hebrew collocation gibbor chayil. In the King James Version (publ. 1611) most gibbor chayil men are “mighty . . . of valour”. The repetition of this phrase across the translation means that English bible readers could form associations between the group of characters referred to, in a similar manner to those who encounter the Hebrew narrative directly. For this to happen in translation shows that the translators recognised and (sometimes) prioritised the transmission of this connection; in this respect “mighty of valour” is a partial example of a larger trend in favour of a more technical approach to translation, a move likely influenced by the increasing use of precise cross-referencing in bible reading (facilitated by the introduction of verse numbers throughout the Bible, an innovation of the 1550s). Yet the phrase is intrinsically interesting because before that “valour” was not part of the English biblical lexicon.

Collating instances of gibbor chayil demonstrates that the lexically related “valiant” was used in earlier translations, but in a piecemeal manner (illustrated by the changing distribution of black square bullets in the diagram below).


This diagram, extracted from my SHARP presentation, is one of a series colour-coded to highlight consistency within individual versions with a focus on the characterisation of Boaz. The black square bullets are added to highlight where a form of ‘valiant’ (or for KJ ‘valour’) was used.

By exploring the words valiant and valour with the LDNA tools, I was able to corroborate the impression I had formed during my earlier quantitative and qualitative analysis, which was conducted via a standard EEBO-TCP interface.

The PhD bit

Searching hits in the population for the first century of English print (to 1570) and comparing that with the next half century (a collection of documents three times the size), I had observed that the frequency of both valiant and valour increased markedly above expectation.


Comparison of word frequency (hits) and distribution (records, hits per record) in EEBO-TCP for 1473-1570 (P1) and 1571-1620 (P2) expressed in ratios.
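
The “expectation” here is simply a size-adjusted baseline: if the 1571-1620 collection is roughly three times as large, one would expect roughly three times as many hits, other things being equal. With invented numbers:

```python
p1_hits, p1_tokens = 1_000, 30_000_000   # hypothetical counts for 1473-1570
p2_tokens = 3 * p1_tokens                # the 1571-1620 collection is ~3x larger
expected_p2_hits = p1_hits * p2_tokens / p1_tokens   # = 3,000
# An observed count well above this baseline is what "above expectation" means.
```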

Scrutinising the data by decade exposed some significant textual influences. To quote from my thesis:

87 per cent of occurrences of “valiant” in the corpus for 1520-1529 (316 of a total 363) appear in a two-volume translation of the French chronicles of Froissart, while two other translated works account for a further 9 per cent; just 4 per cent of hits occur in ‘indigenous’ texts.

For “valour”,

a jump in the decade 1570-1579 is significantly related to the publication in 1579 of a translation from Italian: 403 of the decade’s 501 hits appear in a one-volume translation of The historie of Guicciardin conteining the vvarres of Italie and other partes (London, 1559). Once such scrutiny is imposed, it becomes evident that translation had a significant role in the increased currency of these two Latinate terms. It is also evident that the words normally appear in certain genres: conduct books concerned with warfare and chivalric behaviour; and chronicles of past history. This contributes to the recognisable sense of valour as “The quality of mind which enables a person to face danger with boldness or firmness; courage or bravery, esp. as shown in warfare or conflict; valiancy, prowess.”[ OED s.v. “valour|valor, n.”, §1c.] This sense, cultivated through translation in the course of the sixteenth-century, fits the context in which King James’ translators employ the word.

The LDNA bit

The subsets of EEBO-TCP sent through the LDNA processor earlier in the year were intentionally compatible with the periodisation of my thesis, providing windows onto English discourse that could be cross-referenced with the publication of particular bibles. The subsets thus incorporate all transcribed material from EEBO (TCP update 2015) known to have been printed during the following spans:

  • 1520-1539 (cf. Coverdale Bible 1535, Matthew Bible 1537, Great Bible 1539)
  • 1550-1559 (Geneva Bible 1560, Bishops Bible 1568); and
  • 1610-1611 (Douai Old Testament 1609-10, King James Version 1611).

Taking the first and last of these, measuring pointwise mutual information (PMI) in windows of discourse around the word “valour”, we find marked change in the prominent associations. Our approach yields plentiful data, and we are still thinking through the challenges of visualisation. In the slide shown, I have coloured associated terms according to the innermost window in which the cooccurring lemma rises to prominence. Thus red terms occur frequently in the narrowest window around valour (+/-1 words), orange terms in the expanded window (+/-10 words) that might approximate the surrounding sentence, green for +/-50 words (which now forms the default window size in our public interface) and blue for the wide discursive window of +/-100 words. (Many lemmas appear in more than one window, and the list shown for the later period does not reach to some relevant low-frequency items such as “prowess”.)
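
The sketch below shows one simple way to compute PMI for lemmas around a target word at several window sizes; it is a crude approximation for illustration, not the LDNA processor’s actual calculation, and the normalisation is our own assumption.

```python
import math
from collections import Counter

def windowed_pmi(tokens, target, window):
    """Toy pointwise mutual information of `target` with nearby lemmas.

    Co-occurrence is counted within +/- `window` tokens of each occurrence of
    `target`; the normalisation is a rough approximation, for illustration only.
    """
    total = len(tokens)
    freqs = Counter(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for j in range(max(0, i - window), min(total, i + window + 1)):
            if j != i:
                co[tokens[j]] += 1
    slots = freqs[target] * 2 * window   # window positions around the target
    return {
        w: math.log2((joint / slots) / (freqs[w] / total))
        for w, joint in co.items()
    }

# e.g. windowed_pmi(lemmas_1610_11, "valour", window=50), repeated for
# window = 1, 10, 50, 100 to mirror the colour-coding described above.
```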

Slide: prominent associations of “valour” in the 1520-1539 and 1610-1611 subsets, colour-coded by co-occurrence window.

What should be visible is a distinction between the use of “valour” as a synonym of value or worth (prominent in the 1520-1539 subset), and the association with conduct in conflict (dominant in the 1610-1611 dataset). Both senses were part of the Latin root “valeo” and, had King James’ translators ventured it, both could have been played upon to make even more “mighty men of valour” in 1611. (One of the exceptions comes at 2 Kings 15:20, where Menachem taxes all gibbor chayil men, “mighty men of wealth” in the KJV.)

Inevitably, the set of observations I could draw from this investigation is not part of the bottom-up process that LDNA strives to achieve. But the exercise has helped me to think through some different ways we will want to be able to interrogate our data and to study the effects of some different baselines for our expectation calculations. And it demonstrates, I think, the valour of conducting semantic enquiries through discursive windows.

_____

Notes

Thesis quotations are from: I. C. Hine, “Englishing the Bible in early modern Europe: The case of Ruth”, PhD thesis (University of Sheffield, 2014), p. 163. These numbers reflect searches conducted through the Chadwyck EEBO interface using its variant spelling option.

The datasets employed in my thesis are not quite identical to those used by the project: LDNA uses a slightly expanded version of the EEBO-TCP collection (last updated early 2015) with its spelling regularised and tokens lemmatised locally using MorphAdorner.


What does EEBO represent? Part I: sixteenth-century English

Ahead of the 2016 Sixteenth Century Conference, Linguistic DNA Research Associate Iona Hine reflected on the limits of what probing EEBO can teach us about sixteenth century English. This is the first of two posts addressing the common theme “What does EEBO represent?”


The 55 000 transcriptions that form EEBO-TCP are central to LDNA’s endeavour to study concepts and semantic change in early modern English. But do they really represent the “universe of English printed discourse”?

The easy answer is “no”. For several reasons:

As is well documented elsewhere, EEBO is not restricted to English-language texts (cf. e.g. Gadd).  Significant bodies of Latin and French documents printed in Britain have been transcribed, and one can browse through a list of other languages identified using ProQuest’s advanced search functionality. To this extent, EEBO represents more than the “universe of English printed discourse”.

But it also represents a limited "universe". EEBO can only represent what survived to be catalogued. Its full image records represent individual copies. And its transcriptions represent a further subset of the survivals. As the RA currently occupied with reviewing Lost Books (eds. Bruni & Pettegree),* I have a keen awareness of the complex patterns of survival and loss. A prestigious reference work, the must-buy for ambitious libraries, might have a limited print run and yet be almost guaranteed survival–however much it was actively consulted. A popular textbook, priced for individual ownership, would have much higher rates of attrition: dog-eared, out-of-date, disposable. Survival favours certain genres, and there will be gaps in the English EEBO can represent.

The best function of the "universe" tagline is its emphasis on print. We have limited access to the oral cultures of the past, though as Cathy Shrank's current project and the Corpus of English Dialogues demonstrate, there are constructions of orality within EEBO. Equally, where correspondence was set in print, it forms part of EEBO-TCP. There is diversity within EEBO, but it is an artefact that relies on the prior act of printing (and bibliography, microfilm, digitisation, transcription, to be sure). It will never represent what was not printed, which means that particular underprivileged Englishes remain minimally visible.

There is another dimension of representativeness that matters for LDNA. Drawing on techniques from corpus linguistics makes us aware that, traditionally, corpora (collections of texts compiled to support controlled analysis of language-in-use) were assembled with considerable attention to the sampling and weighting of different text types. Those using them could be confident about what was in there (journalism? speech? novels?). Do we need that kind of familiarity to work confidently with EEBO-TCP? The question is great enough to warrant a separate post!

The points raised so far have focused on the whole of EEBO. There is an additional challenge when we consider how well EEBO can represent the sixteenth century. Of the ca. 55 000 texts in EEBO-TCP, only 4826 (less than 10 per cent) represent works printed between 1500 and 1599. If we operate with a broader definition, the 'long sixteenth century', and impose the limits of the Short Title Catalogue, the period 1470-1640 constitutes less than 25 per cent of EEBO-TCP (12 537 works). And some of those will be in Latin and French!
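The document counts above come from catalogue searches, but the same breakdown can be reproduced from any metadata export with an imprint date. A rough sketch, assuming a hypothetical CSV file and column name (the real TCP catalogue files are arranged differently, though the counting logic is the same):

    import csv
    from collections import Counter

    # Hypothetical file and column names, for illustration only.
    PERIODS = {"1500-1599": range(1500, 1600), "1470-1640": range(1470, 1641)}

    counts, total = Counter(), 0
    with open("eebo_tcp_metadata.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            try:
                year = int(row["imprint_year"])
            except (KeyError, ValueError):
                continue                      # skip undated or unparseable records
            total += 1
            for label, span in PERIODS.items():
                if year in span:
                    counts[label] += 1

    for label, n in counts.items():
        print(f"{label}: {n} records ({n / total:.1%} of dated records)")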

Of course, some sixteenth-century items may be long texts–and the bulging document count of the 1640s is down to the transcription of several thousand short pamphlets and tracts–so the true weighting of the long sixteenth century within TCP may be greater than the document figures indicate. Yet the statistics are sufficient to suggest we proceed with caution. While one could legitimately posit that the universe of English discourse was itself smaller in the sixteenth century–given the presence of Latin as the scholarly lingua franca–it is equally the case that the evidence has had longer to go missing.

As a first post on the theme, this only touches the surface of the discussion about representativeness and limits. Other observations jostle for attention. (For example, diachronic analysis of EEBO material is often dependent on metadata that privileges the printing date, though that may be quite different from the date of composition. A sample investigation of translate's associations immediately uncovered a fourteenth-century bible preface printed in the 1550s, exposed by the recurrence of Middle English forms "shulen" and "hadden".) Articulating and exploring what EEBO represents is a task of some complexity. Thank goodness we've another 20 months to achieve it!


* Read the full Linguistic DNA review here. The e-edition of Bruni & Pettegree’s volume became open access in 2018.

Digital Humanities 2016, Kraków

Conference reflections jointly written with Justyna Robinson

Four members of the LDNA team—Marc Alexander, Justyna Robinson, Brian Aitken, and Fraser Dallachy—attended this year's Digital Humanities (DH) conference in Kraków, Poland. With over 800 attendees, the conference was an excellent opportunity to exchange ideas, learn of new areas of potential interest, and network with academics from around the world. The team presented a version of the project's poster at the event (attached to this post), giving an overview of the project and the technical steps taken so far, and introducing the research themes.

Digital methods of textual analysis are an important subject for the DH attendees, and there were several papers outlining approaches and results from such research. One of the most relevant of these for us was the paper by Glenn Roe et al. on identification of re-used text in Eighteenth Century Collections Online (ECCO). After eliminating re-printings of texts, this project used a specially developed tool which found repeated passages, indicating where an author had re-used their own or another’s words. The results are available and searchable on their website. In the same session, a team led by Monica Berti at Leipzig described a method of identifying and labelling fragments of text quoted from ancient Greek authors. These projects represent something like a parallel research track to ours, tracing the history of ideas through replication of passages rather than through more abstract word clusters. Early English Books Online (EEBO) also received some attention, with Daniel James Powell giving an overview of its history and importance to digital research on historical texts.
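The general principle behind such reuse detection can be illustrated very simply: compare overlapping n-word sequences ("shingles") between two texts and keep those they share. The toy sketch below shows that idea only; it is not the ECCO project's tool, and the function names are invented for the example.

    def shingles(tokens, n=8):
        """All n-token sequences ('shingles') in a tokenised text."""
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def shared_passages(text_a, text_b, n=8):
        """Toy reuse detector: n-grams appearing in both texts (a sketch of
        the general idea only, not the ECCO project's method)."""
        a, b = text_a.lower().split(), text_b.lower().split()
        return {" ".join(s) for s in shingles(a, n) & shingles(b, n)}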

Discussion with other attendees at the poster session was especially productive, and resulted in several strong leads for the team to follow up. A subject which was mentioned to us repeatedly was that of topic modelling. Multiple panels were dedicated to the use of these methods to extract information about the contents of texts, an approach which LDNA has considered employing. The team at Saarland studying the Royal Society Corpus (with whom LDNA is already in contact) use topic modelling to study the development of scientific concepts and terminology. Their results were encouraging, allowing them to identify word groupings which represent scientific disciplines such as physiology, mechanical engineering, and metallurgy. Following these topics through time showed that the number of topics increases whilst their vocabulary becomes more specialised. Although LDNA has reservations about how useful topic modelling is for our purposes, the work being conducted at Saarland refines and implements its methodology in a way which we would seek to learn from if we do choose to pursue it further.
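(For readers who have not met topic modelling: the technique infers groups of words that tend to occur in the same documents. The sketch below, using scikit-learn's LDA implementation on a handful of invented miniature documents, shows the shape of the procedure; it is not the Saarland team's pipeline, and a real run would use thousands of documents from a diachronic collection.)

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Invented miniature "corpus" for illustration only.
    docs = [
        "the heart pumps blood through the veins and arteries",
        "nerves muscles and the heart were dissected and described",
        "the engine piston and gears transmit force to the wheel",
        "steam pressure drives the piston and turns the gears",
        "the ore was smelted and the metal alloy was assayed",
        "furnace heat purifies the metal ore into workable alloy",
    ]

    vectoriser = CountVectorizer(stop_words="english")
    dtm = vectoriser.fit_transform(docs)            # document-term matrix

    lda = LatentDirichletAllocation(n_components=3, random_state=0)
    lda.fit(dtm)

    # Print the most heavily weighted words for each inferred topic.
    terms = vectoriser.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = [terms[i] for i in weights.argsort()[-5:][::-1]]
        print(f"topic {k}: {', '.join(top)}")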

Poster

At the poster session

Visualising big data is of central interest to the LDNA project, especially in the context of the upcoming LDNA Visualisation Workshop. With this in mind, we paid particular attention to projects that presented new and interesting ways of seeing large data. A number of presentations focused on network visualisations, often linking metadata, e.g. reconstructing the social networks of royal societies or academies from letter correspondence. An interesting visualisation of unstructured linguistic data came from the EPFL team: Vincent Buntinx, Cyril Bornet, and Frédéric Kaplan visualised lexical usage in 200 years of newspapers on a circle, with the radial dimension representing the number of years a word has been in use and the circumferential dimension showing the period during which it was used. [1]

Stylometrics, with its interest in being able to identify and measure aspects of language which contribute to the impression of authorial style, produced some interesting papers. One of the common themes for stylometrics and other DH strands of research is the way concepts are operationalised. The varied approaches to concepts taken by DH researchers were noticeable: for example, whether each noun can be considered a concept, or whether a concept should be defined as "a functional thing". This suggests that the work on concept identification undertaken by the LDNA team will be of interest to the wider DH community. Also amongst the stylometric papers was a look at historical language change by Maciej Eder and Rafal Górski, which used bootstrap consensus network analysis on part-of-speech (POS) tagged texts to contrast syntax and sentence structure between time periods. The paper used multidimensional scaling (MDS) to reduce POS-tagged texts to a single value which could then be plotted against time, allowing the authors to show a gradual change in the MDS results between the earliest and latest texts. The paper highlighted both how useful a visualisation can be for identifying a change and how difficult it can be to quantify exactly what the visualisation shows.
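(As a rough illustration of that MDS step: given a frequency profile per text, MDS can project each text onto a single coordinate, which can then be plotted against date. The profiles, dates, and distance measure below are invented for the sketch and are not the authors' actual features or implementation.)

    import numpy as np
    from sklearn.manifold import MDS

    # Invented stand-in: each row is a normalised frequency profile
    # (e.g. of POS trigrams) for one text, with a composition date.
    profiles = np.array([
        [0.30, 0.20, 0.25, 0.25],
        [0.28, 0.22, 0.26, 0.24],
        [0.20, 0.30, 0.20, 0.30],
        [0.18, 0.32, 0.18, 0.32],
    ])
    dates = [1500, 1550, 1600, 1650]

    # Reduce each profile to a single coordinate and list it against date.
    mds = MDS(n_components=1, random_state=0)
    coords = mds.fit_transform(profiles).ravel()

    for year, value in sorted(zip(dates, coords)):
        print(year, round(float(value), 3))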

However, on a different but very important note, a strong theme of the conference was that of diversity, with a thread of panels discussing the different ways in which this subject is applicable to the digital humanities. From a personal point of view, I think LDNA has a strong awareness of both the scope and the limitations of our interests and approaches (although we can never afford to be complacent). We've considered what our textual resources represent, and the RAs are soon to explore this subject from different angles in future blog posts. EEBO and other text collections are more expansive, inclusive, and diverse than prior research has been able to access, and this feels like part of an enormously positive movement in academia to open up more and more data for new kinds of study. As extensive as our resources are, however, they still have limitations reflecting the (mostly Western, mostly white, mostly male, mostly middle-to-upper class) societal groups who were able to read, write, and print the words which ended up in these collections. The resources open to academia are continually growing, and hopefully this expanding diversity will open up ever more of the world's knowledge to ever more of its population. Whilst the discussions at this conference have made clear that there is a long way to go in fully embracing diversity in the digital humanities, there are indications that the situation is improving, and it is incumbent upon us all to ensure that this continues.

For another view of the conference, Brian Aitken, Digital Humanities Research Officer at Glasgow, has written about his own experience on his blog.

———

1. Studying Linguistic Changes on 200 Years of Newspapers, Vincent Buntinx, Cyril Bornet, Frédéric Kaplan (EPFL (École polytechnique fédérale de Lausanne), Switzerland)

The Edge

Text Analytics at Sheffield DH Congress

Earlier in the year (2016), we issued a special call for papers, inviting others to join LDNA panel sessions at the Sheffield Digital Humanities Congress. We were delighted by the responses, and further delighted that the full DHC programme includes plenty of other material relevant to our text analytics interests–and a noticeable body of book-historical input too.

As a special privilege for those who follow the LDNA blog, here are two bonus abstracts outlining our conception of each LDNA panel:

TA 1: Between numbers and words

Session 4, Friday 9 September
ft. Hine, Shute, Siirtola et al.

Digitisation of texts facilitates kinds of statistical analysis that were previously difficult and perhaps impossible for humans to carry out. This series of papers explores the interface between statistics and close reading, teasing out how these modes of textual analysis can be applied jointly to examine the material, lexical, and semantic form of constituent texts. We discuss the use of quantitative analysis to reassess hypotheses about the work of compositors in fifteenth-century printing. We scrutinise a blueprint for moving between statistical data and words-in-context within collections too big for human reading (with special attention to concept formation). Lastly, we demonstrate how one newly-enhanced visualisation tool assists exploratory analysis to generate insights about genre and social variables in digital text collections including early modern correspondence and international Englishes.

TA 2: Identifying complex meanings in historical texts

Session 7, Friday 9 September
ft. Mehl, Recchia, Makela, et al.

With recent advances in computational tools and techniques, researchers are moving closer to the goal of identifying and describing complex meanings—semantic, discursive, social, and otherwise—in historical texts. This session approaches that goal from multiple angles. We discuss semantic meaning in terms of distributional semantic techniques, which connect the study of meaning in the humanities with the quantitative study of language in computational linguistics. We discuss discursive meaning via topic modelling techniques, and also explore the theoretical space between distributional semantics and topic modelling. Finally, we discuss social and historical meanings by looking at possibilities for analysing extra-linguistic contexts alongside linguistic data, within carefully annotated, structured data sets.

 

If that’s whet your appetite, you will find full abstracts for each paper–and for every paper in the Congress–on the main DHC site.

Last registration date is 7 September.