Category Archives: Blog Archive

Anthology cover

Talk About Change: LDNA at Festival of the Mind

Last weekend, Linguistic DNA & friends took over the Spiegeltent in Sheffield city centre, as part of the University’s Festival of the Mind. Spiegeltents are a Belgian invention: tents decorated internally with mirrors, creating the perfect space to share myriad reflections.

Over the course of two hours, we hosted a performance of new writing that emerged from collaboration with Our Mel (a Sheffield-based social enterprise dedicated to exploring cultural identity) and novelist Désirée Reynolds. Each of the pieces performed has also been published as part of a limited edition anthology: “Talk About Change: Writing as Resistance”.

The Researchers’ Introduction outlines a little more of the process that culminated in some extraordinary writing (excerpted from the print anthology):


Talk About Change: Writing as Resistance

Funded by the University of Sheffield’s Festival of the Mind, our collaborative workshops used examples of early modern word use (from the Linguistic DNA project and related research) as a starting point to think about language use today. How can the past speak to the present? How might the present speak to the past?

As reflected in the structure of this anthology, the workshops explored four central themes: diversity, feminism, immigration and race. These were selected by Annalisa and Désirée, who also provided the extra focus on “writing as resistance”. In each case, the Linguistic DNA researchers sought to introduce historic material that might prompt conversation about the themes—and perhaps even fuel the resistance. Some input built on prior research (especially for the feminism and immigration sessions, which drew on Iona’s thesis and engaged also with the 500 Reformations project). As often, it was a basic excursion into early modern material, with a beginners’ introduction to linguistics and the study of meaning (courtesy of Seth).

The most inventive work happened when we brought this material into the open sessions.

Together with all who attended the workshops, we compared the role of diversity in historic texts to its position in modern culture: what once characterised a multiplicity of opinion is now used paradoxically of something individual. We considered aspects of feminist debate before the word feminism existed, exploring how the power of virtue changed as men (mostly) discussed the role of women in sixteenth-century England. Using texts about strangers, we examined parallels between the way people wrote (and complained) about early modern outsiders and modern discourse about immigrants. We reflected on the roots of race, its links to kinship, descent, and community, and the relationship between structures of language and structures of power.

In each session, novelist and creative-writing facilitator Désirée Reynolds recommended other writings to bring out different dimensions of the themes. Wide reading was encouraged, and what you will find in the pages that follow reflects the careful crafting of a range of experience and inspiration drawing on at least five centuries of language use.

It is Writing as Resistance.

It comes from Talking About Change.


If you would like a copy of the anthology (free!), you can register interest (first come, first served) by filling out a short Google form.

(You can also read some words from the Editor, over on the 500 Reformations website.)

Linguistic DNA at SRS 2018: Abstracts

Knowledge, truth and expertise: experiments with Early English Books Online

Wondering what Linguistic DNA is bringing to the Society for Renaissance Studies? Here are the abstracts for two panels of papers, and information about our hands-on demonstration session (drop in).

United by a common interest in data-driven approaches to meaning and a focus on the transcribed portions of Early English Books Online (EEBO-TCP), this interdisciplinary panel brings together new research from the Linguistic DNA project and the Cambridge Concept Lab. 


What is EEBO anyway? Contextual study of a universe in print
Iona Hine and Susan Fitzmaurice (University of Sheffield)

Since 2015, the Linguistic DNA team has been developing methods for mapping meaning and change-in-meaning in Early Modern English. Our work begins with the hypothesis that meanings are not equivalent to words, and can be invoked in many different ways. For example, when Early Modern writers discuss processes of democracy, there is no guarantee they will also employ a keyword such as democracy. We adopt a data-driven approach, using measures of frequency and proximity to track associations between words in texts over time. Strong patterns of co-occurrence between words allow us to build groups of words that collectively represent meanings-in-context (textual and historical). We term these groups “discursive concepts”.
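
For readers wanting a concrete sense of what “measures of frequency and proximity” can look like in practice, the toy sketch below counts how often lemmas co-occur within a fixed window and lists the strongest associates of a seed word. It is an illustrative simplification, not the LDNA processor; the function names and window size are our own assumptions.

```python
from collections import Counter, defaultdict

def cooccurrence_counts(docs, window=50):
    """Count how often pairs of lemmas occur within `window` tokens of one another.

    `docs` is an iterable of token lists (one list per text). This is a toy
    proximity measure for illustration only, not the project's processor.
    """
    counts = defaultdict(Counter)
    for tokens in docs:
        for i, w in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + window]:
                if v != w:
                    counts[w][v] += 1
                    counts[v][w] += 1
    return counts

def associates(counts, seed, k=20):
    """Return the k lemmas most frequently found near `seed`."""
    return counts[seed].most_common(k)
```

Grouping words whose associates overlap heavily is one (much simplified) way to imagine how “discursive concepts” might be seeded from such counts.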

The task of modelling discursive concepts in textual data has been absorbing and challenging, both theoretically and practically. Our main dataset, transcriptions of texts from Early English Books Online (EEBO-TCP), contains more than 50 000 texts. These include 9000 single-page broadsheets and 162 volumes that span more than 1000 pages. There are 127 items printed pre-1500, and nearly 7000 from the 1690s. The process of analysis therefore requires us to think carefully about how best to control and report on this variation in data distribution.

One particular question that has arisen affects all who attempt to use EEBO: what is in it? To what extent is its material from pre-1500 similar in kind (genre, immediacy, etc.) to that of the messy 1550s (as the English throne shifted speedily between Edward VI and his siblings), the 1610s (era of Shakespeare and the King James Version), or the 1640s (when Civil War raged)? This paper is a sustained reflection on attempts to find out “What’s in EEBO?”


In the beginning was the word?
EEBO-TCP and another universe of meaning

Seth Mehl (University of Sheffield)

When a new idea is conceived, how does it find expression in language? Between 1450 and 1750, the English lexicon expanded dramatically, and literary scholars, philologists, linguists, and historians have sought to document and demonstrate the paths taken by key social and cultural vocabulary, charting the history of what would become key social and cultural ideas, discourses, and concepts. In such cases, the topic and language for investigation have been intuited on the basis of extended qualitative reading, and the objects of investigation tend to be individual words. With the advent of a searchable database of early modern texts, such intuitions can be tested at scale, and the initial object of inquiry can shift from individual words to relationships between sets of words.

What happens when we invert the traditional process, taking the thousands of texts digitised in EEBO-TCP and applying computational techniques to model language change independent of human intuition? Can such techniques indicate meaningful relationships between key words that human researchers had not intuited or observed? To what extent do observations founded on over 1 billion words of early modern English correspond to and diverge from what scholarly readers have already inferred? Is it possible to identify discourses around key ideas even when the apparently related key words are absent? Combining insights from the Keywords Project with tools developed by the Linguistic DNA project, this paper will explore how concept modelling can be applied to re-examine meaning in early modern texts.


Beyond Power Steering:
re-constituting structures of knowledge in 17th-century texts

John Regan (University of Cambridge)

One of the axioms of the Cambridge Concept Lab is that digital means of enquiry should provide qualitatively new kinds of knowledge, if we are to realise their full value. This is to say that computation should not merely provide ‘power steering for the humanities’, but allow one to discover something different in kind about how knowledge was structured in the past.

Making good on this axiom necessitates judgements on the part of the user of digital technology about how to design one’s modes of address to (for example) natural language data sets such as Early English Books Online-TCP, so that one is not merely adding ‘power steering’ to existing, familiar types of enquiry. It also necessitates making decisions about when to come to rest at results (that is, when to cease enquiry); judgements of where digital data can be said to be producing discrete and unfamiliar forms of knowledge.

This paper will present tentative first signs of what the Cambridge Concept Lab believe are historically-discrete conceptual structures, based on data from the early seventeenth-century portion of EEBO-TCP. Two such structures will be described, one entitled ‘Mutual Dependence’, the other ‘Self-Consistency’. As will be shown, familiar forms of knowledge that are held and expressed in sentences and paragraphs, organised by grammar and understood by readers largely as explicit sense, may be contrasted with this evidence of qualitatively different conceptual structures in the textual record. While this paper does not set out to debunk existing theories of the structuration of knowledge and its transmission in the seventeenth century as have become established through centuries of close reading, it does seek to enrich our understanding of these traditions by attending to conceptual, and not exclusively semantic, thematic or rhetorical, structures.

It appears uncontroversial to assert that concepts are determining with regard to features of language use such as explicit and implicit semantic fields, theme, word order, and syntactic relations at the level of the sentence. Nevertheless, recognising that concepts have lexical and semantic extension is not the same as accepting that the two are identical in kind. This paper’s claims about conceptual structure will be based upon evidence from the early decades of seventeenth-century data from EEBO-TCP.


Our afternoon panel is a little depleted (by ill-health) but features Jose M. Cree (Sheffield) on Neologisms and the English reformation, Lucas van der Deijl (Amsterdam) on The collaborative Dutch translations of Descartes by Jan Hendrik Glazemaker (1620-1682), and a little extra time for discussion.


DROP-IN SESSION

All SRS delegates are very welcome to drop in to our demo workshop, where we will be providing a 10-15-minute introduction to our tools (3:30pm, repeated at 4:30pm) and the opportunity for hands-on experimentation. This is in the Hicks Building, Floor G, room 29. (About 2 minutes’ walk from Jessop West, across the main road and a little uphill. Directions.)

Snapshot from campus map, featuring the Hicks Building.

Talk About Change

In a time when events seem ever and ever out of our control, writing is resistance.
–Our Mel.

In April (2018), Linguistic DNA began collaborating with local social entrepreneurs Our Mel to do some collective thinking about the power of language. This work is funded by the University of Sheffield’s Festival of the Mind and our work together will culminate in a spoken-word performance in the Festival’s Spiegeltent (pictured) this September.

The collaboration also involves 500 Reformations: exploring stories of change, from 1517 to 2018, a University of Sheffield public engagement project headed up by Linguistic DNA researcher Iona Hine.

Together, our goal is “TALK ABOUT CHANGE”.

More specifically, Talk About Change is pursuing conversations about the history and power of language, particularly as experienced by people of colour. The first sessions will incorporate a provocation based on historical research, working through themes including diversity, feminisms, race, and resilience. Talking, sharing, debating, we hope participants will join us and engage in acts of creative resistance—in thought, speech, and writing.

What are we actually doing?

Throughout July and August, novelist and creative writer Désirée Reynolds will be leading a series of workshops, hosted by Our Mel, to discuss words and themes including race, feminisms, and diversity. The July workshops are themed and will each include input from a University of Sheffield researcher. The August workshops continue to explore related ideas, developing creative writing under the common heading “writing is resistance”.

Those who choose may publish their writing in an anthology, and we will also present a collective spoken-word performance (optional!) on Sunday 23 September as part of the Festival of the Mind programme.

Who can participate?

Our Mel issue a collective invite to come along and engage in conversation about “words that affect us every day”. What have they meant, how are they used, and what do they mean to us?

People of all ethnicities are welcome, and embracing one’s heritage is encouraged. Participation is limited to those aged 18 and over.

Visit Our Mel’s website (ourmel.org.uk) for more information about the workshops.


ABOUT THE COLLABORATORS


OUR MEL

Logo for Melanin Fest

Rooted in Yorkshire and based in Sheffield, OUR MEL is a social enterprise dedicated to exploring cultural identity, Black history and what it means to be a person of colour in Britain today. Inspired by two local lasses (Annalisa Toccara & Gabriela Thompson-Menanteaux) on a journey of self-love, Our Mel was born in November 2016 over a pack of caramel biscuits and a cup of tea, Yorkshire of course. Since its birth, Our Mel has grown into a community of people on a mission to support, encourage, teach and build the community through music, film, arts and education. In October 2017, we launched Sheffield’s first collaborative Black History Month festival, MelaninFest, and its sister MelaninFest in London. 1300 people attended 43 events in Sheffield and 5 in London. Our Mel has been at the forefront of creating diversity, inclusion and representation in Sheffield since November 2016, working in collaboration with festivals and organisations both nationally and internationally. ourmel.org.uk  @our__mel


ANNALISA TOCCARA is a Marketer & PR professional, Community Activist & Creative Director. Based in Sheffield and founder of the social enterprise Our Mel, Annalisa launched Sheffield’s first Black History Month Festival, MelaninFest®, in October 2017, which saw a total of 43 events spread across the month in collaboration with over 40 partners; she also launched a sister festival in London. Since then, Annalisa has hosted a number of community events celebrating Black excellence, Black talent and Womanhood. Through her work with Our Mel and previous social justice endeavours, she has developed a passion for arts and culture, having seen first-hand how creative mediums can help shape and create social cohesion within our community. Annalisa also has a BA (Hons) in Biblical Study and Applied Theology with a Diploma in Leadership and is currently studying for her Chartered Marketer status. She is also the Vice-Chair of the BAMER Hub – Sheffield’s Equality Hub Network. ourmel.org.uk  @sparklelikegold


DÉSIRÉE REYNOLDS started her writing career in South London as a freelance journalist for the Jamaica Gleaner and the Village Voice. She has since written film scripts, poetry and short stories. Some of her shorts are published in SABLE E-Mag and various anthologies. Her first novel, “Seduce”, was published by Peepal Tree Press in 2013, to much acclaim. She continues to work as a journalist, teacher, broadcaster and DJ. Désirée is currently working on a collection of short stories, a novel based on the Haitian revolution and her PhD. — “After spending a lot of time, doing lots of things, I’m finally where I’m supposed to be, doing what I’m supposed to do.”
desireereynolds.co.uk  peepaltreepress.com/authors/desiree-reynolds
youtu.be/qkNrQ-HMwLs  peepaltreepress.com/books/closure
@desreereynolds


500 REFORMATIONS

500 REFORMATIONS collaborates with external partners to explore and tell stories of change, from the cultural to the personal. Based at the University of Sheffield, 500 Reformations draws on research from across the Faculty of Arts and Humanities. Activities are united by the theme of reformation, whether writ big (as e.g. churches breaking away from Roman Catholic control in the sixteenth century, ‘the Reformation’) or small (in individual stories of change, development and re-form). 500reformations.group.shef.ac.uk @500Reformations

Translation, Gender, Sexuality: a report from Genealogies of Knowledge 2017

In December 2017, Sheffield MA student Nathaniel Dziura attended part of the Genealogies of Knowledge conference in Manchester. While the LDNA team were exchanging conceptual insights with other data-driven scholars, Nathaniel participated in sessions connected to a different field of interest. He writes:

As a member of the LGBTQ+ community, I am keen to contribute to research on how social factors impact language use, particularly gender and sexuality. As a second-generation Polish immigrant, raised with influence from both Polish and English culture, I am also very interested in the effect cultural background can have on the production of linguistic features.

Next year, I hope to start a PhD focused on this interplay between social and linguistic elements. Schumann (1978) suggested that the degree of ‘acculturation’ influences use of non-standard variants in second language learners. In other words, if the speaker is more immersed in the culture of their second language, they will be more likely to acquire native speaker-like linguistic variation. However, previous studies have not considered how other social factors such as sexuality might affect which features are acquired. This is despite previous studies having shown certain linguistic features to be cross-culturally associated with LGBTQ+ membership. These features include fronted-/s/ (Levon, 2006; Pharao et al., 2014) – colloquially stereotyped as the ‘gay lisp’ – and creaky-voice (Zimman, 2013: 3) – speaking with a low elongated ‘creak’, like a stereotypical ‘valley girl’. LGBTQ+ people do not inherently use these features, but they can play an important part in interaction (Barrett, 2017: 9).

I want to help fill this gap in the research by investigating how sexuality might affect the linguistic variants acquired in English by second language speakers (specifically, Polish migrants to England). I will examine whether the use of these features differs depending on two variables: the level of integration into British culture, and the level of involvement with the LGBTQ+ community.

This was the project I had in mind as I headed to Manchester for the conference. I was rewarded by an excellent thematic session on ‘Translation, Gender, Sexuality’.

I found Przemysław Uściński and Agnieszka Pantuchowicz’s presentations to be pertinent and insightful. Uściński’s talk focused on the pitfalls of approaching Queer Theory in Poland from a ‘Western perspective’. The political environments in Poland and England have differed historically, and continue to do so. Uściński argues that ‘LGBT emancipation’ has not yet occurred in Poland. Critical theorisations of gender are intentionally scarce in Polish academic discourse. The reception of Queer Theory in academia has been comparatively belated, and has sometimes discredited the LGBTQ+ movement. British society has its share of problems with LGBTQ+-phobia. Yet Poland has seen much far-right and religious rejection of the LGBTQ+ community. These groups have dismissed LGBTQ+ identities as ‘Western secular propaganda’ and ‘gender ideology’. So, English translations of concepts within Queer Theory, which are gradually being introduced to Polish academic works, reflect English notions and societal progress. Even when concepts from Queer Theory enter Polish, there is no possibility for their dissemination within Polish society. Queer Theory tends to be viewed as a ‘foreign’ and subversive concept: a theoretical importation into Polish from English, and not one congruous with Polish culture.

In another paper, Pauline Henry-Tierney noted that misinterpretations in translation of Beauvoir’s ‘Mauvaise Foi’ have slowed academic progress on the subject. Taking this into account, perhaps misinterpretations of Queer Theory as a ‘foreign’ concept to Poland are hindering the normalisation of LGBTQ+ concepts and perpetuating their perception as something radical and provocative.

This thematic session highlighted that introducing concepts into a language through translation can be a step towards spreading those ideas within another culture. However, this alone might not be enough to achieve society’s understanding and acceptance of those concepts. The translation of Queer Theory between cultures was not an issue I had previously considered. This thematic session reinforced that the political and social environments in Polish and English culture exhibit stark differences. This is significant within the framework of acculturation: LGBTQ+ community membership is arguably more accepted in British culture, and consequently so are associated non-standard language features. So one might predict that LGBTQ+ Polish migrants to England who become more British-acculturated are more likely to produce non-standard features associated with LGBTQ+-community membership than those who are less British-acculturated.

Overall, I was able to interact with academics from areas such as translation studies and politics with whom I would not otherwise be able to network. I am very grateful to the Linguistic DNA team for inviting me to attend the conference. The insights it has given me will be useful in my academic pursuits!


Featured image:
Jaap Verheul (Utrecht) presents an example from ShiCo research at the Genealogies of Knowledge conference, 8 December. Photo (c) I.C. Hine.


References:

Barrett, R. (2017) From Drag Queens to Leathermen: Language, Gender, and Gay Male Subcultures (Studies in Language Gender and Sexuality) Oxford: Oxford University Press

Henry-Tierney, P. (2017) ‘Translating in ‘Bad Faith’? Articulations of Beauvoir’s ‘Mauvaise Foi’ in English’, Genealogies of Knowledge I: Translating Political and Scientific Thought across Time and Space, Manchester: University of Manchester

Levon, E. (2006) ‘Hearing “Gay”: Prosody, Interpretation, and the Affective Judgments of Men’s Speech’ American Speech 81 (1): 56–78

Pantuchowicz, A. (2017) ‘Translation and the Failure of Gender Mainstreaming in Poland’ Genealogies of Knowledge I: Translating Political and Scientific Thought across Time and Space, Manchester: University of Manchester

Pharao, N., M. Maegaard, J. S. Møller & T. Kristiansen (2014) ‘Indexical meanings of [s] among Copenhagen youth: Social perception of a phonetic variant in different prosodic contexts’ Language in Society 43, 1–31

Schumann, J. H. (1986) ‘Research on the acculturation model for second language acquisition’ Journal of Multilingual and Multicultural Development 7: 379–392

Uściński, P. (2017) ‘Thinking Sexuality/Translating Politics: Queerness in(to) Polish’ Genealogies of Knowledge I: Translating Political and Scientific Thought across Time and Space, Manchester: University of Manchester

Zimman, L. (2013) ‘Hegemonic masculinity and the variability of gay-sounding speech: The perceived sexuality of transgender men’ Journal of Language & Sexuality 2 (1): 1-39


Seth and Iona present a joint paper with LDNA data at Genealogies of Knowledge. Photos (c) Jaap Verheul.

Quantity and quality: lessons from an MA work placement

Sheffield MA student Nadia Filippi reflects on her experience after 100 hours with the Linguistic DNA team at DHI | Sheffield:


As part of my MA studies in English Language and Linguistics, I had the opportunity to undertake a work placement of 100 hours at the University of Sheffield’s Digital Humanities Institute. The placement offered a good overview of the typical tasks and responsibilities of a researcher and was an excellent choice for me because I am interested in doing research and I am considering going on to PhD research.

When registering for the placement module, I only had basic knowledge of corpus linguistics. I was accustomed to qualitative research but wanted to discover quantitative methodologies and the possibilities that quantitative research can offer. Starting my placement, I was at a stage in my studies in which I was still looking for definite answers to all my questions about research. Moreover, I respected everything to do with numbers, but the idea of actually ‘doing statistics’ made me nervous. I consciously chose a placement to force myself out of my qualitative comfort zone.

My concerns resolved themselves during the placement. I had to familiarise myself with and use statistical software packages like SPSS, and lost my initial fear. I began to understand how statistics could be used effectively to address questions and find information that qualitative research could not deliver in a timely manner; for example, finding out which words frequently co-occur in a large dataset. Furthermore, I came to understand that doing research does not exclusively mean narrowly focusing on finding a clear answer to an initial research question. It is often more about refining the question, developing another one and accepting that there can be more than one right answer to it.

The power of the Digital Humanities Institute lies in quantitative analysis, engaging with statistical distribution, auditing datasets and computational methods. Yet, there is still qualitative work to do. For instance, I audited and reported on qualities of the YouTube dataset, wrote summaries of previous research and searched for suitable approaches or tools (e.g. a Part-Of-Speech tagger suited to social media data), by consulting published research from similar projects.

A YouTube Convert

It turned out that the placement as a whole, the experiences I had and the tasks I was given shaped my other studies. At the beginning of my placement, the Linguistic DNA team had just started providing support for the Militarization 2.0 project, in collaboration with the University of Leeds. I was immediately drawn in by this study of YouTube gaming discussion and it ultimately gave me an idea for my MA dissertation.

I had the chance to look through some of the 6.7m YouTube comments gathered by Nick Robinson and his team at the University of Leeds, and think through how they might be analysed for concept modelling.

Screenshot showing comments on Battlefield 1 official trailer, via YouTube (15 May 2017). https://www.youtube.com/watch?v=c7nRTF2SowQ

In exploring the comments, I had to consider the characteristics of commenters’ language and reflect on the research questions. Gaming language, for example, is filled with specialist abbreviations such as “CoD:ww2”, which stands for the game Call of Duty: WWII. Information about nationalities (“the Germans”) and militarised language (“disabled”, “destroyed”) may also be key to answering questions about how users’ remarks connect with video content. Close reading of excerpts helps to inform how the Sheffield team respond to the main interests of the parent project Militarization 2.0: if and how social media is militarized, and what effect that has on society and individual citizens.

By attending meetings, I gained insights into the process and decision-making in a big research project. This included, for example (see the sketch after this list):

  • preparing big data (should we standardise the spelling of the comments or not?)
  • practical obstacles, such as YouTube’s technical limitations (which prevent us from retrieving all the answers to a specific comment)
  • deciding which variables to include (time, author, number of likes)
  • time and scope (how can the resources available be matched to the aims and desired outcomes of a project?)
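
To make the first and third points more tangible, here is a minimal sketch of the kind of preprocessing such comment data might go through: lightly normalising spelling and keeping a few chosen variables. The field names and the tiny abbreviation table are invented for illustration; this is not the project’s actual pipeline.

```python
import re

# A tiny, invented normalisation table; a real project would need a much
# larger, carefully audited resource for gaming slang and abbreviations.
SPELLING_MAP = {"u": "you", "gr8": "great", "ww2": "world war 2"}

def normalise(text):
    """Lower-case a comment and expand a few known non-standard spellings."""
    tokens = re.findall(r"[a-z0-9':]+", text.lower())
    return [SPELLING_MAP.get(tok, tok) for tok in tokens]

def prepare(comment):
    """Keep only the variables chosen for analysis (time, author, likes, text)."""
    return {
        "time": comment["published_at"],
        "author": comment["author"],
        "likes": comment["like_count"],
        "tokens": normalise(comment["text"]),
    }
```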

Knowing the kinds of challenges that such a project can face was helpful in planning my dissertation, which I will be writing over the summer. Prompted by the DHI’s YouTube work, my research will discuss the kind of language generated by exposure to military video game trailers and investigate if there is a difference between the language produced online and offline. In undertaking this research, I will work with my own corpus of YouTube comments as well as with focus groups. The qualitative aspect of my dissertation will allow me to explicitly address and discuss the violence in these game trailers within my focus groups.

Overall, the work placement has been one of the most valuable and enjoyable modules of my MA. I developed many new skills, academically as well as personally. I am more confident about quantitative approaches and numbers, as well as the importance of humanities research as a whole.


Top image shows Sheffield MA student Nadia Filippi at the Linguistic DNA and Militarization 2.0 stand at the 2017 Festival of Arts & Humanities Showcase, Sheffield. The showcase was “a fantastic opportunity to open a dialogue about humanities research and its impact with the public”.

Looking back, looking forward: Linguistic DNA in 2016 and 2017

As we move into 2017, we’ve been looking back at achievements in 2016, and ahead to what we aim to achieve in the coming year.

2016 was an outwardly busy year as we travelled to Bruges, Essen, Krakow, Lausanne, Leeds, Brighton, Murcia, Nottingham, Paris, Saarbrücken, and Utrecht, sharing more of our thinking and early data with different audiences. Closer to “home”, we benefitted from the exchange of ideas with LDNA-hosted panels at Sheffield DH Congress and our second methodological workshop in Sussex. In 2017, we will be focusing back on our interface development and some more in-depth research, though we intend to be present at DH, SHEL, ICAME and SHARP, in order to continue some fruitful conversations.

On the blog, we have been reflecting on representativeness and the nature of EEBO-TCP. We’ve also documented our decision not to use ECCO’s OCR data to analyse eighteenth-century print. You can expect to hear about the alternative eighteenth-century datasets we’re choosing to work with later in 2017.

During the Autumn, the LDNA researchers collaborated on two articles about the project, its theory and praxis, both (hopefully) to be published this year following peer review. Generating examples from each research theme based on our early data and tying these together effectively was an enjoyable challenge, and we have already used the draft of one piece as part of our briefing materials for upcoming MA placements at The Digital Humanities Institute | Sheffield (formerly known as HRI Digital).

In the past six months, the Sheffield team have secured funding for two additional applications of the Linguistic DNA “concept modelling” tools:

  • The ESRC project Ways of Being in a Digital Age combines our quantitative insights with a qualitative literature survey of academic publications. Scheduled to inform the ESRC’s next programme of digital society funding, this impactful study has compelled us toward rapid prototype development. The interface being put together to serve ‘WoBDA’ colleagues will also form the kernel of the subsequent LDNA workbench.
  • From next month, we are involved in another funded impact-related project, collaborating with the University of Leeds to explore the conceptual structure of millions of YouTube video comments on the theme of militarisation, as part of a larger project funded by the Swedish Research Council. This is a six-month commitment, bringing in a further research associate to theorise what’s involved in applying our measures to some very different data.

We also have three significant applications in place for other pots of funding, including Horizon 2020 collaborations, attesting to confidence in our nascent processes and the multifarious opportunities for their application and impact.

Meanwhile, Glasgow has been using the present word co-occurrence data to develop its methodology for investigating processor data from the perspective of key Historical Thesaurus categories. We have continued to develop analysis of Thesaurus categories, looking for those which show abnormal instances of growth or decline; a provisional methodology for establishing statistical ‘baselines’ has been plotted out and is now being implemented and refined. Further possibilities are being tested, such as amalgamating data across whole layers of the HT hierarchy rather than by individual category, and the effects of separating out part of speech within categories or layers.

Lost Books

On “Lost Books” (ed. Bruni & Pettegree)

Review: Lost Books: Reconstructing the Print World of Pre-Industrial Europe. Ed. Flavia Bruni and Andrew Pettegree. Library of the Written Word 46 / The Handpress World 34. Leiden & Boston: Brill, 2016. 523 pages.


We solicited this book for review because we have been keenly aware that we cannot take what has been transcribed and preserved through the digitisation processes of Early English Books Online and the Text Creation Partnership as an accurate indication of all the material that was printed in the early modern period. Setting aside the idiosyncrasies of selectivity in the composition of EEBO-TCP, which have been documented elsewhere, there is a prior ‘selectivity’ about what survived to be catalogued.

The volume collects together the proceedings of the Lost Books conference held at the University of St Andrews in June 2014, and divisions within the volume loosely reflect those of the original call.

Pettegree’s introduction, “The Legion of the Lost” is a full-length essay discussing not only how books become lost but how one can know about what has been lost. It is accessible and engaging and would be a worthy reading assignment for undergraduates or masters students studying book history. As observed in a prior blogpost, “While the chapter … performs the function of uniting what follows, and does at times point to specific contents in the coming chapters, there is nothing of the clunkiness that one sometimes observes in the introduction of an edited collection.”

The two essays that follow both approach the challenge of assessing the loss of incunabula, i.e. print materials from pre-1500. Falk Eisermann begins with a comparison of the listings in the Gesamtkatalog der Wiegendrucke with the Incunabula Short Title Catalogue. He probes possible methods for distinguishing items that were printed (and lost) from items never printed, giving examples from archival sources that defy expectation: “lost editions by unknown printers (sometimes located in incunabulistic ghost towns), containing texts not preserved anywhere else, even representing works of hitherto unrecorded authors” (43). The book historians’ task, one may imagine, is an uphill struggle; optimistically, there is fresh work to be done as no one has yet analysed the customary discussion of other printed works in paratext “with regard to dark matter” (50). Jonathan Green and Frank McIntyre (Chapter 3) aim to quantify the losses, offering an open discussion of the pitfalls of particular statistical approaches to this question. They recommend modelling the counts of surviving copies as a negative binomial distribution, accommodating correlation in loss and survival. For—and this is significant to LDNA—“books are not preserved or destroyed independently of each other” (59). Small items are more likely to survive if bound together; volumes in a library often share a common destiny. In addition, taste is a cultural construct with ideas of fashion and significance affecting more than one owner’s decision to dispose of or conserve. Taking into account variations of format, Green and McIntyre suggest that as much as 30 per cent of quarto editions may have been lost entirely, compared with 60 per cent of broadsides and 15 per cent of folios.
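
As a rough illustration of the logic (not of Green and McIntyre’s actual model or figures), the snippet below shows how a fitted negative binomial distribution implies a share of editions with zero surviving copies, i.e. editions lost entirely. The parameter values are invented.

```python
from scipy.stats import nbinom

# Hypothetical fitted parameters for the number of surviving copies per edition:
# r (shape parameter) and p (success probability). These are made-up values.
r, p = 0.5, 0.3

# Probability that an edition has zero surviving copies, which under this
# (invented) fit is the estimated share of editions lost entirely.
estimated_loss_rate = nbinom.pmf(0, r, p)
print(f"Implied share of wholly lost editions: {estimated_loss_rate:.0%}")
```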

Part 2 is composed of national case studies covering vocal scores from Renaissance Spain (Chapter 4, showing a markedly persistent repertoire conserved by copying when required); evidence of book ownership and circulation in pre-Reformation Scandinavia (Chapter 5, conducted with the help of inventories); the meticulous reconstruction of a lost Polish original on the basis of later editions (Chapter 6, touching also on the circulation of fortune-telling books throughout early modern Europe); a study of the Stationers’ Company Register (Chapter 7); a sheet-count-based model for calculating loss of seventeenth-century materials based on records for the Southern Netherlands—using the metadata-rich STCV, which also positions title-page engraving and roman typeface as features positively correlated with survival (Chapter 8); the identification of patterns of loss using book advertisements from the Dutch Republic (Chapter 9—exposing partly the proliferation of multiple localised editions); and a report weaving together a census of seventeenth-century Sicilian printing activity with a legal dispute over the library of Francesco Branciforti, attesting strong local attachment to this private collection (Chapter 10).

In Part 3, Christine Bénévent and Malcolm Walsby revisit the publication history of Guillaume Budé’s apophthegms (Chapter 11), combining careful study of the layout to demonstrate that Gazeau’s compositor pretended to a new edition by replacing the first quire, with a call not to dismiss the “intellectual value” of later editions, noting that the Paris copy of De L’Institution du Prince had the highest survival rate and was owned “by the most influential and powerful in early modern Europe” (including Edward VI, 252). Michele Camaioni aims to reconstruct a censored (but popular) mystical text using its censorship record (Chapter 12). Three further chapters draw on data from the RICI project, a study of Italian religious orders’ book ownership based on a Vatican-led census: Rosa Marisa Borraccini documents Girolamo de Palermo’s “unknown best-seller”, a devotional work running to “plausibly . . . more than one hundred” editions (Chapter 13); Roberto Rusconi probes weaknesses in the cataloguing, involving misspelt transcriptions, inadequate shorthand (opera omnia, etc.) and perhaps the deliberate disguising of works by disapproved authors (Chapter 14); and Giovanni Granata attempts to merge statistical extrapolation of lost works with study of specific lost editions based on the bibliographic records produced by the census (Chapter 15).

Part 4 is dedicated to lost libraries. Anna Giulia Cavagna observes the motives of Alfonso del Carretto, an exiled monarch whose self-catalogued collection prioritised texts pertaining (mostly through paratext such as dedications) to people whose powerful patronage he wished to secure, revealing books as “vectors of social relations” (357, Chapter 16). Martine Julia van Ittersum pursues the preservation and loss of Hugo Grotius’ personal collections, observing that preservation required “neglect, though not too much of it” (384) and that the preservation of printed materials was correlated with loss of manuscript (Chapter 17). Federico Cesi, the target of Maria Teresa Biagetti’s study (Chapter 18), was the founder of the Accademia dei Lincei in Rome; his now dispersed collection included works of botany, zoology, alchemy, and medical texts, its components known through correspondence and post mortem inventory. Sir Hans Sloane’s collections, including printed books “estimated at about 45,000 volumes”, formed the kernel of what is now the British Library; Alison Walker explains the difficulties of tracing Sloane’s books, which when duplicated by other collections were often dispersed through sale or gifting, or migrated at the creation of new specialist institutions such as the Natural History Museum. By reconstructing the collection, Walker argues, one may attain a “reflection . . . of the intellectual environment of the day” and of “Sloane himself as a scientist and physician” (412, Chapter 19). The last chapter in Part 4 outlines the hopes of the AHRC research network ‘Community Libraries: Connecting Readers in the Atlantic World’, using a case study from Wigtown (NW Scotland) to show how archival resources about the creation and use of libraries yield insight into sociability (Chapter 20); we find widows borrowing while patrons gain more from the bureaucracy and facilitation than the library’s holdings.

The last section (Part 5), entitled “War and Peace”, considers the woes that have befallen historic collections in more recent times. Jan Alessandrini discusses Hamburg’s collections, protection measures during the Second World War, the seizure of private Jewish libraries, and the political challenges of reconstruction (with some prospect of help from Russian digitisation, Chapter 21). Tomasz Nastulczyk acknowledges that “Swedish pillaging paradoxically helped to preserve” books from the Polish-Lithuanian Commonwealth that might otherwise have been lost (462, Chapter 22). Co-editor Flavia Bruni writes of the successful preservation of Italian archives and libraries aided by “a clear and centralised policy” in WW2, arguing that “international agreements” are also essential if cultural heritage is to be preserved (484, Chapter 23). The closing chapter is devoted to broadsheet ordinances, lost—or perhaps missing—as a result of the collapse of Cologne city archives in 2009; happily, microfilm means all is not lost, and Saskia Limbach also successfully traces invoices and other evidence of print activity through a range of archival sources (Chapter 24).

It will be evident from this account that the case studies are drawn from across Europe, with three chapters directly addressing British material. Of these, Alexandra Hill’s intersects most closely with the period Linguistic DNA has focused on so far, with the Register containing “with some exceptions [e.g. government publications and school books], . . . all the books authorised to be printed during the Elizabethan, Jacobean and early Caroline periods” (144–5). Comparing this information with the English Short Title Catalogue, Hill shows that for the 1590s, the survival rate of fiction and ballads is significantly lower than that of other genres of publication; in addition, within a relatively well-preserved domain such as religious literature, subcategories may fare disproportionately badly, as is the case for prayer books, destroyed—Hill hypothesises—by continual use. These kinds of absences need to be borne in mind as we proceed to analyse the survivors. Of course, given the cultural traffic of early modern Europe, much of what is learned from non-British collections is also relevant for thinking critically about how texts survived, how others were lost, and how Linguistic DNA should correspondingly limit the claims built on the print discourse of EEBO-TCP.


As of summer 2018, Lost Books is now open access, and freely available online for all to read.

The Edge

LDNA at Digital Humanities Congress 2016, Sheffield

LDNA organised two panels at the 2016 Digital Humanities Congress (DHC; Sheffield, 8th-10th September). Both focused on text analytics, with the first adopting the theme ‘Between numbers and words’, and the second ‘Identifying complex meanings in historical texts’. Fraser reports:



What does EEBO represent? Part II: Corpus linguistics and representativeness

What exactly does EEBO represent? Is it representative?

Often, the question of whether a corpus or data set is representative is answered first by describing what the corpus does and does not contain. What does EEBO contain? As Iona Hine has explained here, EEBO contains Early Modern English, but it is much larger than that in some ways, and also much more limited than that. EEBO contains many languages other than English, which were printed in the British Isles (and beyond) between 1476 and 1700. But EEBO is also limited: it contains only print, whereas Early Modern English was also hand-written and spoken, across a large number of varieties.

Given that EEBO contains Early Modern print, does EEBO represent Early Modern print? In order to address this question meaningfully, it’s crucial first to define representativeness.

In corpus linguistics, as in other data sciences and in statistics, representativeness is a relationship that holds between a sample and a population. A sample represents a larger population if the sample was obtained rigorously and systematically in relation to a well-defined population. If the sample is not representative in this way, it is an arbitrary sample or a convenience sample – i.e. it was not obtained rigorously and systematically in relation to a well-defined population. Representativeness allows us to examine the sample and then draw conclusions about the population. This is a fundamental element of inferential statistics, which is used in data science from epidemiology to corpus linguistics.

Was EEBO sampled systematically and rigorously in relation to a well-defined population? Not at all. EEBO was sampled arbitrarily, by convenience – first, including only texts that have (arbitrarily) survived; then including texts that were (arbitrarily) available for scanning and transcription; and, finally, including those texts that were (arbitrarily) of interest to scholars involved with EEBO at the time. Could we, perhaps, argue that EEBO represents Early Modern print that survived until the 21st century, was available for scanning and transcription, and (in many cases) was of interest to scholars involved with the project at the time? I think we would have to concede that EEBO wasn’t sampled systematically and rigorously in relation to that definition, and that the arbitrary elements of that population render it ill-defined.

So, what does EEBO represent? Nothing at all.

It’s difficult, therefore, to test research questions using inferential statistics. For example, we might be interested in asking: Do preferences for the near-synonyms civil, public, and civic change over time in Early Modern print? We can pursue such a question in a straightforward way, looking at frequencies of each word over time, in context, to see if there are changes in use, with each word rising or falling in frequency. In fact, we can quite reliably discern what happens to these preferences within EEBO. But our question, as stated, was about Early Modern print. It is the quantitative step from the sample (EEBO) to the population (Early Modern print) that is problematic. Suppose that we do find a shifting preference for each of these words over time. Because EEBO doesn’t represent the population of Early Modern print in any clear way, we can’t rely on statistics to conclude that this is in fact a correlation between preferences and time – or whether it is, instead, an artefact of the arbitrariness of the sampling. The observation might be due to any number of textual or sociolinguistic variables that were left undefined in our arbitrary sample – including variation in topics, or genres, or authorial style, or even authors’ gender, age, education, or geographic profile.
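
A minimal sketch of the “straightforward” step described above, counting relative frequencies of civil, public and civic per decade within the sample, might look like the following. The (year, tokens) corpus structure is assumed for illustration; it is not an actual EEBO-TCP interface.

```python
from collections import Counter

def decade_frequencies(corpus, words=("civil", "public", "civic")):
    """Relative frequency (per million tokens) of each word, by decade.

    `corpus` is an iterable of (year, tokens) pairs; purely illustrative.
    """
    totals = Counter()
    hits = {w: Counter() for w in words}
    for year, tokens in corpus:
        decade = (year // 10) * 10
        totals[decade] += len(tokens)
        counts = Counter(tokens)
        for w in words:
            hits[w][decade] += counts[w]
    return {
        w: {d: hits[w][d] * 1_000_000 / totals[d] for d in sorted(totals)}
        for w in words
    }
```

Such figures describe the sample itself; the argument in this post is precisely that extending them to Early Modern print requires more than the arithmetic.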

It is as though we were testing children’s medication on an arbitrary group of people who happened to be walking past the hospital on a given day. That’s clearly a problem. We want to be sure that children’s medication was tested on children – but not simply children, because we also want to be sure that it isn’t tested on children arbitrarily sampled, for example, from an elite after-school athletics programme for 9-year-olds that happens to be adjacent to the hospital. We want the medication to be tested on a systematic cross-section of children, or on a group of children that we know is composed of more and less healthy kids across a defined age range, so that we can draw conclusions about all children, based on our sample. If we use a statistical analysis of EEBO (an arbitrary sample) to draw conclusions about Early Modern print (a population), it’s as though we’re using an arbitrary sample of available kids to prove that a medication is safe for the population of all kids. (Linguistics is a lot safer than epidemiology.)

If one were interested in reliably representing extant Early Modern print, one might design a representative sample in various ways. It would be possible to systematically identify genres or topics or even text lengths, and ensure that all were sampled. If we took on such a project, we might want to ensure sampling all genders, education levels, and so on (indeed, historical English corpora such as the Corpus of English Dialogues, or ARCHER, are systematically sampled in clear ways). We would need to take decisions about proportionality – if we’re interested in comparing the writing of men and women, for example, we might want large, equal samples of each group. But if we wanted proportional representation across the entire population of writers, we might include a majority of men, with a small proportion of women – reflecting the bias in Early Modern publishing. Or, we might go further and attempt to represent not the bias in Early Modern publication, but instead the bias in Early Modern reception, attempting to represent how many readers actually read women’s works compared to men’s works (though such metadata isn’t readily available, and obtaining it would be a project in itself). Each of these decisions might be appropriate for different purposes.
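
As a sketch of what “sampled systematically” could mean in practice, the function below draws either a fixed number or a proportional number of texts from each stratum (genre, author gender, and so on). The metadata fields and record structure are hypothetical.

```python
import random

def stratified_sample(records, stratum_key, per_stratum=None, fraction=None):
    """Draw texts stratum by stratum: equal-sized samples (per_stratum)
    or proportional ones (fraction). `records` and `stratum_key` are
    hypothetical metadata structures, for illustration only."""
    strata = {}
    for rec in records:
        strata.setdefault(stratum_key(rec), []).append(rec)
    sample = []
    for group in strata.values():
        n = per_stratum if per_stratum is not None else round(len(group) * fraction)
        sample.extend(random.sample(group, min(n, len(group))))
    return sample

# Equal samples of men's and women's writing:
#   stratified_sample(texts, lambda t: t["author_gender"], per_stratum=200)
# Proportional representation of the same population:
#   stratified_sample(texts, lambda t: t["author_gender"], fraction=0.05)
```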

So, what are we to do? LDNA hasn’t thrown stats out the window, nor have we thrown EEBO out the window. But we are careful to remember that our statistics are describing EEBO rather than indicating conclusions about a broader population. And we haven’t stopped there – we will draw conclusions about Early Modern print, but not via statistics, and not simply via the sample that is EEBO. Instead, we will draw such conclusions as close readers, linguists, philologists, and historians. We will use qualitative tools and historical, social, cultural, political, and economic insights about Early Modern history, in systematic and rigorous ways. Our intention is to read texts and contexts, and to evaluate those contexts in relation to our own knowledge about history, society, and culture. In other words, we are taking a principled interpretive leap from EEBO to Early Modern print. That leap is necessary, because there’s no inherent representative connection between the two.

Under the surface: SHARP, LDNA and sundry sources

This blog post excerpts material Iona wrote reflecting back on her contribution to the SHARP conference in Paris in July 2016, building on the work of her PhD thesis and incorporating material and processes that have formed part of the Linguistic DNA project. The full post can be found on Iona’s personal blog.


In preparation for the paper, I dedicated time to manually extract, compile and refine measurements for some of the early outputs from the LDNA processor. To fit in with the pledges of my abstract, I targeted the associations of valour and valiant in subsets of EEBO-TCP.

During my PhD, I used EEBO-TCP to provide context for my work with early modern bibles. Valour entered the equation as I examined trends in the translation of a Hebrew collocation gibbor chayil. In the King James Version (publ. 1611) most gibbor chayil men are “mighty . . . of valour”. The repetition of this phrase across the translation means that English bible readers could form associations between the group of characters referred to, in a similar manner to those who encounter the Hebrew narrative directly. For this to happen in translation shows that the translators recognised and (sometimes) prioritised the transmission of this connection; in this respect “mighty of valour” is a partial example of a larger trend in favour of a more technical approach to translation, a move likely influenced by the increasing use of precise cross-referencing in bible reading (facilitated by the introduction of verse numbers throughout the Bible, an innovation of the 1550s). Yet the phrase is intrinsically interesting because before that “valour” was not part of the English biblical lexicon.

Collating instances of gibbor chayil demonstrates that the lexically related “valiant” was used in earlier translations, but in a piecemeal manner (illustrated by the changing distribution of black square bullets in the diagram below).


This diagram, extracted from my SHARP presentation, is one of a series colour-coded to highlight consistency within individual versions with a focus on the characterisation of Boaz. The black square bullets are added to highlight where a form of ‘valiant’ (or for KJ ‘valour’) was used.

By exploring the words valiant and valour with the LDNA tools, I was able to corroborate the impression I had formed during my earlier quantitative and qualitative analysis, which was conducted via a standard EEBO-TCP interface.

The PhD bit

Searching hits in the population for the first century of English print (to 1570) and comparing that with the next half century (a collection of documents three times the size), I had observed that the frequency of both valiant and valour increased markedly above expectation.


Comparison of word frequency (hits) and distribution (records, hits per record) in EEBO-TCP for 1473-1570 (P1) and 1571-1620 (P2) expressed in ratios.
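
The “expectation” here is simply a size-adjusted baseline: if the 1571-1620 collection is roughly three times as large, one would expect roughly three times as many hits, other things being equal. With invented numbers:

```python
p1_hits, p1_tokens = 1_000, 30_000_000   # hypothetical counts for 1473-1570
p2_tokens = 3 * p1_tokens                # the 1571-1620 collection is ~3x larger
expected_p2_hits = p1_hits * p2_tokens / p1_tokens   # = 3,000
# An observed count well above this baseline is what "above expectation" means.
```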

Scrutinising the data by decade exposed some significant textual influences. To quote from my thesis:

87 per cent of occurrences of “valiant” in the corpus for 1520-1529 (316 of a total 363) appear in a two-volume translation of the French chronicles of Froissart, while two other translated works account for a further 9 per cent; just 4 per cent of hits occur in ‘indigenous’ texts.

For “valour”,

a jump in the decade 1570-1579 is significantly related to the publication in 1579 of a translation from Italian: 403 of the decade’s 501 hits appear in a one-volume translation of The historie of Guicciardin conteining the vvarres of Italie and other partes (London, 1559). Once such scrutiny is imposed, it becomes evident that translation had a significant role in the increased currency of these two Latinate terms. It is also evident that the words normally appear in certain genres: conduct books concerned with warfare and chivalric behaviour; and chronicles of past history. This contributes to the recognisable sense of valour as “The quality of mind which enables a person to face danger with boldness or firmness; courage or bravery, esp. as shown in warfare or conflict; valiancy, prowess.”[ OED s.v. “valour|valor, n.”, §1c.] This sense, cultivated through translation in the course of the sixteenth-century, fits the context in which King James’ translators employ the word.

The LDNA bit

The subsets of EEBO-TCP sent through the LDNA processor earlier in the year were intentionally compatible with the periodisation of my thesis, providing windows onto English discourse that could be cross-referenced with the publication of particular bibles. The subsets thus incorporate all transcribed material from EEBO (TCP update 2015) known to have been printed during the following spans:

  • 1520-1539 (cf. Coverdale Bible 1535, Matthew Bible 1537, Great Bible 1539)
  • 1550-1559 (Geneva Bible 1560, Bishops Bible 1568); and
  • 1610-1611 (Douai Old Testament 1609-10, King James Version 1611).

Taking the first and last of these, measuring pointwise mutual information (PMI) in windows of discourse around the word “valour”, we find marked change in the prominent associations. Our approach yields plentiful data, and we are still thinking through the challenges of visualisation. In the slide shown, I have coloured associated terms according to the innermost window in which the cooccurring lemma rises to prominence. Thus red terms occur frequently in the narrowest window around valour (+/-1 words), orange terms in the expanded window (+/-10 words) that might approximate the surrounding sentence, green for +/-50 words (which now forms the default window size in our public interface) and blue for the wide discursive window of +/-100 words. (Many lemmas appear in more than one window, and the list shown for the later period does not reach to some relevant low-frequency items such as “prowess”.)
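
The sketch below shows one simple way to compute PMI for lemmas around a target word at several window sizes; it is a crude approximation for illustration, not the LDNA processor’s actual calculation, and the normalisation is our own assumption.

```python
import math
from collections import Counter

def windowed_pmi(tokens, target, window):
    """Toy pointwise mutual information of `target` with nearby lemmas.

    Co-occurrence is counted within +/- `window` tokens of each occurrence of
    `target`; the normalisation is a rough approximation, for illustration only.
    """
    total = len(tokens)
    freqs = Counter(tokens)
    co = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        for j in range(max(0, i - window), min(total, i + window + 1)):
            if j != i:
                co[tokens[j]] += 1
    slots = freqs[target] * 2 * window   # window positions around the target
    return {
        w: math.log2((joint / slots) / (freqs[w] / total))
        for w, joint in co.items()
    }

# e.g. windowed_pmi(lemmas_1610_11, "valour", window=50), repeated for
# window = 1, 10, 50, 100 to mirror the colour-coding described above.
```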

Slide: prominent associations of “valour” in the 1520-1539 and 1610-1611 subsets, colour-coded by co-occurrence window.

What should be visible is a distinction between the use of “valour” as a synonym of value or worth (prominent in the 1520-1539 subset), and the association with conduct in conflict (dominant in the 1610-1611 dataset). Both senses were part of the Latin root “valeo” and, had King James’ translators ventured it, both could have been played upon to make even more “mighty men of valour” in 1611. (One of the exceptions comes at 2 Kings 15:20, where Menachem taxes all gibbor chayil men, “mighty men of wealth” in the KJV.)

Inevitably, the set of observations I could draw from this investigation is not part of the bottom-up process that LDNA strives to achieve. But the exercise has helped me to think through some different ways we will want to be able to interrogate our data and to study the effects of some different baselines for our expectation calculations. And it demonstrates, I think, the valour of conducting semantic enquiries through discursive windows.

_____

Notes

Thesis quotations are from: I. C. Hine, “Englishing the Bible in early modern Europe: The case of Ruth”, PhD thesis (University of Sheffield, 2014), p. 163. These numbers reflect searches conducted through the Chadwyck EEBO interface using its variant spelling option.

The datasets employed in my thesis are not quite identical to those used by the project: LDNA uses a slightly expanded version of the EEBO-TCP collection (last updated early 2015) with its spelling regularised and tokens lemmatised locally using MorphAdorner.


What does EEBO represent? Part I: sixteenth-century English

Ahead of the 2016 Sixteenth Century Conference, Linguistic DNA Research Associate Iona Hine reflected on the limits of what probing EEBO can teach us about sixteenth century English. This is the first of two posts addressing the common theme “What does EEBO represent?”


The 55 000 transcriptions that form EEBO-TCP are central to LDNA’s endeavour to study concepts and semantic change in early modern English. But do they really represent the “universe of English printed discourse”?

The easy answer is “no”. For several reasons:

As is well documented elsewhere, EEBO is not restricted to English-language texts (cf. e.g. Gadd).  Significant bodies of Latin and French documents printed in Britain have been transcribed, and one can browse through a list of other languages identified using ProQuest’s advanced search functionality. To this extent, EEBO represents more than the “universe of English printed discourse”.

But it also represents a limited "universe". EEBO can only represent what survived to be catalogued. Its full image records represent individual copies. And its transcriptions represent a further subset of the survivals. As the RA currently occupied with reviewing Lost Books (eds. Bruni & Pettegree),* I have a keen awareness of the complex patterns of survival and loss. A prestigious reference work, the must-buy for ambitious libraries, might have a limited print run and yet be almost guaranteed survival–however much it was actively consulted. A popular textbook, priced for individual ownership, would have much higher rates of attrition: dog-eared, out-of-date, disposable. Survival favours certain genres, and there will be gaps in the English EEBO can represent.

The best function of the "universe" tagline is its emphasis on print. We have limited access to the oral cultures of the past, though as Cathy Shrank's current project and the Corpus of English Dialogues demonstrate, there are constructions of orality within EEBO. Equally, where correspondence was set in print, it forms part of EEBO-TCP. There is diversity within EEBO, but it is an artefact that relies on the prior act of printing (and bibliography, microfilm, digitisation, transcription, to be sure). It will never represent what was not printed, which means that particular underprivileged Englishes remain minimally visible.

There is another dimension of representativeness that matters for LDNA. Drawing on techniques from corpus linguistics makes us aware that, traditionally, corpora (collections of texts compiled to support controlled analysis of language-in-use) were assembled with considerable attention to the sampling and weighting of different text types. Those using them could be confident about what was in there (journalism? speech? novels?). Do we need that kind of familiarity to work confidently with EEBO-TCP? The question is great enough to warrant a separate post!

The points raised so far have focused on the whole of EEBO. There is an additional challenge when we consider how well EEBO can represent the sixteenth century. Of the ca. 55 000 texts in EEBO-TCP, only 4826 (less than 10 per cent) represent works printed between 1500 and 1599. If we operate with a broader definition, the 'long sixteenth century', and impose the limits of the Short Title Catalogue, the period 1470-1640 constitutes less than 25 per cent of EEBO-TCP (12 537 works). And some of those will be in Latin and French!
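The document counts above come from catalogue searches, but the same breakdown can be reproduced from any metadata export with an imprint date. A rough sketch, assuming a hypothetical CSV file and column name (the real TCP catalogue files are arranged differently, though the counting logic is the same):

    import csv
    from collections import Counter

    # Hypothetical file and column names, for illustration only.
    PERIODS = {"1500-1599": range(1500, 1600), "1470-1640": range(1470, 1641)}

    counts, total = Counter(), 0
    with open("eebo_tcp_metadata.csv", newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            try:
                year = int(row["imprint_year"])
            except (KeyError, ValueError):
                continue                      # skip undated or unparseable records
            total += 1
            for label, span in PERIODS.items():
                if year in span:
                    counts[label] += 1

    for label, n in counts.items():
        print(f"{label}: {n} records ({n / total:.1%} of dated records)")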

Of course, some sixteenth-century items may be long texts–and the bulging document count of the 1640s is down to the transcription of several thousand short pamphlets and tracts–so the true weighting of the long sixteenth century within TCP may be greater than the document figures indicate. Yet the statistics are sufficient to suggest we proceed with caution. While one could legitimately posit that the universe of English discourse was itself smaller in the sixteenth century–given the presence of Latin as the scholarly lingua franca–it is equally the case that the evidence has had longer to go missing.

As a first post on the theme, this only touches the surface of the discussion about representativeness and limits. Other observations jostle for attention. (For example, diachronic analysis of EEBO material is often dependent on metadata that privileges the printing date, though that may be quite different from the date of composition. A sample investigation of translate's associations immediately uncovered a fourteenth-century bible preface printed in the 1550s, exposed by the recurrence of Middle English forms "shulen" and "hadden".) Articulating and exploring what EEBO represents is a task of some complexity. Thank goodness we've another 20 months to achieve it!


* Read the full Linguistic DNA review here. The e-edition of Bruni & Pettegree’s volume became open access in 2018.

Digital Humanities 2016, Kraków

Conference reflections jointly written with Justyna Robinson

Four members of the LDNA team—Marc Alexander, Justyna Robinson, Brian Aitken, and Fraser Dallachy—attended this year's Digital Humanities (DH) conference in Kraków, Poland. With over 800 attendees, the conference was an excellent opportunity to exchange ideas, learn of new areas of potential interest, and network with academics from around the world. The team presented a version of the project's poster at the event (attached to this post), giving an overview of the project and the technical steps taken so far, and introducing the research themes.

Digital methods of textual analysis are an important subject for the DH attendees, and there were several papers outlining approaches and results from such research. One of the most relevant of these for us was the paper by Glenn Roe et al. on identification of re-used text in Eighteenth Century Collections Online (ECCO). After eliminating re-printings of texts, this project used a specially developed tool which found repeated passages, indicating where an author had re-used their own or another’s words. The results are available and searchable on their website. In the same session, a team led by Monica Berti at Leipzig described a method of identifying and labelling fragments of text quoted from ancient Greek authors. These projects represent something like a parallel research track to ours, tracing the history of ideas through replication of passages rather than through more abstract word clusters. Early English Books Online (EEBO) also received some attention, with Daniel James Powell giving an overview of its history and importance to digital research on historical texts.
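The general principle behind such reuse detection can be illustrated very simply: compare overlapping n-word sequences ("shingles") between two texts and keep those they share. The toy sketch below shows that idea only; it is not the ECCO project's tool, and the function names are invented for the example.

    def shingles(tokens, n=8):
        """All n-token sequences ('shingles') in a tokenised text."""
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    def shared_passages(text_a, text_b, n=8):
        """Toy reuse detector: n-grams appearing in both texts (a sketch of
        the general idea only, not the ECCO project's method)."""
        a, b = text_a.lower().split(), text_b.lower().split()
        return {" ".join(s) for s in shingles(a, n) & shingles(b, n)}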

Discussion with other attendees at the poster session was especially productive, and resulted in several strong leads for the team to follow up. A subject which was mentioned to us repeatedly was that of topic modelling. Multiple panels were dedicated to the use of these methods to extract information about the contents of texts, an approach which LDNA has considered employing. The team at Saarland studying the Royal Society Corpus (with whom LDNA is already in contact) use topic modelling to study the development of scientific concepts and terminology. Their results were encouraging, allowing them to identify word groupings which represent scientific disciplines such as physiology, mechanical engineering, and metallurgy. Following these topics through time showed that the number of topics increases whilst their vocabulary becomes more specialised. Although LDNA has reservations about how useful topic modelling is for our purposes, the work being conducted at Saarland refines and implements its methodology in a way which we would seek to learn from if we do choose to pursue it further.
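(For readers who have not met topic modelling: the technique infers groups of words that tend to occur in the same documents. The sketch below, using scikit-learn's LDA implementation on a handful of invented miniature documents, shows the shape of the procedure; it is not the Saarland team's pipeline, and a real run would use thousands of documents from a diachronic collection.)

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Invented miniature "corpus" for illustration only.
    docs = [
        "the heart pumps blood through the veins and arteries",
        "nerves muscles and the heart were dissected and described",
        "the engine piston and gears transmit force to the wheel",
        "steam pressure drives the piston and turns the gears",
        "the ore was smelted and the metal alloy was assayed",
        "furnace heat purifies the metal ore into workable alloy",
    ]

    vectoriser = CountVectorizer(stop_words="english")
    dtm = vectoriser.fit_transform(docs)            # document-term matrix

    lda = LatentDirichletAllocation(n_components=3, random_state=0)
    lda.fit(dtm)

    # Print the most heavily weighted words for each inferred topic.
    terms = vectoriser.get_feature_names_out()
    for k, weights in enumerate(lda.components_):
        top = [terms[i] for i in weights.argsort()[-5:][::-1]]
        print(f"topic {k}: {', '.join(top)}")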

Poster

At the poster session

Visualising big data is of central interest to the LDNA project, especially in the context of the upcoming LDNA Visualisation Workshop. With this in mind, we paid particular attention to projects that presented new and interesting ways of seeing large data. A number of presentations focused on network visualisations, often linking metadata, e.g. reconstructing the social networks of royal societies or academies from letter correspondence. An interesting visualisation of unstructured linguistic data came from the EPFL team: Vincent Buntinx, Cyril Bornet, and Frédéric Kaplan visualised lexical usage in 200 years of newspapers on a circle, with the radial dimension representing the number of years a word has been in use and the circumferential dimension showing the period during which it was used. [1]

Stylometrics, with its interest in being able to identify and measure aspects of language which contribute to the impression of authorial style, produced some interesting papers. One of the common themes for stylometrics and other DH strands of research is the way concepts are operationalised. The varied approaches to concepts taken by DH researchers were noticeable: for example, whether each noun can be considered a concept, or whether a concept should be defined as "a functional thing". This suggests that the work on concept identification undertaken by the LDNA team will be of interest to the wider DH community. Also amongst the stylometric papers was a look at historical language change by Maciej Eder and Rafal Górski, which used bootstrap consensus network analysis on part-of-speech (POS) tagged texts to contrast syntax and sentence structure between time periods. The paper used multidimensional scaling (MDS) to reduce POS-tagged texts to a single value which could then be plotted against time, allowing the authors to show a gradual change in the MDS results between the earliest and latest texts. The paper highlighted both how useful a visualisation can be for identifying a change and how difficult it can be to quantify exactly what the visualisation shows.
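(As a rough illustration of that MDS step: given a frequency profile per text, MDS can project each text onto a single coordinate, which can then be plotted against date. The profiles, dates, and distance measure below are invented for the sketch and are not the authors' actual features or implementation.)

    import numpy as np
    from sklearn.manifold import MDS

    # Invented stand-in: each row is a normalised frequency profile
    # (e.g. of POS trigrams) for one text, with a composition date.
    profiles = np.array([
        [0.30, 0.20, 0.25, 0.25],
        [0.28, 0.22, 0.26, 0.24],
        [0.20, 0.30, 0.20, 0.30],
        [0.18, 0.32, 0.18, 0.32],
    ])
    dates = [1500, 1550, 1600, 1650]

    # Reduce each profile to a single coordinate and list it against date.
    mds = MDS(n_components=1, random_state=0)
    coords = mds.fit_transform(profiles).ravel()

    for year, value in sorted(zip(dates, coords)):
        print(year, round(float(value), 3))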

However, on a different but very important note, a strong theme of the conference was that of diversity, with a thread of panels discussing the different ways in which this subject is applicable to the digital humanities. From a personal point of view, I think LDNA has a strong awareness of both the scope and the limitations of our interests and approaches (although we can never afford to be complacent). We've considered what our textual resources represent, and the RAs are soon to explore this subject from different angles in future blog posts. EEBO and other text collections are more expansive, inclusive, and diverse than prior research has been able to access, and this feels like part of an enormously positive movement in academia to open up more and more data for new kinds of study. As extensive as our resources are, however, they still have limitations reflecting the (mostly Western, mostly white, mostly male, mostly middle-to-upper class) societal groups who were able to read, write, and print the words which ended up in these collections. The resources open to academia are continually growing, and hopefully this expanding diversity will open up ever more of the world's knowledge to ever more of its population. Whilst the discussions at this conference have made clear that there is a long way to go in fully embracing diversity in the digital humanities, there are indications that the situation is improving, and it is incumbent upon us all to ensure that this continues.

For another view of the conference, Brian Aitken, Digital Humanities Research Officer at Glasgow, has written about his own experience on his blog.

———

1. Studying Linguistic Changes on 200 Years of Newspapers, Vincent Buntinx, Cyril Bornet, Frédéric Kaplan (EPFL (École polytechnique fédérale de Lausanne), Switzerland)

The Edge

Text Analytics at Sheffield DH Congress

Earlier in the year (2016), we issued a special call for papers, inviting others to join LDNA panel sessions at the Sheffield Digital Humanities Congress. We were delighted by the responses, and further delighted that the full DHC programme includes plenty of other material relevant to our text analytics interests–and a noticeable body of book-historical input too.

As a special privilege for those who follow the LDNA blog, here are two bonus abstracts outlining our conception of each LDNA panel:

TA 1: Between numbers and words

Session 4, Friday 9 September
ft. Hine, Shute, Siirtola et al.

Digitisation of texts facilitates kinds of statistical analysis that were previously difficult and perhaps impossible for humans to carry out. This series of papers explores the interface between statistics and close reading, teasing out how these modes of textual analysis can be applied jointly to examine the material, lexical, and semantic form of constituent texts. We discuss the use of quantitative analysis to reassess hypotheses about the work of compositors in fifteenth-century printing. We scrutinise a blueprint for moving between statistical data and words-in-context within collections too big for human reading (with special attention to concept formation). Lastly, we demonstrate how one newly-enhanced visualisation tool assists exploratory analysis to generate insights about genre and social variables in digital text collections including early modern correspondence and international Englishes.

TA 2: Identifying complex meanings in historical texts

Session 7, Friday 9 September
ft. Mehl, Recchia, Makela, et al.

With recent advances in computational tools and techniques, researchers are moving closer to the goal of identifying and describing complex meanings—semantic, discursive, social, and otherwise—in historical texts. This session approaches that goal from multiple angles. We discuss semantic meaning in terms of distributional semantic techniques, which connect the study of meaning in the humanities with the quantitative study of language in computational linguistics. We discuss discursive meaning via topic modelling techniques, and also explore the theoretical space between distributional semantics and topic modelling. Finally, we discuss social and historical meanings by looking at possibilities for analysing extra-linguistic contexts alongside linguistic data, within carefully annotated, structured data sets.

 

If that’s whet your appetite, you will find full abstracts for each paper–and for every paper in the Congress–on the main DHC site.

Last registration date is 7 September.