About the Concept Modelling Demonstrator

Concept models, or quads, are sets of four lemmas that occur together in spans of text up to 100 words long. Quads can be searched and ranked, and examples of each quad can be read in their original context.

The concept models on this site are built around 1,000 of the most frequently occurring nouns, adjectives, and verbs in EEBO-TCP, each of which can serve as the first word of a quad (the first word in the search interface). The second, third, and fourth words of each quad are restricted to nouns, adjectives, and verbs that occur at least 5,000 times in EEBO-TCP, systematically excluding some high-frequency words,* and must occur within 50 words (tokens) to the left or right of the first word. We exclude quads that do not pass a Pearson’s Chi-Square test threshold of 2.706 (p<0.05).
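The windowing step described above can be illustrated with a minimal Python sketch. It is not the project's own code: the function name, its inputs, and the example lemmas are assumptions made for illustration, and the statistical filtering step is left out.

```python
from itertools import combinations

WINDOW = 50  # tokens to the left or right of the first word, as described above


def candidate_quads(lemmas, node, allowed, window=WINDOW):
    """Yield (node, w2, w3, w4) tuples for every occurrence of `node`.

    `lemmas` is a list of lemma strings for one text and `allowed` is the set
    of lemmas permitted in positions two to four. Both are hypothetical
    inputs; the real pipeline may represent texts differently.
    """
    for i, lemma in enumerate(lemmas):
        if lemma != node:
            continue
        start = max(0, i - window)
        # Up to `window` tokens on each side of the first word.
        context = lemmas[start:i] + lemmas[i + 1:i + 1 + window]
        # Distinct permitted lemmas in the window, sorted alphabetically.
        partners = sorted({w for w in context if w in allowed and w != node})
        for trio in combinations(partners, 3):
            yield (node, *trio)
```

With 50 tokens on either side of the first word, every candidate produced this way falls within a span of roughly 100 words.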

Each quad can be displayed in 24 different ordered combinations. The second, third, and fourth words of each quad are by default presented here in alphabetical order, though you can input search terms in any order.
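The figure of 24 follows from the number of ways to order four distinct words (4 × 3 × 2 × 1). A short Python sketch, using illustrative lemmas and hypothetical function names, shows both the full set of orderings and the default alphabetical display:

```python
from itertools import permutations


def orderings(quad):
    """Return every ordered combination of the four lemmas in `quad`."""
    return list(permutations(quad))


def default_display(first, others):
    """Keep the first word in place and sort the other three alphabetically."""
    return (first, *sorted(others))
```

For example, `default_display("king_n", ["power_n", "law_n", "people_n"])` gives the same result whatever order the three additional words are entered in.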

You can search quads by selecting a first word from 1,000 of the most frequently occurring words in EEBO-TCP, and inputting up to three additional words. You can limit your search by frequency band or MI (Mutual Information) band. Frequency band 5 contains the most frequent quads in EEBO-TCP; band 1 contains the least frequent quads. MI band 5 contains the quads with the highest strength of co-occurrence; MI band 1 contains the quads with the lowest strength of co-occurrence. You may also choose to ‘show only prominent concepts’, which will return only quads in frequency and MI bands 3 to 5.**
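As a rough illustration of the ‘show only prominent concepts’ option, the Python sketch below keeps only results in bands 3 to 5. The record layout and names are hypothetical, and the sketch assumes that a quad must fall in bands 3 to 5 on both measures; the site's own implementation may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuadResult:
    """Hypothetical search result: four lemmas plus their two band values."""
    lemmas: tuple
    freq_band: int  # 1 (least frequent) to 5 (most frequent)
    mi_band: int    # 1 (weakest co-occurrence) to 5 (strongest)


def is_prominent(result, min_band=3):
    """Assumption: 'prominent' means band 3 or above on BOTH measures."""
    return result.freq_band >= min_band and result.mi_band >= min_band


def prominent(results):
    """Return only the prominent results, preserving their original order."""
    return [r for r in results if is_prominent(r)]
```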

You can view and rank search results according to frequency or MI.

When viewing search results, you may click ‘View Documents’ to view all documents containing the selected quad. Documents containing denser examples of the quad, i.e. where the words of the quad may occur more than once in close proximity, are listed first.

When viewing documents, you may click ‘View Details’ to view the text containing the quad, and to study it in its context.

Citation

To cite linguistic concept modelling:

Mehl, Seth. 2020. ‘Discursive quads: New kinds of lexical co-occurrence data with linguistic concept modelling’. Pre-print version here.

Further Reading

Mehl, Seth. 2019. Mapping lexical co-occurrence statistics against a part of speech baseline. In Hanna Parviainen, Mark Kaunisto and Päivi Pahta (eds), Corpus Approaches into World Englishes and Language Contrasts. Helsinki: eVarieng. http://www.helsinki.fi/varieng/series/volumes/20/mehl/.

Fitzmaurice, Susan, Justyna A. Robinson, Marc Alexander, Iona C. Hine, Seth Mehl, Fraser Dallachy. 2017. Linguistic DNA: Investigating conceptual change in Early Modern English discourse. Studia Neophilologica 89, 21-38. https://www.tandfonline.com/doi/full/10.1080/00393274.2017.1333891.

Fitzmaurice, Susan, Justyna A. Robinson, Marc Alexander, Iona C. Hine, Seth Mehl, Fraser Dallachy. 2017. Reading into the past: Materials and methods in historical semantic research. In Tanja Säily, Minna Nevala, Arja Nurmi & Anita Auer (eds), Exploring future paths for historical sociolinguistics: Methods, materials, theory. Amsterdam: Benjamins. 53-82.

Data Downloads

The Concept Modelling Demonstrator was developed by the Linguistic DNA project and The Digital Humanities Institute at the University of Sheffield. The original project developed billions of rows of data, which can be downloaded from here.


* Our analysis excludes the following auxiliary verbs and other very high-frequency lemmas: be_v, have_v, do_v, will_v, shall_v, may_v, must_v, ought_v, can_v, say_v, make_v, take_v, come_v, give_v, god_n, man_n, thing_n, christ_n.

** MI score varies according to the ordered combination of each quad. We present the quads in alphabetical order, and present the MI score that is the highest of any combination of the four lemmas.
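
The selection of the highest MI score described in this note can be sketched in Python as follows. The `mi_score` argument is a hypothetical stand-in for whatever function scores one ordered combination; the project's actual MI calculation is not reproduced here.

```python
from itertools import permutations


def reported_mi(quad, mi_score):
    """Return the highest score over all ordered combinations of `quad`.

    `mi_score` is a hypothetical callable that takes one ordering (a tuple of
    four lemmas) and returns its MI score as a number.
    """
    return max(mi_score(order) for order in permutations(quad))
```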