1. EEBO-TCP Trios
This data is used by the Early Modernity interface (see above). Each row represents a trio and contains 44 columns. The columns most likely to be of interest are: Column 1 (the trio); Column 2 (Lemma A); Column 3 (Lemma B); Column 4 (Lemma C); Column 35 (the frequency of the trio in EEBO-TCP); Column 43 (MI score of the trio, cf. Mehl 2019); Column 44 (chi-square score of the trio). The other columns contain data on parts of speech and their co-occurrence, used to implement MI and chi-square with a grammatical baseline (cf. Mehl 2019).
Note: This dataset has several billion rows of data.
- You can download the data here (812 MB as a zip file; over 6 GB when uncompressed)
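Because the uncompressed file is large, it may be easier to read it one row at a time than to load it whole. The following is a minimal Python sketch of that approach, pulling out only the columns of interest listed above. It assumes the file is tab-separated text with one trio per row, that the frequency column holds whole numbers, and that the scores parse as decimals; none of this is confirmed by the description above, and the names `EeboTrio` and `iter_eebo_trios` are illustrative only.

```python
import csv
from typing import Iterable, Iterator, NamedTuple


class EeboTrio(NamedTuple):
    trio: str          # column 1
    lemma_a: str       # column 2
    lemma_b: str       # column 3
    lemma_c: str       # column 4
    frequency: int     # column 35
    mi: float          # column 43
    chi_square: float  # column 44


def iter_eebo_trios(lines: Iterable[str], delimiter: str = "\t") -> Iterator[EeboTrio]:
    """Yield the columns of interest from each 44-column row, one row at a time.

    Assumption: rows are delimiter-separated text (tab by default).
    """
    reader = csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 44:
            continue  # skip blank or malformed rows
        try:
            # Column numbers in the description are 1-based; list indices are 0-based.
            yield EeboTrio(
                trio=row[0],
                lemma_a=row[1],
                lemma_b=row[2],
                lemma_c=row[3],
                frequency=int(row[34]),
                mi=float(row[42]),
                chi_square=float(row[43]),
            )
        except ValueError:
            continue  # skip a header row or a row with non-numeric scores
```

To use it on the downloaded file, pass an open file object, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever the unzipped file was saved. If the file turns out to be comma-separated, pass `delimiter=","`.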
2. Newsbooks 1649
In this data, the first three columns are the three lemmas contained in the trio. Column 4 is the frequency of the trio in the dataset. Column 5 records, for cases where lemma B (column 2) appears within 50 tokens to the left or right of lemma A (column 1), how many other nouns also occur within 50 tokens to the left or right of lemma A. Column 6 is the Linguistic DNA project’s implementation of MI score (cf. Mehl 2019).
The dataset contains 119,000 rows of trio data and 607 POS-tagged texts.
- You can download the data here (7.5 MB as a zip file)
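As an illustration of the six-column layout described above, here is a short Python sketch that reads the trio rows and lists those with the highest MI score. It assumes tab-separated text and that columns 4 and 5 hold whole numbers; the names `Trio`, `read_trios` and `top_by_mi`, and the `min_frequency` filter, are illustrative and not part of the dataset.

```python
import csv
import heapq
from typing import Iterable, List, NamedTuple


class Trio(NamedTuple):
    lemma_a: str      # column 1
    lemma_b: str      # column 2
    lemma_c: str      # column 3
    frequency: int    # column 4: frequency of the trio
    other_nouns: int  # column 5: other nouns within 50 tokens of lemma A
    mi: float         # column 6: MI score


def read_trios(lines: Iterable[str], delimiter: str = "\t") -> List[Trio]:
    """Parse six-column trio rows, skipping rows that do not fit the layout."""
    trios = []
    for row in csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE):
        if len(row) < 6:
            continue  # blank or truncated row
        try:
            trios.append(
                Trio(row[0], row[1], row[2], int(row[3]), int(row[4]), float(row[5]))
            )
        except ValueError:
            continue  # header row or non-numeric value
    return trios


def top_by_mi(trios: Iterable[Trio], n: int = 10, min_frequency: int = 1) -> List[Trio]:
    """Return the n trios with the highest MI score among sufficiently frequent trios."""
    candidates = (t for t in trios if t.frequency >= min_frequency)
    return heapq.nlargest(n, candidates, key=lambda t: t.mi)
```

Filtering on a minimum frequency before ranking is a common precaution, since MI scores tend to favour rare combinations; the threshold to use is a judgment call.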
3. Militarisation 2.0
In this data, column 1 is the row number; column 2 is the frequency of the trio in the dataset; column 3 is lemma A; column 4 is lemma B; and column 5 is lemma C. This data is built on a curated list of lemmas related to the research themes, together with evaluative adjectives.
This data is currently unavailable for download, but should be available soon.
4. Ways of Being in the Digital Age
In this data, the first three columns are the three lemmas contained in the trio. Column 4 is the frequency of the trio in the dataset. Column 5 records, for cases where lemma B (column 2) appears within 50 tokens to the left or right of lemma A (column 1), how many other nouns also occur within 50 tokens to the left or right of lemma A. Column 6 is the Linguistic DNA project’s implementation of MI score (cf. Mehl 2019). The data covers the 200 highest-ranked pair co-occurrences in the corpus, restricted to noun lemmas that occur at least 50 times in the whole dataset.
The dataset contains 62,000 rows of data.
- You can download the dataset here (12.1 MB as a zip file)
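Since this dataset is organised around pair co-occurrences, one way to explore it is to group the trio rows by their first two lemmas and see which third lemmas accompany each pair. The Python sketch below does this using only columns 1 to 4. It assumes tab-separated text and that lemma A and lemma B together form the pair; the function name `group_by_pair` is illustrative.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, str]


def group_by_pair(
    lines: Iterable[str], delimiter: str = "\t"
) -> Dict[Pair, List[Tuple[str, int]]]:
    """Map each (lemma A, lemma B) pair to its (lemma C, frequency) entries.

    Entries for each pair are sorted from most to least frequent.
    """
    groups = defaultdict(list)
    for row in csv.reader(lines, delimiter=delimiter, quoting=csv.QUOTE_NONE):
        if len(row) < 4:
            continue  # blank or truncated row
        try:
            frequency = int(row[3])  # column 4: frequency of the trio
        except ValueError:
            continue  # header row or non-numeric value
        groups[(row[0], row[1])].append((row[2], frequency))
    for third_lemmas in groups.values():
        third_lemmas.sort(key=lambda item: item[1], reverse=True)
    return dict(groups)
```

With around 62,000 rows, the whole file fits comfortably in memory, so building the full dictionary in one pass should be practical.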