Collocations and N-grams
My aim is to produce, and make freely available, complete lists of the collocation and n-gram matches among a large body of early modern plays by Shakespeare and his contemporaries. Now read on.
Download the lists of Collocations and N-grams (CAN) from my Microsoft OneDrive.
N-gram search results are complete. For every play you can download the complete set of n-gram matches with every other play.
Collocation search results are a work in progress. I aim to add the results at the rate of several plays per week.
Please read the User Guide I have provided.
Please be aware that, as of the start of October 2017, the results files use about 12.5 GB of disk space when unzipped, and that figure is growing rapidly. You might prefer to browse first and download results only for the plays you are interested in.
I hold the copyright in the files you download, but you may use them freely for non-commercial purposes. Your use of them is subject only to the Creative Commons Attribution-NonCommercial license.
Until the 1990s, the discovery of n-grams and, more generally, collocations shared by plays of the early modern era was limited by researchers’ memories of what they had read, supplemented by the few concordances that were available. The availability of transcriptions of the texts on the World Wide Web has largely, though not wholly, removed that limitation.
The LION and EEBO-TCP websites now provide complete coverage of the texts of surviving early modern plays, allowing us to search for matching words and phrases, including words in proximity to each other, taking variant spellings into account.
In the field of authorship attribution research, MacDonald P. Jackson has pioneered the systematic use of these resources, to support or rebut attributions by finding matching n-grams and collocations.
The method used by Jackson is largely manual, although it has been partly automated by software written by Douglas Duhaime. That software automates searches in LION but it still requires the researcher to provide a file listing all the words and phrases she wishes to search for. A different step towards automation was taken by Sir Brian Vickers, who has used plagiarism-detection software to find matches between plays.
To a greater or lesser extent, all these approaches require a substantial investment of time by a researcher in performing the searches. What is needed is a more industrial approach, one that reduces the time spent on searching, freeing up that time to spend on the more productive task of analysing results. Instead of repeatedly asking the computer to find given sets of words, we need to tell it once how to recognize matching n-grams and collocations, and then tell it to find them all and list them.
This website presents the results of fully automated searches for matching n-grams and collocations among early modern plays, using programs written by me. Collocation search results of course include those for n-grams, since an n-gram is a special type of collocation. Nevertheless, for convenience, results for n-grams are provided separately from the more general collocation search results.
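To illustrate why an n-gram is a special type of collocation, here is a minimal sketch. It is not the program used to produce the results. It assumes, purely for illustration, that a collocation is an ordered pair of words occurring within some maximum distance of each other; the function name `collocation_pairs`, the distance of 4 and the second sample line are all invented for the example. Under that assumption a bigram is simply a collocation whose maximum distance is 1.

```python
def collocation_pairs(tokens, max_distance):
    """Return ordered word pairs (first, second) where `second` follows
    `first` by at most `max_distance` positions. Illustrative definition only."""
    pairs = set()
    for i, first in enumerate(tokens):
        for second in tokens[i + 1 : i + 1 + max_distance]:
            pairs.add((first, second))
    return pairs


line_a = "the quality of mercy is not strained".split()
line_b = "mercy that is never strained".split()  # invented comparison line

# Bigrams are the special case where the two words must be adjacent.
shared_bigrams = collocation_pairs(line_a, 1) & collocation_pairs(line_b, 1)

# Allowing a gap of up to 4 positions finds matches that bigrams miss.
shared_collocations = collocation_pairs(line_a, 4) & collocation_pairs(line_b, 4)

print(sorted(shared_bigrams))
print(sorted(shared_collocations))
```

The two lines share no bigram, yet they share the pairs mercy/is, mercy/strained and is/strained once intervening words are allowed, which is why the collocation results are so much more numerous.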
My searches were done using modern-spelling texts, eliminating the problem of matches being missed because of spelling differences, and using the lemmatized forms of words. This allows for the widest possible discovery of matches; for example, "kind hearts" is matched with "kind-hearted".
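The effect of lemmatization on matching can be sketched as follows. This is a simplified illustration and not my search program: the tiny `LEMMAS` table, the hyphen handling and the function names are assumptions made for the example, whereas the real texts carry a lemma for every word.

```python
# Hypothetical lemma table covering only the words in this example.
LEMMAS = {"hearts": "heart", "hearted": "heart"}


def lemmatize(text):
    """Lower-case the text, split hyphenated words, and map each token to its lemma."""
    tokens = text.lower().replace("-", " ").split()
    return [LEMMAS.get(token, token) for token in tokens]


def ngrams(lemmas, n):
    """Return the set of all runs of n consecutive lemmas."""
    return {tuple(lemmas[i : i + n]) for i in range(len(lemmas) - n + 1)}


def shared_ngrams(text_a, text_b, n):
    """Return the n-grams, in lemmatized form, that occur in both texts."""
    return ngrams(lemmatize(text_a), n) & ngrams(lemmatize(text_b), n)


match = shared_ngrams("kind hearts", "a kind-hearted man", 2)
print(match)
```

Because both phrases reduce to the lemma sequence kind, heart, the bigram is found even though the surface forms differ.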
For each play, search results for both n-grams and collocations are provided in three formats. (i) For casual browsing and qualitative analysis, an HTML page is provided, giving what my search program considers to be the best few thousand matches for that play. (ii) A CSV file is provided containing the full set of matches. This may be opened in Excel, or other tools, and be used for quantitative analysis. (iii) A summary is provided, also as a CSV file, giving the number of matches with each play.
The searches cover a substantial but not complete sample of early modern plays, as not all of them were freely available to me in modernized and lemmatized texts. The sample of plays I have used consists of 527 plays from the years 1552 to 1657 (the Additions to The Spanish Tragedy being counted as a separate little play).
To prevent the most significant results from being completely swamped by the thousands of insignificant ones, some results have been excluded. Bigrams and trigrams are reported only if they contain at least two words which are not among the 154 most common words in all the plays, to avoid thousands of matches for phrases like "and the". Tetragrams and above are always reported. Collocations are reported only if they contain at least two words not on the common words list. That list contains mainly function words such as "the", "and", "of", "to", and so on.
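The exclusion rules above can be expressed as a short sketch. The six-word `COMMON_WORDS` set is a stand-in for the real list of 154 words, and the function name `is_reported` is invented for the example. The sketch also assumes that each word occurrence is counted, so a repeated uncommon word would count twice; the rule as stated does not settle that point.

```python
# Stand-in for the real list of the 154 most common words (assumption).
COMMON_WORDS = {"the", "and", "of", "to", "a", "in"}


def is_reported(words, is_ngram):
    """Apply the exclusion rules to one match.

    words    -- the lemmas making up the match
    is_ngram -- True for an n-gram, False for a more general collocation
    """
    # Tetragrams and longer n-grams are always reported.
    if is_ngram and len(words) >= 4:
        return True
    # Everything else needs at least two words off the common-words list.
    uncommon = sum(1 for word in words if word not in COMMON_WORDS)
    return uncommon >= 2


print(is_reported(["and", "the"], is_ngram=True))
print(is_reported(["the", "kind", "heart"], is_ngram=True))
print(is_reported(["and", "to", "the", "of"], is_ngram=True))
```

The first call is rejected, the second is accepted because it has two uncommon words, and the third is accepted only because it is a tetragram.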
As my results demonstrate, matching n-grams are abundant, and matching collocations far more so. Typically, each play has several tens of thousands of n-gram matches with other plays; and collocations are typically ten to twenty times as numerous as n-grams. The researcher’s challenge is no longer to find matches; rather, it is to evaluate the matches given to her on a plate and draw conclusions.
These results should have many uses. If you want to find the verbal links between two plays, you can now open the results file for either of them and then filter on the other. For example, you could look at the matches for Act 1 of Titus Andronicus and filter to see how many n-grams and collocations they share with the plays of George Peele. Having found them, you can judge them in context rather than in isolation, by looking at results for other authors’ plays for comparison. Counterexamples to rebut false authorship attributions should now be easier to find. The number of n-gram matches is more than 13 million, of which more than 5 million are tetragrams or longer, and millions more collocation matches will be added every week, so there are opportunities for statistical analysis.
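For those who prefer a script to a spreadsheet, the kind of filtering described above might look like the following sketch. The column names (`act`, `phrase`, `other_play`, `other_author`) and the sample rows are placeholders invented for the example and do not reflect the real layout of the results files; please check the User Guide for the actual columns before adapting it.

```python
import csv
import io
from collections import Counter

# Placeholder data standing in for one play's full-match CSV file.
SAMPLE = """act,phrase,other_play,other_author
1,phrase one,Play A,Author X
1,phrase two,Play B,Author Y
2,phrase three,Play A,Author X
1,phrase four,Play C,Author X
"""


def filter_matches(csv_file, act, author):
    """Return the rows for one act whose matching play is by the given author."""
    reader = csv.DictReader(csv_file)
    return [
        row
        for row in reader
        if row["act"] == act and row["other_author"] == author
    ]


rows = filter_matches(io.StringIO(SAMPLE), act="1", author="Author X")

# Count the surviving matches per play, much as the summary files do per play.
per_play = Counter(row["other_play"] for row in rows)
print(per_play)
```

To run this against a downloaded file, replace `io.StringIO(SAMPLE)` with an open file handle, for example `open(path, newline="", encoding="utf-8")`, where `path` is wherever you saved the CSV.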
My work would not have been possible without the generosity of the Shakespeare His Contemporaries and Folger Digital Texts websites in making their modernized and lemmatized texts freely available. I would like to thank Martin Mueller for helping me to download the SHC texts. The Folger texts have been edited by Paul Werstine and Barbara A. Mowat, but the SHC texts, derived from EEBO-TCP, are unedited and, despite the great work by Professor Mueller and his team, they contain some errors of transmission, transcription, modernization and lemmatization, which are of course reflected in the search results.
My greatest thanks are due to Sir Brian Vickers, who encouraged me to undertake this work and generously offered to fund it, although fortunately I did not need to accept financial help. This work has so far taken up my spare time over the past six months and Brian and his colleague Darren Freebury-Jones have corresponded with me throughout. Sharing my early results with them and getting their feedback has been valuable. Gabriel Egan also helped me indirectly by giving me a copy of the New Oxford Shakespeare Authorship Companion earlier this year, which provided me with a useful introduction to authorship attribution research. I am responsible for all errors and will be grateful to be informed of them.