The Counting of N-grams
There is more than one way to count matching N-grams. On this site, I have provided counts based on two methods that I consider reasonable. It is essential to understand these methods in order to avoid misusing the counts and obtaining invalid results.
Before I consider N-gram matches, I will first consider N-grams as such. Given a text and some number N, how many N-grams -- that is, how many phrases consisting of N consecutive words -- does it contain?
For short texts this is easy to work out. If my text is "the sky is blue" and I am considering 3-grams, then it is obvious that there are only two 3-grams in the text: "the sky is" and "sky is blue". But how do we work out the answer in the general case? Before I give the formulae, I want to present the arguments by which they are worked out.
Suppose I have a text containing T words, and I have chosen a number N between 1 and T. How many N-grams are in my text? Observe that when I find an N-gram, I can identify it unambiguously by pointing to its first word: if I know where the N-gram starts, I can simply read the next N-1 words to get the complete N-gram. Thus, for any fixed value of N, an N-gram is completely determined by the position of its first word. From this we see at once that the number of N-grams in the text cannot exceed the number of words in the text, since each N-gram starts at some word of the text. But we can do much better than that.
Take the simple case N = 2 (bigrams). We know that every 2-gram starts with one word in the text. Does this mean that there are T bigrams in the text? No, because the last word cannot be the start of a 2-gram. But every other word can. Therefore, the number of 2-grams in the text must be T-1. By a similar argument, we see that if N = 3 then the number of 3-grams in the text must be T-2, since every word except the last two is the start of a 3-gram. We can now see the general formula, which is easy enough to prove formally (proof not provided here):
No. of N-grams in a text of T words = T+1-N
As an example, consider the simple case N = 1. The number of 1-grams in the text is obviously T and that is indeed the answer the formula gives. At the other extreme, if we set N = T, it is obvious that there is only one T-gram in the text, which is the whole text, and the formula correctly tells us that.
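The formula is easy to verify in code. The following sketch (the function name is my own, for illustration) lists the N-grams of a short text:

```python
def ngrams(words, n):
    """All n-grams of a word list, as tuples, in text order."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

text = "the sky is blue".split()  # T = 4 words

# T + 1 - N predicts 4 + 1 - 3 = 2 trigrams: "the sky is" and "sky is blue".
trigrams = ngrams(text, 3)

# N = 1 gives T 1-grams; N = T gives exactly one T-gram (the whole text).
```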
We can generalize the formula further. Suppose I pick two numbers, A and B, both in the range 1 to T, with A <= B. I want to know how many A-grams to B-grams there are in the text. For example, I might pick A = 2 and B = 4, and want to know how many 2-grams to 4-grams there are in the text. Now, I could just work out the answer from the above formula for each value of N in the range A to B, and then add up the answers. That would work, but with a little mathematics (not given here), we can obtain a formula that gives us the answer in one go:
No. of A-grams to B-grams in a text of T words = (T+1)(B-A+1) + A(A-1)/2 - B(B+1)/2
Observe that setting A = B in this formula gives us the same answer as the first formula. The first formula is just a special case of the second formula. Finally, by setting A = 1 and B = T, in other words, by covering the full range of all possible N-grams in the text, we obtain the result:
No. of N-grams for all possible values of N in a text of T words = T(T+1)/2
Recall from your school days the formula for the sum of the natural numbers from 1 to T:
1 + 2 + 3 + 4 + ... + (T-1) + T = T(T+1)/2
From this we get the interesting result that the number of all possible N-grams in a text of T words is equal to the sum of the natural numbers from 1 to T. For long texts such as early modern plays, this value is approximately equal to half the square of the number of words in the text. For example, King Lear has about 25,000 words; therefore, it contains approximately 300 million N-grams. This surprisingly high number prepares us for the very high numbers of matches we find when we search for these N-grams in hundreds of other plays.
You can use the tool below to do the calculation. Just enter your chosen values for A, B and T, and press the button:
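For readers working away from the interactive tool, the same calculation can be sketched in a few lines of code (the function names are mine, for illustration):

```python
def count_range(T, A, B):
    """Closed-form count of all A-grams to B-grams in a text of T words."""
    return (T + 1) * (B - A + 1) + A * (A - 1) // 2 - B * (B + 1) // 2

def count_range_by_sum(T, A, B):
    """The same count, obtained by summing T + 1 - N over N = A..B."""
    return sum(T + 1 - N for N in range(A, B + 1))

# Setting A = 1 and B = T recovers T(T+1)/2, the sum 1 + 2 + ... + T.
```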
In all the above formulae the calculations are based on tokens, not types. For example, if the text is "to be or not to be" then the 2-gram "to be" occurs twice and the formulae above will count it twice. If we want to count types rather than tokens then it is obvious that the answers will be lower, but no general formula is available.
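The distinction matters in practice. Here is a minimal sketch of counting 2-gram tokens versus 2-gram types for the example just given (variable names are my own):

```python
words = "to be or not to be".split()  # T = 6

# Tokens: every occurrence counts, so the formula gives T + 1 - 2 = 5.
bigram_tokens = [tuple(words[i:i + 2]) for i in range(len(words) - 1)]

# Types: each distinct bigram counts once; "to be" occurs twice,
# so there are only 4 types.
bigram_types = set(bigram_tokens)
```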
Finally, it is necessary to note that some scholarly papers give very incorrect values for the numbers of N-grams in the texts they use. Gary Taylor, "Empirical Middleton: Macbeth, Adaptation, and Microauthorship" in Shakespeare Quarterly (vol. 65, p.244, 2014) tells readers that there are 124 bigrams in a passage of 63 words, which is twice the correct number. It appears from the footnote that Taylor has tried to do a calculation and got confused between bigrams and their constituent words. Gary Taylor, "Shakespeare's Early Gothic Hamlet" in Critical Studies (vol. 31 no. 1/2, p.9, 2019) claims that there are 2,419 "word sequences of two, three or four consecutive words" in a scene of 271 words, which is about three times the correct number. Taylor claims to have put these 2,419 N-grams into an Excel spreadsheet, which is impossible, so the nature of his error is unclear.
Consider two texts that share the phrase ABCD, where A, B, C and D are four words. We ask: how many 1-grams, 2-grams, 3-grams and 4-grams do these texts have in common? The answers are easy to work out:

- 1-grams: A, B, C, D (four matches)
- 2-grams: AB, BC, CD (three matches)
- 3-grams: ABC, BCD (two matches)
- 4-grams: ABCD (one match)

We thus see that there are ten N-gram matches in total (4 + 3 + 2 + 1 = 10). I shall call these formal N-gram matches.
It would be absurd for a researcher to list all ten N-gram matches here, because they are nested within each other. The custom is to list only the longest match, which is the 4-gram ABCD. I shall call this a maximal N-gram match. A maximal match is also a formal match, but the reverse is not true.
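The distinction between formal and maximal matches can be made concrete with a short sketch (a brute-force enumeration; the function name is my own):

```python
def formal_matches(t1, t2):
    """Every formal N-gram match between two word lists: each pair of
    identical N-word sequences, one from each text, for every N."""
    matches = []
    for n in range(1, min(len(t1), len(t2)) + 1):
        g1 = [tuple(t1[i:i + n]) for i in range(len(t1) - n + 1)]
        g2 = [tuple(t2[j:j + n]) for j in range(len(t2) - n + 1)]
        matches += [a for a in g1 for b in g2 if a == b]
    return matches

# For two texts both reading "A B C D": 4 + 3 + 2 + 1 = 10 formal matches,
# of which only the 4-gram ABCD is maximal.
both = formal_matches("A B C D".split(), "A B C D".split())
```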
I have provided lists of N-gram matches for every play. These are of course maximal matches, as is customary. The lists are accompanied by counts, which are therefore counts of maximal matches.
Independently of that, I have provided counts of formal N-gram matches, for use in computational stylistic tests. As I explain below, it makes more sense to count formal matches for this purpose.
In some cases, the decision is dictated by the nature of the work. If our aim is to list N-gram matches, perhaps for qualitative analysis, then maximal matches, and therefore counts of maximal matches, are appropriate. On the other hand, if we are looking at, say, 4-grams in isolation from other N-grams, then of course we must use formal matches. In some other cases, there may be a genuine choice whether to count formal or maximal matches, as the following examples illustrate.
Suppose we are using maximal matches. We are comparing many plays, perhaps to perform authorship attribution. Suppose we compare the plays by asking how many 3-grams they have in common with each other. We would have to report that the two texts in the ABCD example above do not have any 3-grams in common (because the 3-gram they do share is nested inside a 4-gram and is therefore not maximal). This is clearly unsatisfactory, because it causes a match between the two texts to be missed.
A more satisfactory alternative is to count maximal 3-grams-and-above; here, that would allow us to report one match between the two texts, because the maximal 3-grams-and-above count includes the maximal 4-gram. From this argument we see that it can seldom or never make sense to use maximal N-gram match counts singly. We must use counts of maximal 1-grams-and-above, maximal 2-grams-and-above, maximal 3-grams-and-above, and so on. That is what the Summary spreadsheets that go with my lists of maximal N-gram matches contain (except for 1-grams, for which I did not list maximal matches at all).
That is not the end of the matter. Suppose we had another pair of texts, containing the phrases ABCD and ABCE respectively. For these, the formal N-gram matches are as below:

- 1-grams: A, B, C (three matches)
- 2-grams: AB, BC (two matches)
- 3-grams: ABC (one match)
- 4-grams: none
There is just one maximal 3-gram here, and no 4-grams. If we were using the maximal 3-grams-and-above test, we would have to report one match between these texts. Thus, we report one maximal 3-grams-and-above match for the earlier pair of texts, which both contain ABCD, but also one match for the pair containing ABCD and ABCE. That is not satisfactory either: the texts are different, and a good test should be able to tell them apart.
Counts of formal N-gram matches do not suffer from the problem illustrated above. For the pair of texts both containing ABCD, there are two formal 3-gram matches, while for the pair of texts containing ABCD and ABCE, we have just one formal 3-gram match. The pairs of texts look different, and our formal 3-gram test duly reports them as different.
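A sketch of the formal 3-gram comparison for these two pairs of texts (the function name is mine):

```python
def formal_n_matches(t1, t2, n):
    """Count of formal n-gram matches between two word lists."""
    g1 = [tuple(t1[i:i + n]) for i in range(len(t1) - n + 1)]
    g2 = [tuple(t2[j:j + n]) for j in range(len(t2) - n + 1)]
    return sum(1 for a in g1 for b in g2 if a == b)

abcd = "A B C D".split()
abce = "A B C E".split()

# ABCD/ABCD gives two formal 3-gram matches (ABC and BCD);
# ABCD/ABCE gives just one (ABC), so the test tells the pairs apart.
```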
Formal N-gram counts do better in the above example because, unlike maximal N-gram counts, they have a built-in cascading effect. If, for example, we are counting maximal 3-grams-and-above, then a maximal 3-gram counts as one, as does a maximal 4-gram, as does a maximal 5-gram, and so on. By contrast, a formal 4-gram contributes two to the formal 3-gram count, a formal 5-gram contributes three to the formal 3-gram count, and so on. This means that a formal N-gram contains information not just about N-grams but also information about (N+1)-grams, (N+2)-grams, and so on. Moreover, it gives a higher weighting to higher-order N-grams, rather than treating them all equally, as maximal N-gram counts do.
A further argument in favour of formal N-gram matches is that they are easier to find than maximal matches. If two texts have N successive words in common then that is a formal N-gram match, without further ado. But is it also a maximal N-gram match? We cannot decide without looking at (N+1)-gram matches: if there is an (N+1)-gram match containing the N successive words, then we do not have a maximal N-gram match; otherwise, we do.
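The extra work that maximality requires can be seen in code. A formal match needs no context, but deciding maximality means checking whether the match extends to an (N+1)-gram match on either side (a sketch; function and variable names are mine):

```python
def is_maximal(t1, t2, i, j, n):
    """Given a formal n-gram match t1[i:i+n] == t2[j:j+n], decide whether
    it is maximal, i.e. not contained in an (n+1)-gram match."""
    assert t1[i:i + n] == t2[j:j + n], "not a formal match"
    extends_left = i > 0 and j > 0 and t1[i - 1] == t2[j - 1]
    extends_right = (i + n < len(t1) and j + n < len(t2)
                     and t1[i + n] == t2[j + n])
    return not (extends_left or extends_right)

abcd = "A B C D".split()
abce = "A B C E".split()

# ABC in ABCD/ABCD extends to ABCD, so it is formal but not maximal;
# ABC in ABCD/ABCE cannot be extended, so there it is maximal.
```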
My conclusion from the above discussion is that formal N-gram counts should (continue to) be used for computational stylistics.
The Summary spreadsheets that accompany my lists of N-gram matches contain counts of the matches listed. Since the matches listed are maximal, those Summaries contain counts of maximal N-gram matches. Listing matches means listing the tokens, so these counts are counts of tokens, not types.
With formal N-grams, I have provided separate sets of counts, one set produced by counting tokens, the other produced by counting types.
If types make better authorial markers than tokens, then of course we need to use them. But unless that is the case, it makes more sense to use token-based counts, for one very important reason. Token-based counts are scalable. What does that mean?
Consider an example. There are 17,778 tokens in 1 Tamburlaine and 18,158 tokens in 2 Tamburlaine. How many tokens are there in the two plays together? The answer is easy to work out by addition: it is 35,936. Now, there are 2,422 types in 1 Tamburlaine and 2,474 types in 2 Tamburlaine. It would be wildly wrong to add these numbers together and say that there are 4,896 types in the two plays together. That is because the plays have many words in common, which are counted separately as tokens but just once as types. The actual number of types in the two plays is 3,420. It cannot be derived from any other numbers, by addition or any other method, and it must be worked out separately.
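The same point in miniature, with two toy "plays" standing in for the Tamburlaine texts (the word lists are invented for illustration):

```python
part1 = "the sun doth rise".split()
part2 = "the sun doth set".split()

# Token counts are additive: the total is just the sum of the parts.
tokens_together = len(part1) + len(part2)  # 4 + 4 = 8

# Type counts are not: shared words collapse in the union, so the
# total must be worked out afresh from the combined text.
types_together = len(set(part1) | set(part2))  # 5, not 4 + 4
```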
If we want to do authorship attribution using large amounts of data, perhaps testing many alternative attributions, then our methods must be scalable. That is, it must be possible for us to scale up from using a handful of numbers to using hundreds, thousands or even millions of numbers. Counts of tokens make this possible; counts of types do not. If we have counted tokens for a large set of plays, or parts of plays, then we can treat the counts as building blocks. We can assemble them in different ways, according to the authorship theories we are testing, simply by adding the right combinations of counts, without having to go back to the texts and make new counts every time. That is why, in tests involving counts, we should always use tokens, unless there is a compelling reason to use types.
My database contains 527 plays, consisting of 9,577,660 words, including stage directions but excluding speech prefixes (of which there are 347,547). A reader of the Summary spreadsheets I have provided, particularly those for formal N-gram matches, may be astonished at the very large numbers of matching N-grams. When a play contains only a few tens of thousands of words, how can there be tens of millions of matching N-grams for it?
The answer lies in the effects of multiplication. Suppose a phrase occurs three times in one play, and four times in another play. Each of the three occurrences of the phrase in the first play matches each of its four occurrences in the other play, so we have twelve N-gram matches. When counting formal N-gram matches, I have not excluded any word or phrase from consideration, as I did with maximal matches. When we consider that very common phrases like "and the" and "of the" occur many times in every play, we can understand how the counts become so large. In an extreme case, if "the" occurs a thousand times in each of two plays, it alone contributes a million to the count of formal 1-gram matches between those plays.
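The multiplication can be sketched directly. Counting formal 1-gram matches between two texts amounts to multiplying occurrence counts word by word (the function name is mine):

```python
from collections import Counter

def formal_1gram_matches(t1, t2):
    """Each of the m occurrences of a word in one text matches each of
    its n occurrences in the other, contributing m * n matches."""
    c1, c2 = Counter(t1), Counter(t2)
    return sum(c1[w] * c2[w] for w in c1)

# "to" and "be" each occur twice in the first text and once in the
# second, so together they contribute 2*1 + 2*1 = 4 matches.
total = formal_1gram_matches("to be or not to be".split(), "to be".split())
```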
The phenomenon is amplified by the fact that my N-gram searches use lemmatized forms of words, not the words themselves. This is the right thing to do, because it allows, for example, singular words to be matched with their plurals. If the searches were not lemma-based, we would miss many thousands of important N-gram matches which ought to be counted. Doing lemma-based searches allows more matches to be found; however, a side-effect is that it inflates counts of 1-gram and, to a lesser extent, 2-gram matches to levels even larger than they would be otherwise.
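A toy illustration of the effect (the lemma table below is invented for illustration and is far smaller than any real lemmatizer):

```python
# Invented toy lemma table, for illustration only.
LEMMAS = {"kings": "king", "eyes": "eye", "loved": "love", "loves": "love"}

def lemmatize(words):
    return [LEMMAS.get(w, w) for w in words]

a = "the kings eyes".split()
b = "the king loved".split()

# On surface forms the texts share only "the"; on lemmas they also
# share "king", so lemma-based matching finds an extra 1-gram match.
surface_shared = set(a) & set(b)
lemma_shared = set(lemmatize(a)) & set(lemmatize(b))
```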