A Response to Egan et al

Given below is my response to Santiago Segarra, Mark Eisen, Gabriel Egan & Alejandro Ribeiro (2019): A Response to Pervez Rizvi's Critique of the Word Adjacency Method for Authorship Attribution, ANQ: A Quarterly Journal of Short Articles, Notes and Reviews, DOI: 10.1080/0895769X.2019.1590797.

For brevity, I have quoted just enough from the article for readers to find the passage I am referring to. Readers should have that article in front of them, and of course my own article, when they read what follows. I refer to the authors as Egan et al, since Egan is named as the corresponding author.

Egan et al begin with a misrepresentation: "Rizvi's first objection is that the Word Adjacency Network method attends only to function words...". That was part of my brief description of their method, with a tongue-in-cheek example, as I said, to show what that means in practice. Bizarrely, my example has been hailed by them as supportive of their approach. They write: "Fletcher and Massinger, when reproducing Shakespeare's style, kept his exact same use of function words, thereby implicitly accepting the fact that part of Shakespeare's signature use of language is his choice and sequencing of function words." There are two points to note here. First, my example was from two plays whose authorship we think we know. That is why we can say that one passage is a parody of the other. If the authorship of one of the plays were unknown, Egan et al's method would wrongly treat this example as evidence of common authorship, making it odd, to put it mildly, for them to claim that my example actually supports them. Secondly, it is of course absurd for Egan et al to purport to look into Fletcher and Massinger's minds and claim that they were "implicitly accepting the fact that part of Shakespeare's signature use of language is his choice and sequencing of function words."

Egan et al go on to give an example from hurricane prediction in Texas to make the point that a model does not need to capture every variable for it to be of use, perhaps very accurate use. That is correct, and I did not suggest otherwise. However, there are two reasons why this explanation is too complacent here. First, for them to appeal to the argument that, notwithstanding the exclusion of non-function words from their method, it works well and is therefore valid, is to sacrifice the theoretical basis for the method. The safest methods are the ones where the results from experiments are explained by the theory. A scientist would not be satisfied merely with showing that a formula he has invented can make accurate predictions: he would seek to explain why the formula makes sense in theory. The more often you explain away some theoretical problem by saying "the method works despite this problem", the more vulnerable is your method to the possibility that you have fitted it to a specific set of data and it might not be generally applicable. The experiments are there to confirm that your theory is correct for the data you use in the experiments; but only a sound theoretical framework can assure you that you can safely apply the method to other data.

Furthermore, in this case, Egan et al happen to be making a strong claim. They claim that the adjacencies measured from function words are enough to give them a probability distribution, which is a prerequisite for the use of Markov chain mathematics. I accept that a probability distribution does not need to be perfect, but it needs to be defensible. If it is not, then, again, you sacrifice your theoretical framework. I showed, with my 'devoid of' example, that there is at least room for doubt that the function word adjacencies do indeed provide a probability distribution. I picked the simplest possible example, because it was easy for me to find. There are any number of more complex possible scenarios, wherein a combination of non-function words might exercise a demonstrable influence on the function words in proximity to them. At the very least this puts the onus on Egan et al to explain why they really do have a probability distribution when, as I showed, the formulae they are using would make absolutely any set of numbers look like a probability distribution. They make no attempt at all to argue for the correctness of their probability distribution; they simply assert it. Again, this recourse leads them to sacrifice a little more of the theoretical basis, leaving them to rely solely on their test results.

Egan et al go on to address my objection that they excluded arguably the most important evidence of all, because their formulae fail when confronted with zeroes. Their first attempt to answer this is simply to invent something I did not write ("Rizvi suggests that it is obviously the case that the latter transition...") and purport to challenge it, ignoring my actual criticism. Next, they offer an excuse: "It might well be that if we were to observe a further 1,000 words..." They have tried to distract attention from their failure to take important evidence into account, by speculating about what might have happened if the evidence had been different. They then attempt to turn the failure of their formulae into a virtue: "Our method encodes this preference by discarding the transitions that appear fewer than k times. For the published method, we selected k = 1...". This is a post-hoc rationalisation. As anyone can see from their published articles, no such selection was ever mentioned, nor would it have excused the failure of their formulae even if it had been mentioned. They presumably hope that this hopeless combination of excuses will pull the wool over people's eyes and make their failure seem like a carefully worked out strategy. Their next sleight of hand is this: "We looked for all such pairings and recorded their strengths for all positive transitions, whereas Rizvi confines himself only to those for which the value in one canon, the k parameter, was equal to zero." I criticised them for excluding the evidence, but they try to make it seem as if I had asked them to consider just that evidence on its own.

As I wrote above, when you lose the theoretical basis of your method by making irrational exclusions forced on you by your inept formulae -- which is the case here -- you have no choice but to stake all your credibility on your test results. Egan et al finally admit this, when they write: "No matter how it does what it does, an authorship attribution method deserves scholarly attention if it can be objectively shown to be a good predictor of who wrote what ..." They say this as a prelude to answering my point about their claims of high success rates, going up into the 90+ percent. There is a long section, beginning "Here Rizvi has simply misunderstood ..." in which they invent a criticism I did not make and spend a lot of words purporting to answer it. My criticism was very simple. They had used 94 plays to derive the set of function words they considered optimal. Those 94 plays were therefore ineligible to take part in the validation of the method. It is an elementary principle that the data you use to train your method cannot then be used again to validate it. Egan et al affect not to have picked up this point and therefore they do not address it.

Finally, Egan et al offer a pageful of smoke-and-mirrors to answer my point that they manipulated their results to make them seem more impressive than they were. There was one simple and effective way -- indeed, the only way -- for them to answer me. That was to disclose their actual results, which they did not do in their earlier articles. They have chosen once again not to disclose those results. Readers will draw their own conclusions.

Supplementary Criticisms


My article called 'Authorship Attribution for Early Modern Plays using Function Word Adjacency Networks: A Critical View', published in American Notes and Queries in 2018, sets out my main arguments against the function words adjacency networks method. By that method, the New Oxford Shakespeare editors had divided up the Henry VI trilogy between Shakespeare and Marlowe (Segarra et al 2016). I argued in my article that the method has been badly defined, and its claimed success is illusory and not to be relied on.

The mathematical definition of the method has been given by its inventors in several articles in the last few years. I cited the latest definition, in Eisen et al 2018. However, to make my article suitable for non-specialist readers, I included in it only my non-technical objections. I omitted the more mathematical parts of my arguments against the method. These are given below, under separate headings.

References below to formulae are to the ones given in Eisen et al 2018 and I have assumed that the reader has read section 2 of that article, which is where the formulae are given.

The method cannot reliably distinguish between cases where pairs of function words appear close together but infrequently and cases where they appear far apart but frequently

The method quite correctly aims to give less weight to function words that are far apart than those that are close together. However, its formula 1 is not good enough to distinguish between texts where, on the one hand, function words occur just a few times but close together and, on the other hand, texts where they occur often but further apart. The formula is liable to give the same answer, or very similar answers, for both texts, making it impossible for the subsequent steps in the method to distinguish between them.
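The conflation can be illustrated with a small sketch. It does not reproduce formula 1 itself, which is given in Eisen et al 2018; it assumes, purely for illustration, a weighting in which each co-occurrence contributes a geometrically discounted amount according to the distance between the two words. The function name `pair_weight` and the decay value 0.5 are hypothetical choices made so that the arithmetic is easy to follow.

```python
def pair_weight(distances, decay=0.5):
    """Sum one discounted contribution per co-occurrence of a word pair.

    `distances` lists, for every time the second word follows the first,
    how many words apart they were (1 = immediately adjacent).
    The geometric decay is an assumed stand-in for the published formula.
    """
    return sum(decay ** (d - 1) for d in distances)

# Text A: the pair occurs once, with the words side by side.
text_a = pair_weight([1])
# Text B: the pair occurs twice, each time two words apart.
text_b = pair_weight([2, 2])

print(text_a, text_b)  # prints: 1.0 1.0
```

Under this assumed weighting, one close occurrence and two more distant occurrences collapse into the same single number, so nothing computed from that number afterwards can tell the two texts apart.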

This defect was apparently not understood when the method was invented, since it is not mentioned in the first published definition of it (Segarra et al 2013). It was understood later, for Segarra et al 2015 contains the following admission: "Notice that [formula 1] combines into one similarity number the frequency of co-appearance of two words and the distance between these two words in each appearance, making both effects indistinguishable" (Segarra et al 2015, 5466; my emphasis).

It is important that readers, especially scholars who may be thinking of using the method in their own research, be alerted to this defect and, therefore, it was wrong of Eisen et al 2018 to omit the admission that Segarra et al 2015 had made, especially as the 2015 article is found in a journal that is unlikely to be read by humanities scholars.

The method takes no account of the relative sizes of the texts it compares, allowing a short text to carry the same weight as a long one

In analytical work that seeks to compare data sets of differing lengths, we need to ensure that we compare like with like. For example, suppose a certain collocation of words occurs the same number of times in each of two texts. If one text is ten times as long as the other, then our intuition tells us that it is misleading to say that both texts use the collocation equally. Conversely, suppose that the text which is ten times as long also uses the collocation ten times as often as the other text. In this case, although the collocation occurs much more often in one text than in the other, we may fairly say that both texts use it equally. If we do not have samples of approximately equal size, we might compensate for the disparity in size by dividing the number of occurrences of the collocation by the number of words in the corresponding text, in effect turning the raw numbers into proportions or percentages, so it becomes possible to compare them fairly.
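The compensation described above amounts to a one-line calculation. The sketch below uses invented counts and text lengths, chosen only to match the "ten times as long" scenario; the function name `rate_per_1000` is likewise a hypothetical label.

```python
def rate_per_1000(occurrences, total_words):
    """Express a raw count as occurrences per 1,000 words of text."""
    return 1000 * occurrences / total_words

# Invented figures: a short text of 2,000 words and a long one of 20,000.
short_text = rate_per_1000(5, 2_000)        # 2.5 per 1,000 words
long_same_count = rate_per_1000(5, 20_000)  # 0.25: same count, far lower rate
long_ten_times = rate_per_1000(50, 20_000)  # 2.5: ten times the count, same rate

print(short_text, long_same_count, long_ten_times)  # prints: 2.5 0.25 2.5
```

Equal raw counts turn out to be unequal rates, and a tenfold difference in raw counts turns out to be an equal rate, which is the like-for-like comparison the paragraph asks for.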

What this method does instead is to divide each adjacency distance from a word by the sum of all the distances from that word. It calls this normalization and it is defined by its formula 3. Put like that, it sounds reasonable, but some further thought reveals the problem with it. For example, suppose one text gives us the distances {2, 3} because it is a very short text. The method divides each number by the sum, in this case 5, to obtain {0.4, 0.6}. If another, much larger, text gives us the distances {200, 300} then the method divides each number by 500 and again obtains {0.4, 0.6}. As soon as the normalization is performed, all knowledge about the sizes of the texts being tested is obliterated, and it can therefore play no part in subsequent steps. A method like this, that simply disregards the sizes of the texts it is attributing, is liable to go wrong by putting large amounts of evidence on a parity with small amounts.
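The arithmetic of this example can be checked directly. The sketch below implements only the division-by-the-sum step as described in the paragraph above; the function name `normalize` is a hypothetical label, not taken from the published articles.

```python
def normalize(values):
    """Divide each value by the sum of all the values."""
    total = sum(values)
    return [v / total for v in values]

short_text = normalize([2, 3])      # values from a very short text
long_text = normalize([200, 300])   # values from a much larger text

print(short_text)  # prints: [0.4, 0.6]
print(long_text)   # prints: [0.4, 0.6]
```

After this step the two results are identical, and no later calculation can recover the fact that one of them rested on a hundred times as much evidence as the other.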

The method cannot satisfactorily handle texts from which some function words or pairs of words are absent, choosing unwisely to either ignore the absence or treat the words as if they were present, in both cases distorting the results

My ANQ article already explains that, in its entropy calculation, the method disregards pairs of function words that are present in some texts but absent from others. I showed that, in the case of Shakespeare and Marlowe, this unwise decision leads to about 28% of evidence -- ex hypothesi the most important evidence -- being excluded. What I explain below is how, in its normalization procedure, the method also does the opposite, i.e. it deems pairs of words to be present that are in fact absent.

The normalization formula (formula 3) breaks down if some function word happens not to be followed by any other function word (among the subset of words being searched for) within the ten-word windows that the method searches. That is because the denominator is then zero, and it is impossible to divide by zero. To get around this problem, the method deems that in such a case that function word is followed by every other function word in equal proportion! (Eisen et al 2018, 502). The inventors make no attempt to explain why this is a reasonable thing to do, when we know even from casual observation that words do not follow other words in equal proportion in any text.
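The workaround can be sketched as follows. This is a minimal reading of the rule as described above, not the inventors' code: the function name `normalize_row` is hypothetical, and it is an assumption here that the equal shares are spread over every entry in the row.

```python
def normalize_row(row):
    """Normalize one word's outgoing values; fall back to equal shares.

    When the row sums to zero, division is impossible, so every entry
    is simply deemed equally likely (the assumed reading of the rule).
    """
    total = sum(row)
    if total == 0:
        return [1 / len(row)] * len(row)
    return [v / total for v in row]

# A word that is never followed by any of the four words searched for.
print(normalize_row([0, 0, 0, 0]))  # prints: [0.25, 0.25, 0.25, 0.25]
# A word with genuine observations, for comparison.
print(normalize_row([1, 0, 3, 0]))  # prints: [0.25, 0.0, 0.75, 0.0]
```

The first row contains no observations at all, yet it emerges from the procedure looking like a measured distribution, indistinguishable in kind from the second.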

I also want to draw attention to the fact that the inventors of this method were originally candid in admitting to the fault that led them to exclude all the evidence that would cause their entropy calculation, given in formula 7, to fail because of an attempted division by zero. Segarra et al 2015 told us correctly: "This is undesirable because the often [sic] appearance of this transition in the text network P1 is a strong indication that this text was not written by the author whose profile network is P2" (Segarra et al 2015, 5467). That was a technical way of saying what is intuitively obvious, that if one function word is often followed by another in one text but never in the other, the explanation might be that they were written by different authors. Regrettably, Eisen et al 2018 omits this correct explanation and substitutes an incorrect one in its place, by saying that the purpose of the rule that excludes the evidence is to avoid "potential biasing for smaller profiles" (Eisen et al 2018, 503). This is nonsense, because the exclusion of the evidence takes place even when the texts being tested are large. The exclusion is triggered not by the text being too small, but by its not containing function words that the other text does contain. As I have shown with the examples of Shakespeare and Marlowe, even large canons have some function words that never follow some other function words. The point about excluding this evidence in order to avoid a bias because of small samples is thus seen to be false. The original, correct, explanation in Segarra et al 2015 should have been disclosed to readers of Eisen et al 2018.
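The effect of the exclusion can be seen in a toy calculation. Formula 7 is not reproduced here; the sketch assumes only that the entropy is a relative-entropy style sum of terms of the form p1 * log(p1 / p2), and it leaves out any further weighting the published formula may apply. The function name and the two small distributions are invented for illustration.

```python
import math

def relative_entropy_with_exclusions(p1, p2):
    """Sum p1 * log(p1 / p2), skipping terms where p2 is zero.

    Returns the sum over the terms that were kept, together with the
    positions of the terms that had to be dropped to avoid dividing by zero.
    """
    kept_sum, dropped = 0.0, []
    for i, (a, b) in enumerate(zip(p1, p2)):
        if a == 0:
            continue  # a transition absent from p1 contributes nothing
        if b == 0:
            dropped.append(i)  # log(a / 0) is undefined, so the term is excluded
            continue
        kept_sum += a * math.log(a / b)
    return kept_sum, dropped

# Invented example: half of text 1's transitions are of a kind
# that never occurs in text 2.
text_1 = [0.5, 0.5, 0.0]
text_2 = [0.5, 0.0, 0.5]

score, dropped = relative_entropy_with_exclusions(text_1, text_2)
print(score, dropped)  # prints: 0.0 [1]
```

In this invented case the score is zero, as it would be for two identical texts, because the only term that distinguishes them is the one that was excluded.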

The method asserts what it needs to prove, that its adjacency networks are Markov chains

In order to calculate entropy values, which is what the method needs in order to make authorship attributions, the word adjacency network it has constructed must be a Markov chain: if it is not, then the entropy formula cannot be used, and the method fails before it gets to the authorship attribution stage. It was therefore essential for the inventors to prove that their networks are in fact Markov chains. They make no attempt to do this. Instead, they simply say that their networks "can be interpreted" as Markov chains (Eisen et al 2018, 502) and carry on from there. The reader should look up a definition of Markov chains, for example on Wikipedia, to understand just how surreal is the idea that a play can be treated as a Markov chain.

To demonstrate that the networks are in fact Markov chains, the inventors had to show that their normalization calculations had given them a probability distribution, since that is a condition for a Markov chain. But they simply assert it, instead of proving it, and I showed in my ANQ article, with examples like 'devoid of', how far from obvious it is that the networks are the probability distributions they need to be. What’s worse, the way that their normalization formula (formula 3) has calculated the values used in subsequent steps -- by dividing each value by the total of all of them -- guarantees that their data always looks like a probability distribution, because all the numbers are between 0 and 1 and they add up to 1. Even if they were to pick adjacency numbers out of a hat, instead of measuring them from the play texts, their formula 3 would turn them into what looks like a probability distribution, allowing them to get away with claiming that the network "can be interpreted" as a Markov chain.
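The "numbers out of a hat" point is easy to verify. The sketch below draws arbitrary positive integers from a random number generator, which stands in for the hat, and applies the same division-by-the-total step; the seed, the range and the function name `normalize` are arbitrary choices for illustration.

```python
import random

def normalize(values):
    """Divide each value by the total of all of them."""
    total = sum(values)
    return [v / total for v in values]

rng = random.Random(0)  # fixed seed so the run is repeatable
numbers_from_a_hat = [rng.randint(1, 100) for _ in range(5)]
row = normalize(numbers_from_a_hat)

# The two superficial tests of a probability distribution both pass.
all_between_0_and_1 = all(0 <= p <= 1 for p in row)
sums_to_1 = abs(sum(row) - 1) < 1e-9

print(all_between_0_and_1, sums_to_1)  # prints: True True
```

Passing these two checks therefore says nothing about whether the numbers were measured from a text or have any probabilistic meaning.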

The complete absence of proof, or even a plausible argument, means that it is invalid to assume that the word adjacency networks are Markov chains, and the method therefore collapses before it gets to the attribution stage.

Works Cited

Eisen, M, Ribeiro, A, Segarra, S, and Egan, G. (2018). Stylometric Analysis of Early Modern Period English Plays, Digital Scholarship in the Humanities, 500-528.

Segarra, S, Eisen, M, Egan, G, and Ribeiro, A. (2016). Attributing the Authorship of the Henry VI Plays by Word Adjacency, Shakespeare Quarterly, 67: 232-56.

Segarra, S, Eisen, M, and Ribeiro, A. (2015). Authorship Attribution Through Function Word Adjacency Networks, Institute of Electrical and Electronic Engineers (IEEE) Transactions on Signal Processing, 63.20: 5464-78.

Segarra, S, Eisen, M, and Ribeiro, A. (2013). Authorship Attribution Using Function Words Adjacency Networks [accessed 9 February 2019].