How Big Data Helps Reveal Ghostwriters And Bust Plagiarists
Can we determine whether a certain writer actually penned a certain work? Using technological analysis, researchers can now answer with a reliable 'yes.'
ST. GALLEN — At the beginning of this year, the Swiss universities in St. Gallen and Bern denounced the student practice of using ghostwriters to pass off work as their own.
Though universities are not yet using sophisticated technological tools to analyze student papers, the issue raises a number of questions in a host of applications. How can we identify the author of a letter, an anonymous e-mail or a contested will? Are there ways to bust plagiarists? Can we determine whether the text was written by a woman or a man? Can we detect the presence of a sexual predator in a chat? In tackling such issues, computer algorithms can provide answers whose reliability varies from 70% to 95%, depending on the type of problem and its context. Some examples:
Gary, aka Emile Ajar
In literature, authors sometimes write novels under assumed names. Romain Gary, for example, wrote under the pseudonym Emile Ajar in the 1970s. Technological tools can help to highlight similarities between two novels written under two separate names, and indicate whether a single author may have written both texts. Collaboration between writers raises the question of which parts of a work were written by whom, as in the case of the play The Two Noble Kinsmen (a collaboration between W. Shakespeare and J. Fletcher) or Psyche (P. Corneille and Molière).
Sometimes this gives rise to heated discussions. For example, several well-known plays are attributed to Molière, but stylistic studies emphasize their striking similarity to the writings of Pierre Corneille. As for Psyche, one can detect a collaboration between two writers, or argue that these passages were written by Pierre Corneille alone.
Of the 14 biblical epistles attributed to St. Paul, seven are unanimously recognized as his own work, four are accepted by a majority of researchers, and two remain disputed. On the other hand, researchers unanimously agree that Hebrews was not written by St. Paul. Another example is the Book of Mormon, which is attributed to Joseph Smith but remains contested.
In politics, the use of ghostwriters raises no ethical problem, and the practice is nothing new. For example, George Washington rarely wrote his speeches, often leaving the editorial work to Alexander Hamilton or James Madison. But because the first U.S. president delivered an average of just three important speeches a year, this issue was largely insignificant. Since then, politics has changed. Modern American presidents now deliver a speech a day on average.
To determine a document's real author, several computer techniques focus on language, particularly common function words (the, this, that), pronouns (we, you, me) and auxiliary verbs (is, are). An analysis of frequent two-word combinations can then corroborate the attribution. Other big data strategies are based on formulations or expressions typical of a given author (like Jacques Chirac's use of the French word "abracadabrantesque" or General de Gaulle's use of "chienlit").
The average length of sentences is also telling. The distribution of nouns and the density of adjectives, pronouns and verbs can also help determine the probable author of a document. For example, Bill Clinton's style is characterized by a high frequency of pronouns, while Barack Obama uses more verbs.
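The function-word counting described above can be sketched in a few lines of code. This is a minimal illustration, not the method used by any tool mentioned in the article: the word list, the cosine-similarity measure, and the sample texts are all assumptions chosen for clarity.

```python
import re
from collections import Counter
from math import sqrt

# Illustrative list of function words (articles, pronouns, auxiliaries),
# the kinds of features the article says attribution techniques count.
FUNCTION_WORDS = ["the", "this", "that", "we", "you", "me", "is", "are",
                  "of", "and", "to", "in", "a", "it"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]

def cosine(a, b):
    """Cosine similarity between two frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def most_likely_author(disputed, candidates):
    """candidates: dict mapping author name -> known text sample.
    Returns the author whose function-word profile best matches."""
    target = profile(disputed)
    return max(candidates,
               key=lambda name: cosine(profile(candidates[name]), target))
```

Real systems use far larger feature sets (bigrams, sentence length, part-of-speech distributions) and much longer reference texts, which is why their reliability depends so heavily on how much known writing is available.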
Of course, the use of these technical functions requires that we have texts from all the probable authors of a document. In the case of the universities in St. Gallen and Bern, this precondition obviously cannot be fulfilled in practice. Given known texts written by an author, technology can analyze the likelihood of that person having written another document. The conclusion can be affirmative or negative, and the success rate varies between 65% and 90%. These values are still far from those of DNA testing, but the analytical techniques become more refined each year as the use of big data increases.
Author profiles don't set out to determine the name of a writer, but instead to identify some of that writer's characteristics. For example, can we determine whether a text was written by a man or a woman, or the approximate age of the writer? Are there stylistic characteristics of each sex? The answer is yes. Women tend to use pronouns more frequently (I, we, you), nouns related to social relationships (sister, friend), and to express more feelings (joy, anxiety).
The typically masculine style is characterized by a higher frequency of determiners (the, a, of), nouns (table, computer) or the use of numbers. In the blogosphere, men are distinguished by themes related to employment, sports or technology, while women tend to tackle topics of family, friends and food in more emotional language. Young people between ages 14 and 18 are more likely to use abbreviations ("lol"). They also tend to write shorter sentences and more frequently repeat words. In contrast, older people use longer sentences and have a richer vocabulary.
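The profiling cues listed above (pronoun use, netspeak abbreviations, sentence length) can be turned into simple numeric features. The word lists below are illustrative assumptions, not drawn from any published profiling model:

```python
import re

# Hypothetical feature lists, inspired by the cues the article mentions.
PRONOUNS = {"i", "we", "you", "me", "my", "our", "your"}
NETSPEAK = {"lol", "omg", "brb", "btw"}

def style_features(text):
    """Compute three illustrative profiling features for a text:
    pronoun rate, abbreviation ("netspeak") rate, and average
    sentence length in words."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1
    return {
        "pronoun_rate": sum(w in PRONOUNS for w in words) / total,
        "netspeak_rate": sum(w in NETSPEAK for w in words) / total,
        "avg_sentence_len": total / (len(sentences) or 1),
    }
```

A profiling system would feed features like these, computed over many known-author texts, into a classifier; the feature extraction itself is the straightforward part.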