How Big Data Helps Reveal Ghostwriters And Bust Plagiarists

Can we determine whether a certain writer actually penned a certain work? Using technological analysis, the answer is a reliable 'yes.'

Jacques Savoy

ST. GALLEN — At the beginning of this year, the Swiss universities in St. Gallen and Bern denounced the student practice of using ghostwriters to pass off work as their own.

Though universities are not yet using sophisticated technological tools to analyze student papers, the issue raises a number of questions in a host of applications. How can we identify the author of a letter, an anonymous e-mail or a contested will? Are there ways to bust plagiarists? Can we determine whether the text was written by a woman or a man? Can we detect the presence of a sexual predator in a chat? In tackling such issues, computer algorithms can provide answers whose reliability varies from 70% to 95%, depending on the type of problem and its context. Some examples:

Gary, aka Emile Ajar

In literature, authors sometimes write novels under assumed names. Romain Gary, for example, wrote under the pseudonym Emile Ajar in the 1970s. Technological tools can help to highlight similarities between two novels written under two separate names, and indicate whether a single author may have written both texts. Collaboration between writers raises the question of which parts of a work were written by whom, as in the case of the play The Two Noble Kinsmen (a collaboration between W. Shakespeare and J. Fletcher) or Psyche (P. Corneille and Molière).

Sometimes this gives rise to heated discussions. For example, several well-known plays are attributed to Molière, but stylistic studies emphasize their disturbing proximity to the writings of Pierre Corneille. As for Psyche, one can either detect a collaboration between two writers or argue that these pieces were written by Pierre Corneille.

Saint Paul

In the biblical texts of the 14 epistles of St. Paul, seven are unanimously recognized as the work of St. Paul himself, four by the majority of researchers, and two remain disputed. On the other hand, researchers unanimously agree that Hebrews was not written by St. Paul. Another example is the Book of Mormon, which is attributed to Joseph Smith but remains contested.


In politics, the use of ghostwriters raises no ethical problem, and the practice is nothing new. For example, George Washington rarely wrote his speeches, often leaving the editorial work to Alexander Hamilton or James Madison. But because the first U.S. president delivered an average of just three important speeches a year, this issue was largely insignificant. Since then, politics have changed. Modern American presidents now deliver a speech a day on average.

Analysis techniques

To determine a document's real author, several computer techniques focus on language, particularly frequently repeated words (the, that), pronouns (we, you, me) or auxiliary verbs (is, are). An analysis of frequent two-word combinations then complements these counts. Other big data strategies are based on formulations or expressions typical of a given author (like Jacques Chirac's use of the French word "abracadabrantesque" or General de Gaulle's use of "chienlit").

The average length of sentences is also telling. The distribution of nouns or the frequency of adjectives, pronouns and verbs can also help determine the probable author of a document. For example, Bill Clinton's style is characterized by a high frequency of pronouns while President Barack Obama uses more verbs.
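As a rough illustration of the measurements described above, the following Python sketch counts a handful of frequent words, the most common two-word combinations and the average sentence length of a text. The word list, the function names and the tokenization rule are assumptions made for this example, not the tools used by the researchers, and real studies rely on far longer word lists.

```python
import re
from collections import Counter

# Hypothetical, very short list of frequent words; chosen only for illustration.
FUNCTION_WORDS = ("the", "that", "we", "you", "me", "is", "are")


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def style_features(text):
    """Return simple stylistic measurements for one text."""
    words = tokenize(text)
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input

    # Share of all words taken up by each frequent word.
    freqs = {w: counts[w] / total for w in FUNCTION_WORDS}

    # Frequent combinations of two consecutive words.
    bigrams = Counter(zip(words, words[1:]))

    # Average sentence length in words, splitting on ., ! and ?
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    mean_len = len(words) / len(sentences) if sentences else 0.0

    return {
        "function_word_freq": freqs,
        "top_bigrams": bigrams.most_common(3),
        "mean_sentence_length": mean_len,
    }


if __name__ == "__main__":
    sample = "We are here. You are there. The cat is on the mat."
    print(style_features(sample))
```

Splitting sentences on punctuation is a deliberate simplification: abbreviations such as "St." would be counted as sentence ends, so a serious analysis would need a more careful sentence splitter.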


Of course, the use of these techniques requires that we have the texts of all the probable authors of a document. In the case of the universities in St. Gallen and Bern, this precondition is obviously impractical and hasn't been fulfilled. Given texts known to be written by an author, technology can analyze the likelihood of that person having written another document. The conclusion can be affirmative or negative, and the success rate varies between 65% and 90%. These values are still far from those of DNA testing, but the analytical techniques become more refined each year as the use of big data increases.
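To make the comparison step concrete, here is a minimal Python sketch that ranks candidate authors by how closely their known writing matches a disputed text. It assumes a simple nearest-profile approach: each text becomes a list of frequent-word shares, and profiles are compared by summing the absolute differences (Manhattan distance). This is a simplification chosen for illustration, not the method behind the success rates quoted above, and the word list and function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical short list of frequent words used to build each profile.
FUNCTION_WORDS = ("the", "that", "we", "you", "me", "is", "are", "i", "of", "and")


def profile(text):
    """Turn a text into a list of relative frequencies, one per frequent word."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def distance(p, q):
    """Manhattan distance between two profiles: smaller means more similar."""
    return sum(abs(a - b) for a, b in zip(p, q))


def rank_candidates(disputed_text, known_texts):
    """Rank candidate authors from closest to farthest.

    known_texts maps an author's name to a sample of that author's writing.
    """
    target = profile(disputed_text)
    scored = [
        (author, distance(target, profile(sample)))
        for author, sample in known_texts.items()
    ]
    return sorted(scored, key=lambda pair: pair[1])


if __name__ == "__main__":
    known = {
        "Author A": "We are here and we are happy. We are many.",
        "Author B": "The report of the committee is the basis of the plan.",
    }
    for author, score in rank_candidates("We are ready and we are sure.", known):
        print(author, round(score, 3))
```

A ranking like this can only point to the closest of the supplied candidates. If the real author's texts are missing from the set, the closest match is still wrong, which is why the precondition of having texts from all probable authors matters.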

Gender differences

Author profiles don't set out to determine the name of a writer, but instead to identify some of that writer's characteristics. For example, can we determine whether a text was written by a man or a woman, or the approximate age of the writer? Are there stylistic characteristics of each sex? The answer is yes. Women tend to use pronouns (I, we, you) and nouns related to social relationships (sister, friend) more frequently, and to express more feelings (joy, anxiety).

The typically masculine style is characterized by a higher frequency of determiners and prepositions (the, of), nouns (table, computer) or the use of numbers. In the blogosphere, men are distinguished by themes related to employment, sports or technology, while women tend to tackle topics of family, friends and food through more emotional language. Young people between ages 14 and 18 are more likely to use abbreviations ("lol"). They also tend to write shorter sentences and repeat words more frequently. In contrast, older people use longer sentences and have a richer vocabulary.


In Argentina, A Visit To World's Highest Solar Energy Park

With loans and solar panels from China, the massive solar park has been open for a year and is already powering the surrounding areas. Now the Chinese supplier is pushing for an expansion.

960,000 solar panels have been installed at the Cauchari park

Silvia Naishtat

CAUCHARI — Driving across the border with Chile into the northwest Argentine department of Susques, you may spot what looks like a black mass in the distance. Arriving at a 4,000-meter altitude in the municipality of Cauchari, what comes into view instead is an assembly of 960,000 solar panels. It is the world's highest photovoltaic (PV) park, which is also the second biggest solar energy facility in Latin America, after Mexico's Aguascalientes plant.

Spread over 800 hectares in an arid landscape, the Cauchari park has been operating for a year, and has so far turned sunshine into 315 megawatts of electricity, enough to power the local provincial capital of Jujuy through the national grid.

It has also generated some $50 million for the province, which Governor Gerardo Morales has allocated to building 239 schools.

Abundant sunshine, low temperatures

The physicist Martín Albornoz says Cauchari, which means "link to the sun," is exposed to the best solar radiation anywhere. The area has 260 days of sunshine a year, with no smog and relatively low temperatures, which helps keep the panels in optimal condition.

Its construction began with a loan of more than $331 million from China's Eximbank, which allowed the purchase of panels made in Shanghai. They arrived in Buenos Aires in 2,500 containers and were later trucked a considerable distance to the site in Cauchari. This was a titanic project that required 1,200 builders and 10-ton cranes, but it will save some 780,000 tons of CO2 emissions a year.

It is now run by 60 technicians. Its panels, with a 25-year guarantee, follow the sun's path and are cleaned twice a year. The plant is expected to have a service life of 40 years. Its choice of location was based on power lines traced in the 1990s to export power to Chile, now fed by the park.

Chinese engineers working in an office at the Cauchari park


Chinese want to expand

The plant belongs to the public-sector firm Jemse (Jujuy Energía y Minería), created in 2011 by the province's then-governor, Eduardo Fellner. Jemse's president, Felipe Albornoz, says that once the Chinese loans are repaid in 20 years, Cauchari will earn the province $600 million.

The Argentine Energy Ministry must now decide on the park's proposed expansion. The Chinese would invest $200 million, which would help install 400,000 additional panels and generate enough power for the entire province of Jujuy.

The park's CEO, Guillermo Hoerth, observes that state policies are key to turning Jujuy into a green province. "We must change the production model. The world is rapidly cutting fossil fuel emissions. This is a great opportunity," Hoerth says.

The province's energy chief, Mario Pizarro, says in turn that Susques and three other provincial districts are already self-sufficient with clean energy, and three other districts would soon follow.
