Revelations

Here are the results of the first study of plagiarism in Wikipedia that has ever been undertaken. I started with a list of 16,750 Wikipedia articles, taken from a partial list of Wikipedia biographies of persons born before 1890. I chose this list only because it was available and its size was manageable.

The next task was to download the XML version of each Wikipedia article and try to extract between one and five clean sentences. They have to be clean (i.e., "de-wikified") or there is no hope of using them in a search engine. That's because the entire sentence is searched inside of quotation marks. Each search looks like this:

-wikipedia -wiki "this is a complete sentence from an article"

When the sentence is inside quotation marks, any little variation inside the sentence can mean the difference between a hit and no hit. The cleaner the sentence, the better. The reason for the exclusion terms in front is that I don't want mirrors of Wikipedia. I'm looking for plagiarism that goes the other direction: in other words, for articles that existed prior to the Wikipedia article.
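As a rough illustration of the query format (not the code used in the study), the search string can be assembled as follows. The function name build_query and its parameters are hypothetical; the only detail taken from the text is the shape of the query itself.

```python
def build_query(sentence: str, exclusions=("wikipedia", "wiki")) -> str:
    """Return an exact-phrase query with the exclusion terms in front."""
    # Drop double quotes (they would end the phrase early) and collapse whitespace.
    phrase = " ".join(sentence.replace('"', " ").split())
    # Each exclusion term is prefixed with a minus sign.
    excluded = " ".join(f"-{term}" for term in exclusions)
    return f'{excluded} "{phrase}"'


print(build_query("this is a complete sentence from an article"))
```

Stripping stray quotation marks matters for the reason given above: anything that breaks the quoted phrase changes what the search engine is asked to match.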

My code for finding clean sentences tried to extract up to five sentences from various places in each article. When I had a choice, I went for long sentences, for sentences with the longest average word length, and for sentences with the most proper names (most upper-case characters).
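A minimal sketch of that selection heuristic might look like this. It is an assumption-laden illustration, not the original code: the text does not say how the three preferences were weighted, so this version simply ranks by word count, then average word length, then upper-case count. The sentence splitter is deliberately naive, the min_words threshold is invented, and the sketch does not try to sample from various places in the article.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive split: end punctuation, whitespace, then an upper-case letter.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text)
    return [p.strip() for p in parts if p.strip()]


def score(sentence: str) -> tuple[int, float, int]:
    words = sentence.split()
    avg_word_len = sum(len(w) for w in words) / len(words)
    upper = sum(1 for ch in sentence if ch.isupper())
    # Assumed ordering: longer sentences first, then longer words, then more capitals.
    return (len(words), avg_word_len, upper)


def pick_sentences(text: str, limit: int = 5, min_words: int = 8) -> list[str]:
    # min_words is a hypothetical cutoff for sentences too short to be distinctive.
    candidates = [s for s in split_sentences(text) if len(s.split()) >= min_words]
    return sorted(candidates, key=score, reverse=True)[:limit]
```

An article with no sentence that survives the filter yields an empty list, which corresponds to a file being thrown out before any search is run.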

It turned out that my code threw out 28 percent of my files because not even one sentence could be extracted. Of the remaining 12,095 files, the average number of sentences per file was 2.38. Some had five, but many more had just one. Then I ran these through Scroogle, which required about 29,000 searches, one for each sentence. After that, I threw out more files because Google found no hits. Now I had 5,867 Wikipedia articles with some data from Google: the sentences that hit and, for each, up to 10 URLs where it hit.

There was still too much noise in the data. For every site that takes content from Wikipedia and properly gives credit, there are several that take content and don't give credit. These sites are AdSense carriers, art galleries, travel agencies offering tour packages, sites selling classical music CDs, and everything else you can imagine. All the URLs were then sorted by domain, and each domain checked once manually. This took several hours a day for two weeks, and produced a list of 965 "scraper" domains. They were put in a file that I call "rogues."
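Sorting the URLs by domain, so that each domain only has to be checked once by hand, could be sketched like this. The helper names and the example URLs are hypothetical, and treating "www.example.com" and "example.com" as the same domain is an assumption.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    # urlsplit lower-cases the hostname; a leading "www." is dropped (assumption).
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def group_by_domain(urls):
    groups = defaultdict(list)
    for url in urls:
        groups[domain_of(url)].append(url)
    # Alphabetical order makes the manual pass through the domains easier.
    return dict(sorted(groups.items()))


groups = group_by_domain([
    "http://www.example.com/a",
    "http://example.org/x",
    "http://example.com/b",
])
for domain, urls in groups.items():
    print(domain, len(urls))
```

The judgment about whether a domain is a scraper still has to be made by a person; the grouping only reduces how many times that judgment is needed.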

I back-purged my data from Google and deleted any files that only had data from rogues. That got rid of 71 percent of the noise, and now I had just 1,682 files. I looked at each of these Wikipedia files side by side with the suspicious URL, and discovered that there was still a massive amount of noise. After an additional two weeks of manual purging, I ended up with 150 files. I looked at these once again and inserted the highlighting manually. Another eight were rejected, leaving the 142 examples of plagiarism below.
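The back-purge step can be sketched as follows, assuming the Google data is held as a mapping from article title to the list of URLs that hit. That data layout, the function names, and the example domains are all hypothetical. The sketch also drops rogue URLs from the files it keeps, which goes slightly beyond what the text describes.

```python
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def back_purge(hits_by_article: dict, rogues: set) -> dict:
    kept = {}
    for title, urls in hits_by_article.items():
        # Keep only hits whose domain is not on the rogues list.
        clean = [u for u in urls if domain_of(u) not in rogues]
        # A file with nothing but rogue hits is deleted entirely.
        if clean:
            kept[title] = clean
    return kept


hits = {
    "Article A": ["http://scraper.example/a"],
    "Article B": ["http://scraper.example/b", "http://original.example/b"],
}
print(back_purge(hits, {"scraper.example"}))
```

Everything that survives this filter still needs the side-by-side manual comparison, since a non-rogue hit only means the domain has not been identified as a scraper.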

The bottom line is that I tried to investigate plagiarism using a sample of about one percent of Wikipedia's 1.46 million English-language articles. I found plagiarism in about one percent of those I examined.
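The percentages quoted in this study can be rechecked from the counts reported above. The short script below only restates that arithmetic; the variable names are mine.

```python
# Counts reported in the text.
articles_sampled = 16_750
articles_with_sentences = 12_095
avg_sentences_per_article = 2.38
articles_with_hits = 5_867
articles_after_purge = 1_682
plagiarism_examples = 142
wikipedia_articles = 1_460_000


def pct(part: float, whole: float) -> float:
    return 100.0 * part / whole


# Share of files discarded because no clean sentence could be extracted.
discarded_pct = pct(articles_sampled - articles_with_sentences, articles_sampled)
# Number of searches: one per extracted sentence.
searches = articles_with_sentences * avg_sentences_per_article
# Share of files removed by the rogue-domain purge.
purged_pct = pct(articles_with_hits - articles_after_purge, articles_with_hits)
# Size of the sample relative to all English-language articles.
sample_share_pct = pct(articles_sampled, wikipedia_articles)
# Plagiarism found, as a share of the articles in the sample.
found_rate_pct = pct(plagiarism_examples, articles_sampled)

print(f"discarded: {discarded_pct:.1f}%  searches: {searches:.0f}")
print(f"purged: {purged_pct:.1f}%  sample share: {sample_share_pct:.2f}%")
print(f"plagiarism found: {found_rate_pct:.2f}% of sampled articles")
```

The results come to roughly 27.8 percent discarded, about 28,800 searches, 71.3 percent purged, a sample of 1.15 percent of Wikipedia, and plagiarism found in 0.85 percent of the sampled articles, which is consistent with the rounded figures in the text.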

This one percent figure is conservative. For one thing, the nature of my original sample meant that several public-domain encyclopedias covered many of the biographies. For example, the 1911 Britannica is in the public domain (over 12,000 Wikipedia articles incorporate text from this edition of Britannica). An old Catholic encyclopedia, a Jewish encyclopedia, and an Australian encyclopedia are also in the public domain, and some government sites have public-domain history sections. I ended up excluding many articles as soon as I saw an attribution on the Wikipedia article, indicating that portions were copied from a public domain source.

Another reason my one percent figure is conservative is that my average of 2.38 sentences per article undoubtedly missed a lot of plagiarized content. If the entire Wikipedia article was plagiarized, I should have caught it. But frequently only a couple of paragraphs are plagiarized, and my sentences could have come from non-plagiarized portions of the Wikipedia article. Finally, I assumed that the original content was still online, that Google had indexed it, and that Google's algorithm performed well enough to produce it.

Because of these limitations, I believe that the actual plagiarism rate on Wikipedia is at least two percent, instead of the one percent I was able to find.

Adam Asnyk	Adam Lindsay Gordon	Adam Sedgwick	Alain LeRoy Locke
Albert Bassermann	Albert Fonó	Albert Thomas	Alois Jirásek
Alonzo M. Clark	Anna Magdalena Bach	Anne Bradstreet	Arthur Foote
Arthur Otway	August Zaleski	Bat Masterson	Belle Moskowitz
Benjamin Waterhouse	Bernhard Karlgren	Bruce Rogers	Channing H. Cox
Charles Camsell	Charles Fabry	Charles Klein	Charles Manly
Charles Robinson	Charles W. Leng	Charles Wheatstone	Charles Young
Clement Studebaker	Clifford Whittingham Beers	Cuno Amiet	Édouard Vuillard
Elizabeth O'Neill Verner	Elwood Haynes	Ernest Everett Just	Ernest Malinowski
Ernst Mach	F. Matthias Alexander	Félix Vallotton	Francis Cunningham
Frank E. Lucas	Gabriel Grovlez	George Arliss	George Augustus Selwyn
George Formby (Senior)	Georges Bizet	Gopabandhu Das	Guillermo Valencia
Gurdon Saltonstall Hubbard	Hans Christian Ørsted	Hans Rottenhammer	Harold Brighouse
Harry Beaumont	Henri Pitot	Henry Bell	Henry H. Blood
Henry Kimball Hadley	Henry Nehrling	Henry Williams (missionary)	Henryk Arctowski
Henryk Sienkiewicz	Herbert Stothart	Hisashige Tanaka	Horace Pippin
Hulbert Footner	James Clark Ross	James E. West (Scouting)	James Ensor
James Francis Smith	James H. Price	James MacLaine	James Reddy Clendon
James Tait	Jean-Pierre Duport	Johann Zoffany	John Dubois
John Glover (general)	John Herbert Claiborne	John Macoun	John Mercer (scientist)
John Reresby	John Rutledge	John Skene	John Struthers (biologist)
John Tebbutt	Johnny Burke	Jones Very	Josef Thorak
Joseph Pitton de Tournefort	Józef Wybicki	Julia Tutwiler	Julio C. Tello
Julius L. Meier	Jupiter Hammon	Landon Carter	Leonard Woolley
Lewis Sperry Chafer	Lewis Tappan	Lorenzino de' Medici	Ludwig Binswanger
Martin Grove Brumbaugh	Mary Church Terrell	Matteo Ricci	Max Pechstein
Melvin Jones	Mercy Otis Warren	Meredith Miles Marmaduke	Michael Thonet
Milton Bradley	Murray Seasongood	Nels H. Smith	Nicolaas Witsen
Octaviano Ambrosio Larrazolo	Olive Higgins Prouty	Olive Schreiner	Oswald Veblen
Pete Hill	Peter Lely	Philip William Otterbein	Pierre de Fermat
Pieter Post	Radomir Putnik	Rebecca Nurse	Richard Willstätter
Sol White	States Rights Gist	Thomas Ashby	Thomas Coke (Methodist)
Thomas Gisborne	Thomas Willis	Vilhelm Hammershøi	Virginia Dare
Vittorio Emanuele Orlando	Wilhelm Nusselt	William Coddington	William D. Upshaw
William E. Johnson	William M. Meredith	William Pepperell Montague	William Watts Folwell
Zechariah Chafee	Zénobe Gramme

If Wikipedia's editors can plagiarize from others, does this mean that reporters can plagiarize from Wikipedia? Not at all. Don't even think about it. When Wikipedians catch a reporter using material from Wikipedia without attribution, they become indignant and complain to his editor.

Implications of this study

The position of the Wikimedia Foundation is that it is a service provider, and any issues raised by plagiarism, insofar as they are also copyright violations, are covered under the Digital Millennium Copyright Act (DMCA). Jimmy Wales is the registered agent for the Foundation under the DMCA. Basically, this position means that once the Foundation is made aware of a specific problem, a take-down of the violating material absolves the Foundation of liability.

However, it is not clear that the Wikimedia Foundation qualifies as a service provider. It owns and operates Wikipedia's servers and its employees have ultimate control over Wikipedia software. A hierarchy of power to moderate content is controlled by this software. A Wikipedian can be an arbitrator, a steward, a bureaucrat, an administrator, a user, or an "anon" (the last is a user who edits without a username, so that only the IP address is shown). There are over one thousand administrators, who have the power to ban or block users and anons, to delete content, and to protect and unprotect articles.

This formal structure of power is designed to enforce the many policies, formal and informal, about what constitutes appropriate content on Wikipedia. I believe that this structure in itself means that the Foundation is already much closer to a "publisher" than a "service provider." In the future this will become even more evident. Wikipedia is available for downloading as a complete set of articles, and already it is appearing pre-installed on third-party hardware, from flash memory sticks to servers. There is an agreement to pre-install it on those $120 laptops destined for Africa. One hears talk about a possible print version of Wikipedia.

The U.S. Copyright Office is bound by a definition of "service provider" because of the language in the DMCA:

The fact that the Office has accepted a designation of an agent and has included it in the Office's directory of agents should not be construed as a judgment by the Office that the designation is sufficient or error-free.

Definition: For purposes of section 512(c), a "service provider" is defined as a provider of online services or network access, or the operator of facilities therefor, including an entity offering the transmission, routing, or providing of connections for digital online communications, between or among points specified by a user, of material of the user's choosing, without modification to the content of the material as sent or received. [emphasis added]

The moderation that occurs on Wikipedia apparently disqualifies the Foundation as a "service provider," based on that last phrase: "without modification to the content of the material as sent or received." If this is true, then what are the implications for Wikipedia? If it is a publisher rather than a service provider, what does this mean?

Primarily it means that the Foundation must exercise more due diligence to avoid copyright violations. Administrators already make efforts to patrol copyright violations on images posted by users to illustrate articles, but no meaningful efforts have ever been made to detect plagiarism. If the Foundation is not protected by the special status of "service provider" as defined by the DMCA, then a lack of prior due diligence will increase the Foundation's liability for copyright violations. (A similar situation, discussed elsewhere, exists with respect to defamation of character and invasion of privacy issues, due to Section 230 of the Communications Decency Act.)

I believe that the Foundation should launch a project to scan all 1.46 million articles for plagiarism. This is a major task. Requesting special but temporary automated-access arrangements with Google and/or Yahoo is the easy part. The hard part is separating the signal from the noise. Articles that have been in Wikipedia more than a few months have been scraped widely, and attempting to determine who plagiarized whom is not something a program can do. But a thousand administrators and tens of thousands of eager users can do it. As just one person, I did one percent of Wikipedia in six weeks, which shows that it can be done and that the results are worthwhile. Now it's up to the Wikimedia Foundation to finish the job.

Jimmy Wales comments on plagiarism
(about a user who plagiarized a few sentences from two movie descriptions)
We need to deal with such activities with absolute harshness, no mercy, because this kind of plagiarism is 100% at odds with all of our core principles. All admins are invited to block any and all similar users on sight. Be bold. If someone takes you to ArbCom over it, have no fear. We must not tolerate plagiarism in the least.

— Jimbo Wales 04:28, 28 December 2005

There is no need nor intention to be vindictive, but at the same time, we can not tolerate plagiarism. Let me say quite firmly that for me, the legal issues are important, but far far far more important are the moral issues. We want to be able, all of us, to point at Wikipedia and say: we made it ourselves, fair and square.

— Jimbo Wales 15:54, 28 December 2005

Source: Wikipedia-watch.org