Plagiarism by Wikipedia editors
The next task was to download the XML version of each Wikipedia article and try to extract between one and five clean sentences. They have to be clean (i.e., "de-wikified") or there is no hope of using them in a search engine. That's because the entire sentence is searched inside of quotation marks. Each search looks like this:
-wikipedia -wiki "this is a complete sentence from an article"
When the sentence is inside quotation marks, any small variation within it can mean the difference between a hit and no hit. The cleaner the sentence, the better. The exclusion terms in front are there because I don't want mirrors of Wikipedia. I'm looking for plagiarism that goes the other direction: in other words, for articles that existed prior to the Wikipedia article.
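As a rough illustration of the query format shown above, here is a minimal Python sketch. The function name build_query and the handling of embedded quotation marks are assumptions made for this example; the original code is not shown in this article.

```python
def build_query(sentence: str) -> str:
    """Build an exact-phrase query that excludes Wikipedia mirrors.

    Hypothetical helper: only the output format comes from the article.
    """
    # Drop embedded double quotes (they would end the quoted phrase early)
    # and collapse runs of whitespace into single spaces.
    phrase = " ".join(sentence.replace('"', " ").split())
    # The two exclusion terms go in front, the sentence goes inside quotes.
    return f'-wikipedia -wiki "{phrase}"'


print(build_query("this is a complete sentence from an article"))
```

Because the whole phrase must match exactly, any leftover wiki markup in the sentence would most likely turn a real match into a miss, which is why the cleaning step matters.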
My code for finding clean sentences tried to extract up to five sentences from various places in each article. When I had a choice, I went for long sentences, for sentences with the longest average word length, and for sentences with the most proper names (most upper-case characters).
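The selection heuristics can be sketched roughly as follows in Python. This is an assumption-laden illustration, not the original code: the names split_sentences, score, and pick_sentences are hypothetical, the sentence splitter is deliberately naive, the way the three criteria are combined (word count first, then average word length, then upper-case count) is a guess, and the sketch ignores the "various places in each article" aspect.

```python
import re


def split_sentences(text: str) -> list[str]:
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def score(sentence: str) -> tuple[int, float, int]:
    # Assumed ranking key: longer sentences first, then longer average
    # word length, then more upper-case characters (a proxy for proper names).
    words = sentence.split()
    avg_word_len = sum(len(w) for w in words) / len(words)
    uppercase = sum(1 for ch in sentence if ch.isupper())
    return (len(words), avg_word_len, uppercase)


def pick_sentences(text: str, limit: int = 5, min_words: int = 6) -> list[str]:
    # min_words is an assumed cutoff; very short sentences make poor
    # exact-phrase searches because they match too many pages.
    candidates = [s for s in split_sentences(text) if len(s.split()) >= min_words]
    return sorted(candidates, key=score, reverse=True)[:limit]


article = (
    "Short one. Johann Sebastian Bach composed the Brandenburg Concertos "
    "in Germany. It was a big day for all of us there."
)
for sentence in pick_sentences(article):
    print(sentence)
```

An article for which pick_sentences returns an empty list corresponds to a file with no usable sentence.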
It turned out that my code threw out 28 percent of my files because not even one sentence could be extracted. Of the remaining 12,095 files, the average number of sentences per file was 2.38. Some had five, but many more had just one. Then I ran these through Scroogle, which required 29,000 searches, one for each sentence. After that, I threw out more files because Google found no hits. Now I had 5,867 Wikipedia articles with some data from Google: the sentences that hit and up to 10 URLs where it hit.
There was still too much noise in the data. For every site that takes content from Wikipedia and properly gives credit, there are several that take content and don't give credit. These sites are AdSense carriers, art galleries, travel agencies offering tour packages, sites selling classical music CDs, and everything else you can imagine. All the URLs were then sorted by domain, and each domain checked once manually. This took several hours a day for two weeks, and produced a list of 965 "scraper" domains. They were put in a file that I call "rogues."
I back-purged my data from Google and deleted any files that only had data from rogues. That got rid of 71 percent of the noise, and now I had just 1,682 files. I looked at each of these Wikipedia files side by side with the suspicious URL, and discovered that there was still a massive amount of noise. After an additional two weeks of manual purging, I ended up with 150 files. These were then looked at once again, and the highlighting inserted manually. Another eight were rejected, leaving the 142 examples of plagiarism below.
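The back-purge step might look something like the Python sketch below. The data layout (a dictionary mapping each article to its list of hit URLs), the function names, and the example domains are all assumptions for illustration. Note one difference from the description above: this sketch also strips rogue URLs from the files it keeps, not just files that contain nothing else.

```python
from urllib.parse import urlsplit


def domain_of(url: str) -> str:
    # Lower-case host name with any leading "www." removed.
    host = (urlsplit(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def purge_rogues(hits: dict[str, list[str]], rogues: set[str]) -> dict[str, list[str]]:
    # Keep an article only if at least one hit comes from a non-rogue domain.
    kept = {}
    for article, urls in hits.items():
        clean = [u for u in urls if domain_of(u) not in rogues]
        if clean:
            kept[article] = clean
    return kept


# Hypothetical data: article "A" was found only on a scraper site.
hits = {
    "A": ["http://www.scraper.example/a"],
    "B": ["http://scraper.example/b", "https://original.example/essay"],
}
rogues = {"scraper.example"}
print(purge_rogues(hits, rogues))
```

Matching on the domain rather than the full URL is what lets a single manual check per domain carry over to every hit from that site.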
The bottom line is that I tried to investigate plagiarism using a sample of about one percent of Wikipedia's 1.46 million English-language articles. I found plagiarism in about one percent of those I examined.
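The figures quoted so far can be cross-checked with a few lines of Python. The only inferred number is the size of the starting sample, which is back-calculated here on the assumption that the 12,095 remaining files are exactly 72 percent of it; everything else comes straight from the text.

```python
total_articles = 1_460_000   # English-language articles
with_sentences = 12_095      # files that yielded at least one sentence
with_hits = 5_867            # files with at least one Google hit
after_purge = 1_682          # files left after removing rogue-only hits
confirmed = 142              # final plagiarism examples

# Assumption: 28 percent of the starting sample was discarded.
sample = with_sentences / (1 - 0.28)

searches = with_sentences * 2.38                 # close to the 29,000 quoted
sample_share = sample / total_articles           # a little over one percent
purged_share = 1 - after_purge / with_hits       # about 71 percent
rate_of_sample = confirmed / sample              # a little under one percent
rate_of_searched = confirmed / with_sentences    # a little over one percent

print(f"starting sample    ~{sample:,.0f} articles")
print(f"searches           ~{searches:,.0f}")
print(f"share of Wikipedia  {sample_share:.2%}")
print(f"removed by purge    {purged_share:.1%}")
print(f"plagiarism rate     {rate_of_sample:.2%} to {rate_of_searched:.2%}")
```

Whether the rate is measured against the full starting sample or only against the files that could actually be searched, it lands close to the one percent stated above.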
This one percent figure is conservative. For one thing, the nature of my original sample meant that several public-domain encyclopedias covered many of the biographies. For example, the 1911 Britannica is in the public domain (over 12,000 Wikipedia articles incorporate text from this edition of Britannica). An old Catholic encyclopedia, a Jewish encyclopedia, and an Australian encyclopedia are also in the public domain, and some government sites have public-domain history sections. I ended up excluding many articles as soon as I saw an attribution on the Wikipedia article, indicating that portions were copied from a public domain source.
Another reason my one percent figure is conservative is that my average of 2.38 sentences per article undoubtedly missed a lot of plagiarized content. If the entire Wikipedia article was plagiarized, I should have caught it. But frequently only a couple of paragraphs are plagiarized, and my sentences could have come from non-plagiarized portions of the Wikipedia article. Finally, I assumed that the original content was still online, that Google indexed it, and that Google's algorithm performed well enough to produce it.
Because of these limitations, I believe that the actual plagiarism rate on Wikipedia is at least two percent, instead of the one percent I was able to find.
If Wikipedia's editors can plagiarize from others, does this mean that reporters can plagiarize from Wikipedia? Not at all. Don't even think about it. When Wikipedians catch a reporter using material from Wikipedia without attribution, they become indignant and complain to his editor.
Implications of this study
The position of Wikimedia Foundation is that it is a service provider, and any issues raised by plagiarism, insofar as they are also copyright violations, are covered under the Digital Millennium Copyright Act (DMCA). Jimmy Wales is the registered agent for the Foundation under the DMCA. Basically, this position means that once the Foundation is made aware of a specific problem, a take-down of the violating material absolves the Foundation of liability.
However, it is not clear that Wikimedia Foundation qualifies as a service provider. It owns and operates Wikipedia's servers and its employees have ultimate control over Wikipedia software. A hierarchy of power to moderate content is controlled by this software. A Wikipedian can be an arbitrator, a steward, a bureaucrat, an administrator, a user, or an "anon" (the last is a user who edits without a username, and only the IP address is shown). There are over one thousand administrators, who have the power to ban or block users and anons, and delete content, or protect and unprotect articles.
This formal structure of power is designed to enforce the many policies, formal and informal, about what constitutes appropriate content on Wikipedia. I believe that this structure in itself means that the Foundation is already much closer to a "publisher" than a "service provider." In the future this will become even more evident. Wikipedia is available for downloading as a complete set of articles, and already it is appearing pre-installed on third-party hardware, from flash memory sticks to servers. There is an agreement to pre-install it on those $120 laptops destined for Africa. One hears talk about a possible print version of Wikipedia.
The U.S. Copyright Office is bound by a definition of "service provider" because of the language in the DMCA:
Definition: For purposes of section 512(c), a "service provider" is defined as a provider of online services or network access, or the operator of facilities therefor, including an entity offering the transmission, routing, or providing of connections for digital online communications, between or among points specified by a user, of material of the user's choosing, without modification to the content of the material as sent or received. [emphasis added]
The fact that the Office has accepted a designation of an agent and has included it in the Office's directory of agents should not be construed as a judgment by the Office that the designation is sufficient or error-free.
The moderation that occurs on Wikipedia apparently disqualifies the Foundation as a "service provider," based on the last phrase of that definition. If this is true, then what are the implications for Wikipedia? If it is a publisher rather than a service provider, what does this mean?
Primarily it means that the Foundation requires more due diligence to avoid copyright violations. Administrators already make efforts to patrol copyright violations on images posted by users to illustrate articles, but no meaningful efforts have ever been made to detect plagiarism. If the Foundation is not protected by the special status of "service provider" as defined by the DMCA, then a lack of prior due diligence will increase the Foundation's liability for copyright violations. (A similar situation, discussed elsewhere, exists with respect to defamation of character and invasion of privacy issues, due to Section 230 of the Communications Decency Act.)
I believe that the Foundation should launch a project to scan for plagiarism on all 1.46 million articles. This is a major task. Requesting special but temporary automated-access arrangements with Google and/or Yahoo is the easy part. The hard part is separating the signal from the noise. Articles that have been in Wikipedia more than a few months have been scraped widely, and attempting to determine who plagiarized whom is not something a program can do. But a thousand administrators and tens of thousands of eager users can do it. As just one person, I did one percent of Wikipedia in six weeks, which shows that it can be done and that the results are worthwhile. Now it's up to Wikimedia Foundation to finish the job.
(about a user who plagiarized a few sentences from two movie descriptions)
We need to deal with such activities with absolute harshness, no mercy, because this kind of plagiarism is 100% at odds with all of our core principles. All admins are invited to block any and all similar users on sight. Be bold. If someone takes you to ArbCom over it, have no fear. We must not tolerate plagiarism in the least.
— Jimbo Wales 04:28, 28 December 2005
There is no need nor intention to be vindictive, but at the same time, we can not tolerate plagiarism. Let me say quite firmly that for me, the legal issues are important, but far far far more important are the moral issues. We want to be able, all of us, to point at Wikipedia and say: we made it ourselves, fair and square.
— Jimbo Wales 15:54, 28 December 2005