Need a paper? Fake it

May 9, 2005
Jeffrey Perkel

For untenured professors, the pressure to publish is intense. But it's unlikely any professor would be desperate enough to use a tool that a trio of Massachusetts Institute of Technology graduate students recently dreamed up. Fed up with announcements for conferences they deemed of dubious scientific integrity, Jeremy Stribling, Max Krohn, and Dan Aguayo produced a Web-based application called SCIgen (http://www.pdos.csail.mit.edu/scigen/) to generate random, nonsense computer science papers, complete with figures, graphs, and references. It worked: One of their papers, entitled "Rooter: A Methodology for the Typical Unification of Access Points and Redundancy," was accepted without review for the 9th World Multi-Conference on Systemics, Cybernetics and Informatics, to be held in July in Orlando, Florida.

The system relies on "skeleton sentences" and a lexicon of computer science buzzwords. Stribling, Krohn, and Aguayo combed the computer science literature to compile the lexicon, which Stribling says contains 3,500 or so phrases and terms. "It's kind of like Mad Libs," says Stribling. "You have sentences and blanks you have to fill in." The resulting text is grammatically correct, but meaningless.

Repurposing SCIgen for the life sciences, or any field for that matter, would be relatively straightforward, says Stribling, as long as the field is rife with buzzwords. "You'd have to write new sentences," he says. "That took us a week to two weeks, pretty solid work." All it would take is a context-free grammar (the computer equivalent of eighth-grade sentence diagrams) and a lexicon, says Mark Craven, an assistant professor of biostatistics and medical informatics at the University of Wisconsin, Madison, who develops algorithms for analyzing biomedical literature.
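
To give a rough sense of what a context-free grammar plus a lexicon looks like in practice, here is a minimal Python sketch. It is not SCIgen's actual code: the grammar, the word lists, and the function names (`expand`, `generate_sentence`) are all invented for illustration. Each uppercase symbol is a "blank" that gets replaced by one of its possible expansions, chosen at random, until only ordinary words remain.

```python
import random

# Hypothetical toy grammar, not taken from SCIgen. Uppercase keys are
# nonterminals (the "blanks"); every other string is emitted as-is.
GRAMMAR = {
    "SENTENCE": [
        ["we", "VERB", "that", "NOUN_PHRASE", "and", "NOUN_PHRASE", "are", "ADJ", "."],
        ["our", "NOUN", "VERB", "NOUN_PHRASE", "."],
    ],
    "NOUN_PHRASE": [
        ["ADJ", "NOUN"],
        ["the", "NOUN"],
        ["ADJ", "NOUN_PHRASE"],  # recursive rule: stacks extra adjectives
    ],
    "NOUN": [["gene expression"], ["signaling pathways"], ["protein folding"]],
    "ADJ": [["robust"], ["scalable"], ["homologous"], ["stochastic"]],
    "VERB": [["demonstrate"], ["confirm"], ["disprove"]],
}


def expand(symbol, grammar, rng):
    """Recursively replace a symbol with a randomly chosen production."""
    if symbol not in grammar:
        return [symbol]  # terminal: a literal word or phrase
    production = rng.choice(grammar[symbol])
    words = []
    for part in production:
        words.extend(expand(part, grammar, rng))
    return words


def generate_sentence(grammar=GRAMMAR, seed=None):
    """Generate one grammatical but meaningless sentence."""
    rng = random.Random(seed)
    text = " ".join(expand("SENTENCE", grammar, rng))
    text = text.replace(" .", ".")  # attach the final period
    return text[0].upper() + text[1:]


if __name__ == "__main__":
    for _ in range(3):
        print(generate_sentence())
```

Swapping in a different field would mean replacing the word lists and writing new sentence rules, which is the week or two of work Stribling describes.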

Compiling that lexicon could be a significant hurdle. PubMed contains well in excess of 15 million papers, and the Medical Subject Headings listing, the database's controlled thesaurus, has more than 22,000 keywords, not counting chemical and protein names. Yet Craven says it would be simple to write a script to download a huge collection of papers (say, 50,000 records) and then pull out keywords automatically. Mining the complete article text would be more challenging, he says, but is also doable.
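
The keyword-extraction half of such a script might look something like the Python sketch below. It is only an assumption about one simple approach, not a description of anything Craven has written: the function name `extract_keywords`, the stopword list, and the three sample records are made up, and the download step is left out entirely. The idea is to count how many records each term appears in and keep the most common ones.

```python
import re
from collections import Counter

# A tiny illustrative stopword list; a real script would use a fuller one.
STOPWORDS = {"the", "and", "with", "that", "this", "from", "were", "have"}


def extract_keywords(records, top_n=5, min_length=4):
    """Return the terms that appear in the most records.

    `records` is any iterable of strings (titles or abstracts, for example)
    that have already been downloaded.
    """
    counts = Counter()
    for text in records:
        tokens = re.findall(r"[a-z][a-z\-]+", text.lower())
        # Count each term once per record so one long record cannot dominate.
        unique_terms = {
            t for t in tokens if len(t) >= min_length and t not in STOPWORDS
        }
        counts.update(unique_terms)
    return [term for term, _ in counts.most_common(top_n)]


if __name__ == "__main__":
    # Invented sample records, standing in for downloaded abstracts.
    sample = [
        "Protein kinase signaling in yeast",
        "Kinase inhibitors and protein folding",
        "Yeast protein expression",
    ]
    print(extract_keywords(sample, top_n=3))
```

The resulting list could then feed the lexicon of a sentence generator. Picking out multiword phrases, or telling nouns from adjectives, would take more work than this simple count.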

As for Stribling et al., their success has been their undoing. Their original plan was to attend the meeting and present a randomly generated talk created just prior to the presentation. They collected just under $2,400 from 171 donors to finance their trip after they posted a plea for funds on their Web site. But news of their prank, which spread on sites like Slashdot.org and Fark.com as well as in the mainstream press, caused the conference to rescind their provisional acceptance. Conference organizer Nagib Callaos writes in an E-mail that he was "examining our systems and procedures, and if they were really being followed by our staff."

It's unlikely that "SCIgen-erated" papers would have any chance of making it past reviewers in a serious scientific journal, says biolinguist Lynette Hirschman. One of the "grand challenges" of cognitive science is getting a computer to do a task such that people cannot distinguish between what a computer has done and what a person has done. "The reason that this is a grand challenge is that, so far, computers have not come close on tasks such as writing a conference paper," Hirschman writes in an E-mail.

That doesn't mean it wouldn't be fun to try. Craven says he'd up the ante by adding "some discourse processing" to the software, to get some coherence from one sentence to the next. Then, he says, the result would be "less like computer-generated garbage and more like human-generated garbage." Alan Sokal, who famously published a nonsensical paper in the journal Social Text in 1996, says of the MIT prank: "My reaction is basically, more power to them."