A recently released study has concluded that computers are capable of scoring essays on standardized tests as well as human beings do.
Mark Shermis, dean of the College of Education at the University of Akron, collected more than 16,000 middle school and high school test essays from six states that had been graded by humans. He then used automated systems developed by nine companies to score those essays.
Computer scoring produced “virtually identical levels of accuracy, with the software in some cases proving to be more reliable,” according to a University of Akron news release.
“A Win for the Robo-Readers” is how an Inside Higher Ed blog post summed things up.
For people with a weakness for humans, there is more bad news. Graders working as quickly as they can (the Pearson education company expects readers to spend no more than two to three minutes per essay) might be capable of scoring 30 writing samples in an hour.
The automated reader developed by the Educational Testing Service, e-Rater, can grade 16,000 essays in 20 seconds, according to David Williamson, a research director for E.T.S., which develops and administers 50 million tests a year, including the SAT.
Is this the end? Are Robo-Readers destined to inherit the earth?
Les Perelman, a director of writing at the Massachusetts Institute of Technology, says no.
When he is not teaching undergraduates, Mr. Perelman studies the algorithms described in E.T.S. research papers, which has taught him to think like e-Rater.
His research is limited, because E.T.S. is the only organization that has permitted him to test its product. Even so, he says the automated reader can be gamed easily, is vulnerable to test prep, sets a very limited and rigid standard for what good writing is, and will pressure teachers to dumb down writing instruction.
The e-Rater’s biggest problem, he says, is that it can’t identify truth. He tells students not to waste time worrying about whether their facts are accurate, since pretty much any fact will do as long as it is incorporated into a well-structured sentence. “E-Rater doesn’t care if you say the War of 1812 started in 1945,” he said.
Mr. Perelman found that e-Rater prefers long essays. A 716-word essay he wrote that was padded with more than a dozen nonsensical sentences received a top score of 6; a well-argued, well-written essay of 567 words was scored a 5.
An automated reader can count, he said, so it can set parameters for the number of words in a good sentence and the number of sentences in a good paragraph. “Once you understand e-Rater’s biases,” he said, “it’s not hard to raise your test score.”
E-Rater, he said, does not like short sentences.
Or short paragraphs.
Or sentences that begin with “or.” And sentences that start with “and.” Nor sentence fragments.
However, he said, e-Rater likes connectors, like “however,” which serve as programming proxies for complex thinking. Moreover, “moreover” is good, too.
Gargantuan words are rewarded because e-Rater interprets them as a sign of lexical complexity. “Whenever possible,” Mr. Perelman advises, “use a big word. ‘Egregious’ is better than ‘bad.’ ”
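None of the vendors publish their scoring code, so the following is only a minimal sketch, in Python, of the kind of feature counting Mr. Perelman describes. Every name, threshold, weight, and word list here is invented for illustration; nothing in it is drawn from e-Rater's actual implementation. What it shows is how a scorer built entirely on countable proxies rewards length, connectors, and big words while remaining blind to whether anything said is true.

```python
import re

# Invented word list: connectors the toy scorer treats as proxies for complex thinking.
CONNECTORS = {"however", "moreover", "therefore", "furthermore", "consequently"}

def toy_score(essay: str) -> float:
    """Score an essay 0-6 from surface features alone; no meaning is ever read."""
    words = re.findall(r"[A-Za-z']+", essay)
    sentences = [s for s in re.split(r"[.!?]+", essay) if s.strip()]
    paragraphs = [p for p in essay.split("\n\n") if p.strip()]

    # Length bias: more words means more points, regardless of content.
    length_pts = min(len(words) / 150, 3.0)

    # "Lexical complexity": long words earn credit; nothing checks they are used correctly.
    big_word_pts = min(sum(len(w) >= 9 for w in words) * 0.1, 1.5)

    # Connectors are counted as if they were evidence of reasoning.
    connector_pts = min(sum(w.lower() in CONNECTORS for w in words) * 0.25, 1.0)

    # Penalize the short sentences and short paragraphs Mr. Perelman mentions.
    avg_sentence_len = len(words) / max(len(sentences), 1)
    well_padded = avg_sentence_len >= 15 and all(len(p.split()) >= 40 for p in paragraphs)
    structure_pts = 0.5 if well_padded else 0.0

    # Conspicuously absent: any check that a single statement is true.
    return round(min(length_pts + big_word_pts + connector_pts + structure_pts, 6.0), 1)
```

On a metric like this, a long essay padded with nonsensical but connector-laden sentences outscores a tighter, better-argued one, which mirrors the 716-word-versus-567-word result Mr. Perelman reported.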
From the Times.