What percentage of news would be written by computers in 15 years?

May 8th, 2012

90%. A very cool piece in Wired looks at whether an algorithm can write a news story better than a human reporter.

Hammond assures me I have nothing to worry about. This robonews tsunami, he insists, will not wash away the remaining human reporters who still collect paychecks. Instead the universe of newswriting will expand dramatically, as computers mine vast troves of data to produce ultracheap, totally readable accounts of events, trends, and developments that no journalist is currently covering.

That’s not to say that computer-generated stories will remain in the margins, limited to producing more and more Little League write-ups and formulaic earnings previews. Hammond was recently asked for his reaction to a prediction that a computer would win a Pulitzer Prize within 20 years. He disagreed. It would happen, he said, in five.

And this bit about how engineers can teach computers to tweak narratives is fascinating:

The startup’s first customer was a TV network for the Big Ten college sports conference. The company’s algorithm would write stories on thousands of Big Ten sporting events in near-real time; its accounts of football games updated after every quarter. Narrative Science also got assigned the women’s softball beat, where it became the country’s most prolific chronicler of that sport.

But not long after the contract began, a slight problem emerged: The stories tended to focus on the victors. When a Big Ten team got whipped by an out-of-conference rival, the resulting write-ups could be downright humiliating. Conference officials asked Narrative Science to find a way for the stories to praise the performances of the Big Ten players even when they lost. A human journalist might have blanched at the request, but Narrative Science’s engineers saw no problem in tweaking the software’s parameters—hacking it to make it write more like a hack. Likewise, when the company began covering Little League games, it quickly understood that parents didn’t want to read about their kids’ errors. So the algorithmic accounts of those matchups ignore dropped fly balls and focus on the heroics.

As you read this description of how a story is generated, think about how much more work it would take for a computer to write a legal brief:

Narrative Science’s writing engine requires several steps. First, it must amass high-quality data. That’s why finance and sports are such natural subjects: Both involve the fluctuations of numbers—earnings per share, stock swings, ERAs, RBI. And stats geeks are always creating new data that can enrich a story. Baseball fans, for instance, have created models that calculate the odds of a team’s victory in every situation as the game progresses. So if something happens during one at-bat that suddenly changes the odds of victory from say, 40 percent to 60 percent, the algorithm can be programmed to highlight that pivotal play as the most dramatic moment of the game thus far. Then the algorithms must fit that data into some broader understanding of the subject matter. (For instance, they must know that the team with the highest number of “runs” is declared the winner of a baseball game.) So Narrative Science’s engineers program a set of rules that govern each subject, be it corporate earnings or a sporting event. But how to turn that analysis into prose? The company has hired a team of “meta-writers,” trained journalists who have built a set of templates. They work with the engineers to coach the computers to identify various “angles” from the data. Who won the game? Was it a come-from-behind victory or a blowout? Did one player have a fantastic day at the plate? The algorithm considers context and information from other databases as well: Did a losing streak end?

Then comes the structure. Most news stories, particularly about subjects like sports or finance, hew to a pretty predictable formula, and so it’s a relatively simple matter for the meta-writers to create a framework for the articles. To construct sentences, the algorithms use vocabulary compiled by the meta-writers. (For baseball, the meta-writers seem to have relied heavily on famed early-20th-century sports columnist Ring Lardner. People are always whacking home runs, swiping bags, tallying runs, and stepping up to the dish.) The company calls its finished product “the narrative.”
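
To make the steps in that passage concrete, here is a toy sketch in Python. It is not Narrative Science's code, and every name in it (Play, pivotal_play, pick_angle, TEMPLATES, write_recap), the six-run blowout threshold, and the sample phrasing are my own assumptions for illustration. It only mirrors the outline described above: take structured data, apply a subject rule (more runs wins), pick an "angle," flag the play with the biggest swing in win probability, and fill a template written in advance.

```python
from dataclasses import dataclass


@dataclass
class Play:
    """One event in a game, with the eventual winner's odds before and after it."""
    description: str
    win_prob_before: float
    win_prob_after: float


def pivotal_play(plays):
    """Return the play with the largest absolute swing in win probability."""
    return max(plays, key=lambda p: abs(p.win_prob_after - p.win_prob_before))


def pick_angle(margin, winner_trailed_late):
    """Choose a story angle. The 6-run blowout cutoff is an arbitrary assumption."""
    if margin >= 6:
        return "blowout"
    if winner_trailed_late:
        return "comeback"
    return "close_game"


# Hypothetical templates standing in for the vocabulary a "meta-writer" would supply.
TEMPLATES = {
    "blowout": "{winner} routed {loser} {w}-{l}.",
    "comeback": "{winner} rallied past {loser} {w}-{l}.",
    "close_game": "{winner} edged {loser} {w}-{l}.",
}


def write_recap(home, away, home_runs, away_runs, plays, winner_trailed_late=False):
    """Assemble a two-sentence recap from the score and the play list."""
    if home_runs == away_runs:
        raise ValueError("this sketch only handles games with a winner")
    # Subject rule: the team with more runs wins.
    if home_runs > away_runs:
        winner, loser, w, l = home, away, home_runs, away_runs
    else:
        winner, loser, w, l = away, home, away_runs, home_runs
    angle = pick_angle(w - l, winner_trailed_late)
    lead = TEMPLATES[angle].format(winner=winner, loser=loser, w=w, l=l)
    key = pivotal_play(plays)
    return f"{lead} The turning point: {key.description}."


if __name__ == "__main__":
    plays = [
        Play("a leadoff single in the third", 0.50, 0.55),
        Play("a two-run double in the seventh", 0.40, 0.60),
    ]
    print(write_recap("Wildcats", "Badgers", 3, 2, plays))
```

Even in this toy version, the Little League tweak described earlier would amount to a small change: filter out the plays parents would rather not read about before choosing the turning point. A legal brief has no box score to start from.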

And there is some interesting discussion of how data can be gleaned from a baseball game, but are managers listening?

But even if Narrative Science never does learn to produce Pulitzer-level scoops with the icy linguistic precision of Joan Didion, it will still capitalize on the fact that more and more of our lives and our world is being converted into data. For example, over the past few years, Major League Baseball has spent millions of dollars to install an elaborate system of hi-res cameras and powerful sensors to measure nearly every event that’s occurring on its fields: the velocities and trajectories of pitches, tracked to fractions of inches. Where the fielders stand at any given moment. How far the shortstop moves to dive for a ground ball. Sometimes the real story of the game may lie within that data. Maybe the manager failed to detect that a pitcher was showing signs of exhaustion several batters before an opponent’s game-winning hit. Maybe a shortstop’s extended reach prevented six hits. This is stuff that even an experienced beat writer might miss. But not an algorithm.

H/T Corey Carpenter