Monday, January 18, 2010

Technology Could End This Industry


Thanks to reader A-Dave, who sent along this NPR story about StatsMonkey, a program developed at Northwestern University that has the potential to computerize journalism. The program takes a baseball game's box score as input and generates an article about the game, complete with stats, key events, and even quotations and pictures of the players.

Here's the project's website at the Intelligent Information Laboratory at Northwestern, where they explain how the program works.

The system is based on two underlying technologies. First, it uses baseball statistical models to figure out what the news is in the story: By analyzing changes in Win Probability and Game Scores, the system can pick out the key plays and players from any baseball game. Second, the system includes a library of narrative arcs that describe the main dynamics of baseball games (as well as many other competitions): Was it a come-from-behind win? Back-and-forth the whole way? Did one team jump out in front at the beginning and then sit on its lead? The system uses a decision tree to select the appropriate narrative arc. This then determines the main components of the game story and enables the system to put them together in a cohesive and compelling manner. The stories can be generated from the point of view of either team.
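To make those two steps concrete, here is a rough sketch in Python. It is not StatsMonkey's code (which isn't public); the `Play` record, the function names, the arc labels, and every threshold below are assumptions made up for illustration. It ranks plays by how far they move Win Probability, then walks a tiny hand-written decision tree to pick an arc.

```python
from dataclasses import dataclass


@dataclass
class Play:
    """Hypothetical play record; the field names are assumptions, not StatsMonkey's."""
    inning: int
    description: str
    wp_before: float  # eventual winner's win probability before the play (0..1)
    wp_after: float   # eventual winner's win probability after the play (0..1)

    @property
    def swing(self) -> float:
        # Size of the Win Probability change caused by this play.
        return abs(self.wp_after - self.wp_before)


def key_plays(plays, n=3):
    """Return the n plays with the largest Win Probability swing, biggest first."""
    return sorted(plays, key=lambda p: p.swing, reverse=True)[:n]


def narrative_arc(wp_series):
    """Pick an arc from the winner's WP after each play.

    A toy decision tree: the 0.25 cutoff and the 3-lead-change cutoff
    are invented for this sketch.
    """
    lowest = min(wp_series)
    # Count how often the series crosses 50%, i.e. the favorite flips.
    lead_changes = sum(
        1 for a, b in zip(wp_series, wp_series[1:]) if (a - 0.5) * (b - 0.5) < 0
    )
    if lowest < 0.25:
        return "come-from-behind"
    if lead_changes >= 3:
        return "back-and-forth"
    return "early-lead"


if __name__ == "__main__":
    game = [
        Play(1, "routine groundout", 0.50, 0.48),
        Play(7, "broken-bat single down the third base line", 0.04, 0.10),
        Play(9, "walk-off home run", 0.10, 1.00),
    ]
    for play in key_plays(game, n=2):
        print(play.inning, play.description)
    print(narrative_arc([p.wp_after for p in game]))
```

A real system would presumably learn or tune those cutoffs rather than hard-code them, and would need a Win Probability model to fill in `wp_before` and `wp_after` in the first place.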

A few thoughts on this:

1. I'd never thought of using Win Probability this way: treating the plays that change WP the most as the most significant plays in the game. For example, it seems that the program can determine that before Larry B's broken-bat single down the third base line, Stanford Hall had a 4% chance of winning the game, and after it, a 10% chance. Therefore, Larry B is more clutch than David Eckstein.

2. If the code for this project ever goes public, you can imagine the Turing tests that will take place on this blog. I'm sure the computer-generated stories will be better than HatGuy's sentimental diatribes on why his favorite teams need to win more.

3. Doesn't the quality of the output of these articles depend on the phrases coded into the machine's algorithms? For example, if the folks at Northwestern hire Jay to help them come up with a set number of phrases that describe a come-from-behind home run in the bottom of the seventh, that means every article that comes out of this program is stamped with Jay's inanity.

4. Someone needs to save sports journalism. Someone needs to walk into the compounds of sports news departments all over the country and deliver the news:
Listen, sportswriters, and understand. That StatsMonkey is out there. It can't be bargained with. It can't be reasoned with. It doesn't feel pity, or remorse, or fear. It knows the clutch situations in the game without even listening to the roar of the crowd and the crack of the bat. And it absolutely will not stop, ever, until your industry is dead.
I think this job falls to pnoles.

11 comments:

Larry B said...

The one problem I see is that it can't tell whether a defensive play in a high leverage situation was routine or mind-fuckingly awesome. For example, if the Mets had ended up winning game 7 of the 2006 NLCS, clearly the story of the game would be Endy Chavez's unbelievable catch which robbed a Cardinal of a HR. But in the box score, it's just an F7. Even "F7- deep" doesn't really give the program anything to write about.

Other than that, this sounds pretty badass.

Chris W said...

larry b broke an aluminum bat? Who was pitching? Mtess's cousin?

Elliot said...

Can a computer make horribly dated references that are in no way applicable to the subject at hand? I don't think so.

Austin said...

a Dave in need is A-Dave indeed.

Chris W said...

A Dave with weed is better

dan-bob said...

Charge it to the FJayM corporate account. Aren't you in charge of the finances anyways?

Jarrett said...

Shit, dan-bob, thanks for reminding me, I almost forgot to fill out my expense report.

red said...

This program could definitely be used to replace guys like Mariotti, but the prose of the really good sportswriters will always be needed.

Alex said...

Yeah, but can it detect irony?

Prose and cadence - I agree with Red. How can they capture that?

Personally, I'd program it to use Mariotti in iambic pentameter.

slwg said...

And I hear that the same team at Northwestern is working on BlogMonkey, a program which will critique the output of StatsMonkey.

Adam said...

Doesn't the quality of the output of these articles depend on the phrases coded into the machine's algorithms?

That's not really any different than the dopey AP writers and quoted players and managers who give no real insight and basically spout cliches.

Tell me a computer program couldn't come up with something like this:

http://scores.espn.go.com/mlb/recap?gameId=290820117&teams=san-francisco-giants-vs-cincinnati-reds