Thanks to reader A-Dave, who sent along this NPR story about StatsMonkey, a program developed at Northwestern University that has the potential to computerize journalism. The program takes a baseball game's box score as input and generates an article about the game - replete with stats, events, and even quotations and pictures of the players.
Here's the project's website at the Intelligent Information Laboratory at Northwestern, where they explain how the program works.
The system is based on two underlying technologies. First, it uses baseball statistical models to figure out what the news is in the story: By analyzing changes in Win Probability and Game Scores, the system can pick out the key plays and players from any baseball game. Second, the system includes a library of narrative arcs that describe the main dynamics of baseball games (as well as many other competitions): Was it a come-from-behind win? Back-and-forth the whole way? Did one team jump out in front at the beginning and then sit on its lead? The system uses a decision tree to select the appropriate narrative arc. This then determines the main components of the game story and enables the system to put them together in a cohesive and compelling manner. The stories can be generated from the point of view of either team.
A few thoughts on this:
1. I'd never thought of using Win Probability this way - treating the plays that change the WP the most as the most significant plays in the game. For example, it seems the program can determine that before Larry B's broken-bat single down the third-base line, Stanford Hall had a 4% chance of winning the game, and after it, a 10% chance. Therefore, Larry B is more clutch than David Eckstein.
2. If the code for this project ever goes public, you can imagine the Turing tests that will take place on this blog. I'm sure the computer-generated stories will be better than HatGuy's sentimental diatribes on why his favorite teams need to win more.
3. Doesn't the quality of these articles depend on the phrases coded into the machine's algorithms? For example, if the folks at Northwestern hire Jay to help them come up with a fixed set of phrases that describe a come-from-behind home run in the bottom of the seventh, then every article that comes out of this program is stamped with Jay's inanity.
4. Someone needs to save sports journalism. Someone needs to walk into the compounds of sports news departments all over the country and deliver the news:
Listen, sportswriters, and understand. That StatsMonkey is out there. It can't be bargained with. It can't be reasoned with. It doesn't feel pity, or remorse, or fear. It knows the clutch situations in the game without even listening to the roar of the crowd and the crack of the bat. And it absolutely will not stop, ever, until your industry is dead.
I think this job falls to pnoles.
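For the curious, here's a rough Python sketch of the two ideas from the project's description: ranking plays by how much they swing the Win Probability (point 1 above), and walking a small decision tree to pick a narrative arc. To be clear, this is not StatsMonkey's code. The `Play` class, the function names, the arc labels, and every threshold (the 0.25 cutoff, the three lead changes) are my own made-up stand-ins, and the win probabilities are assumed to come from some outside model.

```python
from dataclasses import dataclass


@dataclass
class Play:
    """Hypothetical record of one play; not StatsMonkey's data format."""
    description: str
    wp_before: float  # one team's win probability before the play (0 to 1)
    wp_after: float   # the same team's win probability after the play


def key_plays(plays, top_n=3):
    """Return the top_n plays with the largest absolute swing in win probability."""
    return sorted(
        plays,
        key=lambda p: abs(p.wp_after - p.wp_before),
        reverse=True,
    )[:top_n]


def narrative_arc(winner_wp):
    """Pick an arc from the winner's win probability after each play.

    A hand-written decision tree with assumed thresholds:
    - winner once had under a 25% chance -> come-from-behind
    - win probability crossed 50% three or more times -> back-and-forth
    - otherwise -> wire-to-wire
    """
    if min(winner_wp) < 0.25:
        return "come-from-behind"
    lead_changes = sum(
        1
        for a, b in zip(winner_wp, winner_wp[1:])
        if (a - 0.5) * (b - 0.5) < 0  # the two values sit on opposite sides of 50%
    )
    if lead_changes >= 3:
        return "back-and-forth"
    return "wire-to-wire"


if __name__ == "__main__":
    game = [
        Play("Leadoff walk", 0.50, 0.45),
        Play("Larry B's broken-bat single", 0.04, 0.10),
        Play("Three-run homer", 0.10, 0.55),
    ]
    for play in key_plays(game, top_n=2):
        print(play.description, round(abs(play.wp_after - play.wp_before), 2))
    print(narrative_arc([0.50, 0.45, 0.04, 0.10, 0.55, 1.0]))
```

Note that under this toy ranking, Larry B's single (a six-point swing) gets buried by the homer, which says something about how clutch a 4%-to-10% play really is.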