A group of Stanford University researchers have data mined the short stories on the popular Toronto social publishing platform Wattpad to help their artificial intelligence “knowledge base”, a program called Augur, better understand human behaviour via Wattpad’s practically endless 600,000 chapter corpus.
In a paper called “Augur: Mining Human Behaviors from Fiction to Power Interactive Systems”, researchers outline their objective “to create a broad knowledge base of human behavior by text mining a large dataset of modern fiction”, mainly for the purpose of predicting what a human is most likely to do next in a given context.
“Fictional human lives provide surprisingly accurate accounts of real human activities,” write the researchers. “Characters in modern fiction turn on the lights after entering rooms; they react to compliments by blushing; they do not answer their phones when they are in meetings. Our knowledge base, Augur, learns these associations between activities and objects by mining 1.8 billion words of modern fiction from the online writing community Wattpad.”
The partnership with the Stanford researchers happened because Wattpad co-founder Ivan Yuen knew some of the researchers personally.
Founded in 2006, Wattpad began life uploading public domain texts in the hope that the platform would eventually achieve some kind of traction with readers.
It wasn’t until 2012, with a $17.3 million investment from Khosla Ventures, followed by an endorsement from Margaret Atwood in 2013 that Wattpad achieved critical momentum.
Atwood published a serialized novel that she co-wrote with English author Naomi Alderman entitled The Happy Zombie Sunrise Home, which went a long way to legitimizing Wattpad as a social networking hub for self-publishers.
Since then, Wattpad has taken off in southeast Asia, spawning a prime time TV series in the Phillippines called “Wattpad Presents”.
Participants in the Stanford research were outfitted with a Google Glass headset working with Augur that was able to identify a wide range of objects based on text descriptions the software had learned from reading Wattpad.
Users also had a proof-of-concept Google Glass application called Soundtrack For Life that can play a piece of contextually aware music, with Augur researchers suggesting that “you might cook to the refined arpeggios of Vivaldi, exercise to the dark ambivalence of St. Vincent, and work to the electronic pulse of the Glitch Mob.”
Using the term “dark ambivalence” to describe the music of St. Vincent suggests that the researchers have some distance to go before they can convincingly replace the music critics on Pitchfork with an algorithm.
That said, the system is achieving results that are nothing to scoff at, with Augur able to identify the next action in a given context with 71% accuracy.
Of those predictions, the system deemed 94% of those next actions to be “sensible”.
Embedding statistics gleaned by Augur into what the researchers call a “vector space model”, predictions are made through a process of machine learning based on an if-this-then-that model.
“If Augur can predict your next activity, applications can react in advance to better meet your needs in that situation,” write the researchers. “Activity predictions are particularly useful for helping users avoid problematic behaviors, like for getting their keys or spending too much money.”
It’s the mundane activity represented in the stories that the researchers decided would help them to model the next generation of technological nag guaranteed to keep people’s spending habits in line.
“While we tend to think about stories in terms of the dramatic and unusual events that shape their plots, stories are also filled with prosaic information about how we navigate and react to our everyday surroundings,” write the researchers.
“Over many millions of words, these mundane patterns are far more common than their dramatic counterparts. Characters in modern fiction turn on the lights after entering rooms; they react to compliments by blushing; they do not answer their phones when they are in meetings.”
Augur is an open source knowledge base that the researchers hope other scientists can build upon.
While students of English literature have long known that it’s entirely possible to formulate an ethical worldview, or at least a compelling outline of the best and worst of human behaviour, just by reading “Middlemarch” or “Heart of Darkness”, there has certainly never been a larger data set of fiction-oriented words to exploit than in the past decade, with the advent of fan fiction platforms like Wattpad.
However, the researchers also know that outside of Swedish best-sellers, fiction doesn’t only consist of mundane activities like losing your keys, attending children’s birthday parties, changing diapers and getting drunk.
“If fiction were truly representative of our lives,” write the researchers, “we might be constantly drawing swords and kissing in the rain.”
I don’t know what kind of fiction these guys are reading, but an artificial intelligence system that hoped to understand human memory or a convincing model for consciousness might do better to ingest “In Search of Lost Time” by Marcel Proust or the collected stories of Mavis Gallant than to trawl through half a million erotic One Direction stories on Wattpad.
Weirdly, though, the researchers introduced their own bias into the Wattpad universe by telling Augur to ignore the vast quantity of stories that repeat keywords like “dirty”, “imagine”, “One Direction” and/or “Harry Styles”, which would probably have produced a significantly different outcome for users than helping them locate their keys.
The less said about the subgenre of sexually charged Rob Ford themed fan fiction on Wattpad the better, probably.
The Wattpad-Stanford initiative isn’t the only attempt to build a machine learning platform out of digested literature.
A few months ago, Facebook released a 1.6 GB data set affiliated with an academic paper called “The Goldilocks Principle: Reading Children’s Books with Explicit Memory Representations”, which outlined how children’s literature could help enhance machine learning.