The Pentagon’s blue-sky researchers are funding a project that uses crowdsourcing to improve how machines analyze our speech. Even more radical: Darpa wants to make systems so accurate, you’ll be able to easily record, transcribe and recall all the conversations you ever have.
Analyzing speech and improving speech-to-text machines have been hobby horses for Darpa in recent years. This project goes a step further by exploring how crowdsourcing can make it possible for our speech to be recorded and stored forever. And it’s not just about better recordings of what you say. It’ll lead to more recorded conversations, quickly transcribed and then stored in perpetuity, like a Twitter feed or e-mail archive for everyday speech. Imagine living in a world where every errant utterance you make is preserved forever.
University of Texas computer scientist Matt Lease has studied crowdsourcing for years, including for an earlier Darpa project called Effective Affordable Reusable Speech-to-text, or EARS, which sought to boost the accuracy of automated transcription machines. His work has also attracted enough attention for Darpa to give him a $300,000 award over two years for the new project, called “Blending Crowdsourcing with Automation for Fast, Cheap, and Accurate Analysis of Spontaneous Speech.” The project envisions a world that is both radically transparent and a little freaky.
The idea is that business meetings or even conversations with your friends and family could be stored in archives and easily searched. The stored recordings could be held in servers, owned either by individuals or their employers. Lease is still playing with the idea — one with huge implications for how we interact.
“In their call, what [Darpa] really talked about were different areas of science where they would like to see advancements in certain problems that they see,” Lease told Danger Room at his Austin office. “So I responded talking about what I saw as this very big both need and opportunity to really make conversational speech more accessible, more part of our permanent record instead of being so ephemeral, and really trying to imagine what this world would look like if we really could capture all these conversations and make use of them effectively going forward.”
How? The answer, Lease says, is in widespread use of recording technologies like smartphones, cameras and audio recorders — a kind of “democratizing force of everyday people recording and sharing their daily lives and experiences through their conversations.” But the trick to making the concept functional and searchable, says Lease, is blending automated voice analysis machines with large numbers of human analysts through crowdsourcing. That could be through involving people “strategically,” to clean up transcripts where machines made a mistake. Darpa’s older EARS project relied entirely on automation, which has its drawbacks.
“Like other AI, it can only go so far, which is based on what the state-of-the-art methodology can do,” Lease says. “So what was exciting to me is thinking about going back to some of that work and now taking advantage of crowdsourcing and applying that into the mix.”
Crowdsourcing is all about harnessing distributed networks of people (crowds) to do tasks better and more efficiently than individuals or machines. Recently, that’s meant harnessing large numbers of people to build digital maps, raise funds for a film project on Kickstarter, or do odd jobs on Amazon Mechanical Turk, one system being studied as part of the project. Darpa has also taken an interest in crowdsourcing as a way to analyze vast volumes of intelligence data, and Darpa’s sibling in the intelligence community, Iarpa, has researched crowdsourcing as a way to find the best intelligence predictions.
But a few problems have to be overcome before crowdsourcing can be used to analyze speech. According to Lease, both crowdsourcing and automated systems for analyzing and transcribing speech are — by themselves — pretty weak. Audio transcripts written by humans are very accurate, but they take time to produce, and the labor is too expensive when applied on a large enough scale.
Meanwhile, automated systems are not very accurate, and require humans to copy-edit the result: adding missing punctuation marks and capitalization, and correcting for verbal disfluencies, those little noises we make when filling gaps in our speech, like “um” or “ah.” We don’t always finish our words when we talk. (But our brains are really good at not noticing it.) We change phrases mid-thought, or mistakenly begin a sentence by saying one word, only to quickly correct ourselves by switching to another word. Background noise, which has plagued voice recognition machines, can also interfere with the quality. All in all, this kind of conversational, casual speech plays havoc with our automated machines, the result being a sort of unintelligible word salad.
“There’s a linguistic sense in that conversational speech is quite different than text,” Lease says. “So we really need to think about how we make this form of our language, which is so natural to us in speech, something that is accessible to us when it’s written down, in a way that it may not naturally be.”
The project also raises some thorny legal and social questions about privacy. For one, there is an issue with “respecting the privacy rights of multiple people involved,” Lease says. One solution, for a business conference that’s storing and transcribing everything said by the participants, could be a mutual agreement between all parties. He adds that the technical issues of archiving recorded speech are still open questions, but people could potentially hold their cell phone conversations on remote servers or on individual, privately held servers.
The other problem is figuring out how to search massive amounts of transcribed speech, much as search engines such as Google use complex algorithms to match search queries with results that are likely to be relevant. Fast and cheap web analytics, which judge what people type and match it up to what they click, are one way to do it. Studying focus groups is more precise, but expensive. A third way, Lease suggests, is using more crowdsourcing as a sort of “middle-ground” between the two methods.
It’s unknown how the military will apply the research. Lease wouldn’t speculate, and it’s still very much a basic research project. But if it’s at all similar to EARS, the applications may not be too difficult to figure out. A 2003 memorandum from the Congressional Research Service described EARS as focusing on speech picked up from broadcasts and telephone conversations, “as well as extract clues about the identity of speakers” for “the military, intelligence and law enforcement communities.” Lease, though, didn’t mention automatically recognizing voices. And the research may not have to go that far if we’re going to be recording ourselves.