Speech Prosody is a biennial conference held by a special interest group of ISCA. Despite working on intonation and prosody since about 2006, this is the first year I've attended. The conference has only been held five times and has the feeling of a workshop. Having no parallel sessions is nice, but the quality of the work varied widely.

The term "genre" gets used to broadly describe the context of speech: read speech, spontaneous speech, broadcast news speech, telephone conversation speech, presentation speech, meeting speech, multiparty meeting speech, etc. The list goes on because it's not a particularly rigorously defined term. The observations here also apply to text in NLP, where genre can be used to characterize newswire text, blog posts, blog comments, email, IM, fictional prose, etc.

We'd all like to make claims about the nature of speech. Big bold claims of the sort "Conversational speech is like X" (for descriptive statistics) or "This algorithm performs with accuracy Y +/- Z on broadcast conversation speech" (for evaluation tasks). Either you've made some broad (hopefully interesting or impactful) observation about speech or you're able to claim expected performance of an approach on unseen examples of speech from the same genre. It's usually impossible to know if the effects are broad enough to be consistent across the genre of speech, or if they are specific to the examined material - the corpus.

When we make claims like "it's harder to recognize spontaneous speech than read speech", usually what's being said is "my performance is lower on the spontaneous material I tested than on the read material I tested." This isn't terrible, just an overly broad claim.

Where this gets to be more of a problem is when corpus effects are considered to be genre effects. I was reminded of this issue by Anna Margolis, Mari Ostendorf and Karen Livescu's paper at Speech Prosody (see last post). They were looking at the impact of genre on prosodic event detection. They found that cross-genre/corpus training led to poor results, but that by combining training material from both genres/corpora, performance improved. However, the impact of differences between the corpora other than genre gets muddled. The two corpora in this paper are the Boston University Radio News Corpus and the Switchboard corpus. One is carefully recorded, professionally read news speech and the other is telephone conversations. In addition to genre, the recording conditions, conversational participants and labelers are all distinct. I really like this paper, even if what its results show is that joint training can overcome the corpus disparities (including genre differences). These are exactly the differences likely to be found between training data and any unseen data! And this is what system evaluations seek to measure in the first place.

Google has taken a step into open machine learning with Google Predict. Rather than release a toolkit, in typical Google fashion, they've set it up as a web service. There needs to be greater interaction with machine learning from all walks of life. Even with open source tools like Weka, current interfaces are intimidating at best and require knowledge of the field. The Prediction API strips all that away: label some rows of data and let 'er rip.

The downside of this simplified process is that Google Predict works as a black box classifier (and maybe regressor?). It "Automatically selects from several available machine learning techniques", and it supports numerical values and unstructured text as input. There are no parameters to set and you can't get a confidence score out. In all likelihood this uses the Seti infrastructure to do the heavy lifting, but there's at least a little bit of feature extraction thrown in to handle the text input. It'll be interesting to see if anyone can suss out what is going on under the hood.

When I get in, I'll post some comparison results between Google Predict and a variety of other open source tools here. Thanks to Slashdot via Machine Learning (Theory) for the heads up on this one.
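
For a comparison like the one mentioned above, any classifier, hosted or local, can be treated as a black box that maps a row of text to a label and is then scored on held-out rows. The sketch below is only an illustration of that idea in plain Python: the function names (`train_majority`, `train_naive_bayes`, `accuracy`) and the toy rows are made up for this example, and nothing here calls or imitates the actual Prediction API. A hosted service would simply be wrapped as one more `predict(text)` callable.

```python
import math
from collections import Counter, defaultdict


def train_majority(rows):
    """Baseline: always predict the most frequent training label."""
    most_common = Counter(label for label, _ in rows).most_common(1)[0][0]
    return lambda text: most_common


def train_naive_bayes(rows):
    """Tiny multinomial Naive Bayes over whitespace tokens (add-one smoothing)."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in rows:
        label_counts[label] += 1
        for tok in text.lower().split():
            word_counts[label][tok] += 1
            vocab.add(tok)
    total = sum(label_counts.values())

    def predict(text):
        best_label, best_score = None, -math.inf
        for label in label_counts:
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(label_counts[label] / total)
            denom = sum(word_counts[label].values()) + len(vocab)
            for tok in text.lower().split():
                score += math.log((word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

    return predict


def accuracy(predict, rows):
    """Fraction of held-out rows where the black box returns the gold label."""
    correct = sum(1 for label, text in rows if predict(text) == label)
    return correct / len(rows)


# Toy (label, text) rows, invented purely for illustration.
TRAIN = [
    ("spam", "win cash now"),
    ("spam", "cheap cash prize"),
    ("spam", "win a prize now"),
    ("ham", "meeting at noon"),
    ("ham", "lunch meeting tomorrow"),
]
TEST = [
    ("spam", "cash prize"),
    ("ham", "meeting tomorrow"),
]

# Every entry is just a callable text -> label; the harness never looks inside.
classifiers = {
    "majority": train_majority(TRAIN),
    "naive_bayes": train_naive_bayes(TRAIN),
}
results = {name: accuracy(clf, TEST) for name, clf in classifiers.items()}
for name, acc in results.items():
    print(f"{name}: {acc:.2f}")
```

Because the harness only ever sees labels coming back, it works even when a service exposes no parameters and no confidence scores. It also means accuracy on held-out rows is about the only thing that can be compared.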