Bielefeld Workshop on Developmental Speech Recognition
Supported by Excellence Cluster Cognitive Interaction Technology (CITEC), Bielefeld University
Automatic speech recognition (ASR) systems are increasingly relevant not only for robotics but also for applications in telecommunication and entertainment. Over the last decade, new machine learning algorithms have been proposed to improve ASR systems. Despite considerable advances, none of these promising approaches reaches human performance. It therefore seems a reasonable strategy to take human language acquisition, and especially language acquisition in children, as a model for the development of ASR systems.
Addressing ASR from a developmental perspective raises several new issues:
(1) Language acquisition in children is not limited to speech but includes multimodal communication, in which visual stimuli such as gestures or synchronized motion enable faster learning. The speech signal is thus grounded in the physical world and hence tied to meaning. Such grounding generalises far better than standard uni-modal approaches.
(2) Developmental learning calls for new methodological approaches. Current HMM-based techniques are still difficult to apply to incremental learning. For example, the emergence of a phonetic system, as can be observed in the perception of very young infants, seems to be a precursor of acoustic word or syllable learning. For a technical system, this requires building a hierarchical representation on top of an earlier acquired holistic one. Technical solutions for such an emergent structure have yet to be developed.
(3) The importance of the coupling between speech production and perception in human infants has long been acknowledged in phonetic science. Yet ASR and speech synthesis approaches are still being developed independently, although the current use of HMMs in automatic speech synthesis may be seen as a good precursor of a system that relies on the same representation for production and perception. The prospect of an ASR system that is able to synthesize its acquired models puts interactive speech learning within reach.
(4) Developmental learning requires novel training data. Infants receive specifically designed input from their caregivers (motherese) and seem to profit from such modified input. Current ASR systems, however, would fail to exploit the benefits of motherese. New techniques and representations need to be developed in order to benefit from tutoring behavior in an interactive environment. To build such techniques, new speech corpora consisting of infant-directed speech are needed.
The aim of this workshop is to bring together researchers from various fields who address these issues from different perspectives, and to create an atmosphere in which experts from academia and industry discuss the benefits, prospects and limitations of state-of-the-art ASR systems and their relation to language acquisition. Furthermore, we intend to stimulate discussion and share ideas about fundamental issues and future challenges in speech recognition, taking into account all of the topics mentioned above. Through these discussions we want to identify promising new directions of research in automatic speech recognition.