Ideally, humans shouldn’t have to wear microphones to have a decent conversation with a robot. But background noise from the environment and from the bot itself can make it difficult for the bot to pick up a clear audio signal. Turning toward the person speaking tends to make the audio clearer, but first the bot has to identify who is speaking and where they are located.
That’s why researchers in France have developed novel algorithms to enhance human-robot interactions by fusing audio and visual data. Xavier Alameda-Pineda and Radu Horaud of Inria have created a robust system that maps audio and video data together to identify how many people are involved in a conversation and where the person talking is located.
According to their paper, published recently in the International Journal of Robotics Research, “finding the potential speakers and assessing their speaking status is a pillar task, on which all applications mentioned above rely. In other words, providing a robust framework to count how many speakers are in the scene, to localize them and to ascertain their speaking state, will definitely increase the performance of many audio-visual perception methods.”
The team used a NAO humanoid (which comes with an onboard stereoscopic camera and microphones) to test their hybrid deterministic/probabilistic framework. Tests run in a 5 X 5 room with furniture and varying numbers of moving people showed the system to be quite effective at detecting and localizing speakers. Interestingly, the current system puts more emphasis on the video data than on the audio, though the researchers note that there may be situations where the reverse would prove more effective.
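To make the idea of weighting video more heavily than audio concrete, here is a minimal Python sketch. It is not the researchers’ algorithm: the `FaceDetection` structure, the lip-activity cue, the Gaussian match between sound direction and face direction, and the 0.7/0.3 weights are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class FaceDetection:
    """A visually detected person (hypothetical structure, not from the paper)."""
    name: str
    bearing_deg: float   # horizontal angle of the face relative to the robot
    lip_activity: float  # assumed visual cue in (0, 1]; higher means more lip motion


def audio_log_likelihood(face_bearing, sound_bearing, sigma_deg=15.0):
    """Log of an unnormalized Gaussian: how well the sound direction matches a face."""
    diff = face_bearing - sound_bearing
    return -0.5 * (diff / sigma_deg) ** 2


def pick_speaker(faces, sound_bearing_deg, w_video=0.7, w_audio=0.3):
    """Return the face with the highest weighted audio-visual score, or None."""
    best, best_score = None, -math.inf
    for face in faces:
        # Weighted sum of log-scores; the weights are illustrative assumptions.
        score = (w_video * math.log(face.lip_activity)
                 + w_audio * audio_log_likelihood(face.bearing_deg, sound_bearing_deg))
        if score > best_score:
            best, best_score = face, score
    return best


faces = [
    FaceDetection("left", bearing_deg=-30.0, lip_activity=0.2),
    FaceDetection("right", bearing_deg=25.0, lip_activity=0.9),
]
speaker = pick_speaker(faces, sound_bearing_deg=20.0)
print(speaker.name)
```

Swapping the two weights would give the reverse scenario the researchers mention, in which the audio cue dominates the decision.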
Other research groups have worked on audio-video fusion before, but those algorithms either needed a lot of tweaking before each use, couldn’t work in real time, or couldn’t handle more than one person in the conversation. This new algorithm works in real time (17 FPS), can handle multiple speakers, and requires only one very short calibration step.