Humans often learn new skills by watching other people. Ideally, robots should be able to do the same. But interpreting human actions is challenging, even for the most intelligent robots. Now researchers at the University of Maryland (UMD) and the National Information Communications Technology Research Centre of Excellence in Australia have programmed a robot that learns new skills by watching videos.
After watching annotated YouTube cooking videos, the robot was able to learn how to work with various kitchen tools.
“We chose cooking videos because everyone has done it and understands it,” said Yiannis Aloimonos, UMD professor of computer science and director of the Computer Vision Lab. “But cooking is complex in terms of manipulation, the steps involved and the tools you use. If you want to cut a cucumber, for example, you need to grab the knife, move it into place, make the cut and observe the results to make sure you did them properly.”
The researchers developed two visual recognition modules: one for classifying grasp type and the other for object recognition. Each used a multilayer learning framework called a convolutional neural network (CNN) for classification. For example, the software could identify whether each hand in a video was using a power or precision grasp, then break the activity down further to determine whether the grasp had a large or small diameter, or whether the hand was at rest. Object identification worked in a similar fashion, with the software identifying the object from 48 possible classes. The software also determined which action was most likely occurring in the video clip, choosing among ten possibilities: cut, pour, transfer, spread, grip, stir, sprinkle, chop, peel, and mix. Using all of this information together, the robot is able to learn how to do the same thing on its own.
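To make the idea concrete, here is a minimal sketch (my own illustration, not the authors' code) of how outputs from two independent classifiers, one for grasp type and one for object identity, might be combined to pick the most likely action. The class names, scores, and compatibility weights below are hypothetical stand-ins for the CNN outputs described above.

```python
# Hypothetical softmax-style scores from the grasp-type module.
grasp_scores = {"power-large": 0.7, "power-small": 0.1,
                "precision": 0.15, "rest": 0.05}

# Hypothetical scores from the object-recognition module
# (48 classes in the real system; only a few shown here).
object_scores = {"knife": 0.8, "cucumber": 0.1, "bowl": 0.1}

# The ten candidate actions named in the article, with made-up
# weights for how well each action fits a (grasp, object) pair.
ACTIONS = ["cut", "pour", "transfer", "spread", "grip",
           "stir", "sprinkle", "chop", "peel", "mix"]
compatibility = {("power-large", "knife"): {"cut": 0.6, "chop": 0.4},
                 ("precision", "knife"): {"spread": 0.7, "peel": 0.3}}

def most_likely_action(grasp_scores, object_scores, compatibility):
    """Score each action by accumulating grasp * object * weight mass."""
    totals = {a: 0.0 for a in ACTIONS}
    for (grasp, obj), action_weights in compatibility.items():
        joint = grasp_scores.get(grasp, 0.0) * object_scores.get(obj, 0.0)
        for action, weight in action_weights.items():
            totals[action] += joint * weight
    return max(totals, key=totals.get)

print(most_likely_action(grasp_scores, object_scores, compatibility))
# With these toy numbers, the confident power grasp on a knife
# makes "cut" the top-scoring action.
```

The real system trains CNNs on annotated video frames; this toy version only shows the fusion step, where evidence from both modules narrows down the action.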
The researchers compare individual actions to words in a sentence that can be strung together to make a point. Once a robot has learned various actions, it can string them together to achieve a goal. This goal-oriented approach is what distinguishes the project from past work.
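The "words into a sentence" idea can be sketched as a tiny goal-directed planner (again a hedged illustration, not the project's implementation): instead of replaying a fixed movement list, the robot chains whichever learned primitives are applicable until the goal state is reached. The action names, preconditions, and effects below are hypothetical.

```python
# Learned primitive actions ("words"): name -> (preconditions, effects),
# each a set of facts about the world. All entries are hypothetical.
actions = {
    "grasp_knife":  ({"knife_free"}, {"holding_knife"}),
    "cut_cucumber": ({"holding_knife"}, {"cucumber_sliced"}),
}

def plan(start, goal, actions):
    """Greedy forward search: apply any applicable action whose effects
    are new, until every goal fact holds. Returns the action 'sentence'."""
    state, sentence = set(start), []
    while not goal <= state:
        for name, (pre, eff) in actions.items():
            if pre <= state and not eff <= state:
                state |= eff
                sentence.append(name)
                break
        else:
            raise ValueError("no applicable action reaches the goal")
    return sentence

print(plan({"knife_free"}, {"cucumber_sliced"}, actions))
# The planner first grasps the knife, then cuts the cucumber.
```

The point of the sketch is the contrast Aloimonos draws: the sequence is derived from the goal and the current state, so the same learned "words" can be recombined for new goals rather than replayed verbatim.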
“Others have tried to copy the movements. Instead, we try to copy the goals. This is the breakthrough,” Aloimonos says. This approach means the robot decides how best to achieve a goal, rather than simply copying a predetermined list of actions.
In the future, the researchers hope to include more grasping types for finer categorization and possibly use the grasp type to assist with action recognition. According to their research paper, they also hope to have the system “automatically segment a long demonstration video into action clips based on the change of grasp type.”