A team of researchers is developing a system that will ultimately be capable of automatically describing a series of images based on the feelings such pictures might evoke.
“Captioning is about taking concrete objects and putting them together in a literal description,” Margaret Mitchell, a Microsoft researcher who is leading the research project, told PhysOrg. “What I’ve been calling visual storytelling is about inferring conceptual and abstract ideas from those concrete objects. A picture is worth 1,000 words. It’s not just worth three tags.”
Image Credit: PhysOrg (via Microsoft)
This is precisely why the research project, which is powered by Microsoft’s Sequential Image Narrative Dataset, doesn’t just process a single picture. Rather, it is designed to analyze a series of images – from the same event – and string together several sentences describing what is occurring.
“In image captioning, there are a lot of things we can do reasonably well, and that means we are ready for the next step,” Ting-Hao (Kenneth) Huang, a Ph.D. candidate at Carnegie Mellon University, told the publication. “I think the computer can generate a reasonably simple story, like what we see in a children’s book.”
As Rambus Fellow Dr. David G. Stork notes, the history of progress in computer vision and image comprehension has closely mirrored the stages of visual processing itself.
“Computer vision in the 1960s addressed problems in so-called ‘early vision,’ estimating the lightness or color of surfaces, finding edges and curves in images, and so forth. [These are] processes that are performed by the human retina and the first stages of processing in the human brain,” Stork explained. “Over the decades, computer vision progressed to ever more complicated and challenging problems, such as estimating motion, three-dimensional form and separating – segmenting – figures from the background. [These are all] processes that are performed in ‘higher’ stages of the human brain.”
A major challenge for scientists, says Stork, was getting computers to recognize objects despite wide variations in size, orientation, and position, as well as variations in lighting and color.
“In recent years, a technique known as deep learning, in which networks of large numbers of simple computational ‘neurons’ are trained with many millions of images, can recognize objects and even simple scenes, such as ‘young girl throwing a ball to a dog,’” Stork continued. “Although humans generally take their ability to recognize scenes for granted, this is a remarkably difficult task. The Microsoft team is building upon decades of work in order to recognize actions, and especially the intentions and goals of people within a scene.”
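For readers who want a concrete sense of the deep learning step Stork describes, here is a minimal sketch that classifies a single photo with a pretrained convolutional network. It is only an illustration: the PyTorch and torchvision libraries, the ResNet-50 model, and the file name are assumptions, not part of Microsoft's system.

```python
# A minimal illustration (not Microsoft's system): classify one photo with a
# pretrained convolutional network, the basic "deep learning" recognition step
# Stork describes. Assumes PyTorch and torchvision are installed and that a
# local file named "photo.jpg" exists.
import torch
from torchvision import models, transforms
from PIL import Image

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.eval()

# Standard ImageNet preprocessing: resize, crop, convert to tensor, normalize.
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("photo.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)  # shape: (1, 3, 224, 224)

with torch.no_grad():
    probabilities = torch.softmax(model(batch), dim=1)

top_prob, top_class = probabilities.topk(1)
print(f"Predicted ImageNet class index {top_class.item()} "
      f"(probability {top_prob.item():.2f})")
```

Recognizing that a picture contains a girl, an apple, and a table is this kind of classification; the storytelling work goes a step further and asks what the people in the scene intend to do.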
To illustrate his point, Stork offers the image of a child reaching for an apple on a table.
“The story behind this image is ‘the girl wants to eat the apple,’” he said. “Yet, how do we know the girl wants the apple? And how do we know she wants to eat it? Well, the team used machine learning methods applied to a large corpus of pictures and crowd-sourced stories describing those images.”
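As Stork sketches it, that training recipe amounts to supervised learning over paired examples: an ordered set of photos from one event alongside the crowd-sourced story written about them. The snippet below shows one hypothetical way such pairs might be represented and batched for training; the field names, sample sentences, and helper function are illustrative assumptions, not the team's actual data format.

```python
# A hedged sketch of how crowd-sourced image/story pairs might be organized
# for supervised training. The dataclass fields, sample data, and batching
# helper are hypothetical illustrations, not Microsoft's actual dataset format.
from dataclasses import dataclass
from typing import Iterator, List

@dataclass
class StoryExample:
    image_paths: List[str]   # ordered photos from a single event
    sentences: List[str]     # one crowd-sourced story sentence per photo

def batches(examples: List[StoryExample], size: int) -> Iterator[List[StoryExample]]:
    """Yield fixed-size batches of (image sequence, story) pairs."""
    for start in range(0, len(examples), size):
        yield examples[start:start + size]

# Illustrative corpus of one example event.
corpus = [
    StoryExample(
        image_paths=["party_01.jpg", "party_02.jpg", "party_03.jpg"],
        sentences=[
            "Everyone gathered for the birthday party.",
            "The candles were lit on the cake.",
            "She made a wish before blowing them out.",
        ],
    ),
]

for batch in batches(corpus, size=1):
    for example in batch:
        print(len(example.image_paths), "images ->", " ".join(example.sentences))
```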
According to Stork, if the Microsoft-led team is successful, researchers will have access to new, more powerful and intuitive image searching tools.
“These complicated and distributed computer methods don’t readily expose how the task is being performed,” he added. “Then again, people themselves often have difficulty describing what goes through their mind when they analyze a scene and infer the story being portrayed.”