If you’ve ever uploaded a photo to Facebook and been accurately prompted to tag your own or a friend’s face in it, you’ve seen computer vision in action. Computers are getting better at seeing faces and objects in images, thanks to new methods and tools developed by researchers in the field of computer vision.
One such researcher is Cornell Tech’s Serge Belongie, Computer Science. When Belongie first started working in computer vision as an undergraduate at California Institute of Technology (Caltech) in the early 1990s, the images used in experiments were much lower resolution than the ones we see on our screens today.
“We were more intimately in touch with every pixel,” he says. “The sense that an image is nothing more than an array of numbers to a computer was hard to escape, which made it all the more mystifying to contemplate how humans perceive meaning in them.”
Belongie became interested in getting computers to glean such meaning out of images, and he hasn’t let go since. Today, he works on several projects involving computer vision, machine learning, and human-in-the-loop computing—computing that involves human interaction.
Want to look something up? Google it. Check Wikipedia. But what happens if you don’t know what something is called? What if you’re starting off with an image, say of a bird or a butterfly, and lack the vocabulary to effectively describe it in detail? That’s where Visipedia comes in.
Visipedia was born out of a collaboration between Belongie and Pietro Perona at Caltech. The two were interested in organizing visual data, while simultaneously capturing and sharing the knowledge of experts so that everyday users could locate information through imagery.
Visipedia is essentially a visual encyclopedia, funded by a Google Focused Research Award, where the search input is an image instead of text. It’s powered by heaps of image data, expert knowledge, and community engagement, and it’s built for specificity. In the beginning, Visipedia is “like a big empty tree,” Belongie says. His group then works with experts in a particular field to fill that tree with their knowledge.
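At its core, “search by image” can be thought of as a nearest-neighbor lookup over image feature vectors. The sketch below is not Visipedia’s actual system; the species labels and feature values are invented purely to illustrate the idea.

```python
# A minimal sketch of search-by-image as nearest-neighbor lookup.
# Feature vectors here stand in for descriptors a real system would
# extract from photos; all labels and values are made up.

def nearest_species(query_vec, index):
    """Return the label whose feature vector is closest (squared Euclidean)."""
    def dist_sq(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(index, key=lambda item: dist_sq(query_vec, item[1]))[0]

# Tiny hand-built index of (label, feature vector) pairs.
index = [
    ("Northern Cardinal", [0.9, 0.1, 0.2]),
    ("Blue Jay",          [0.1, 0.8, 0.7]),
    ("American Robin",    [0.4, 0.4, 0.3]),
]

# A query photo's (invented) feature vector resembles the cardinal's.
label = nearest_species([0.85, 0.15, 0.25], index)
```

A production system would extract the feature vectors with a learned model and search millions of images with an approximate-nearest-neighbor index, but the retrieval principle is the same.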
In collaboration with experts at Cornell’s Lab of Ornithology, Belongie’s group has spent the past year building the first app powered by Visipedia—a bird recognition app that can identify more than 500 North American bird species. Thanks to image data from the Lab of Ornithology, and thousands of volunteers who worked on uploading bird taxonomy, the app has more than a million crowdsourced image annotations. It debuted in 2015.
Without any help, the app can identify a bird from a photo with around 80 percent accuracy. But with a bit of a human nudge, asking the user to click on the beak, eye, and tail of the bird, the app reaches around 90 percent accuracy, a very high rate for computer vision today. This is where human-in-the-loop methods can elevate computational ability, says Belongie.
“A feature of this kind of system is that we can offer something fun and useful to real-world users, while simultaneously applying machine learning to the data we collect to improve the algorithms,” Belongie adds.
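One simple way to picture this human-in-the-loop boost: a user’s click on a bird part can re-rank the classifier’s candidates. The sketch below is not the app’s actual method; the species names, scores, and expected part locations are invented for illustration, and the update is a basic Gaussian re-weighting.

```python
# Hypothetical sketch: combine a classifier's species scores with a
# user's click on the bird's beak. All numbers are made up.
import math

def rerank_with_click(scores, part_models, click, sigma=20.0):
    """Weight each species' score by how likely the user's click is
    to land where that species' beak is expected, then normalize."""
    reranked = {}
    for species, score in scores.items():
        ex, ey = part_models[species]      # expected beak position (pixels)
        cx, cy = click
        dist_sq = (cx - ex) ** 2 + (cy - ey) ** 2
        likelihood = math.exp(-dist_sq / (2 * sigma ** 2))
        reranked[species] = score * likelihood
    total = sum(reranked.values()) or 1.0
    return {s: v / total for s, v in reranked.items()}

# Hypothetical top-2 candidates from the vision model alone:
scores = {"House Finch": 0.55, "Purple Finch": 0.45}
# Expected beak locations under each hypothesis (illustrative values):
part_models = {"House Finch": (100, 80), "Purple Finch": (140, 60)}

# The user's click falls near the House Finch hypothesis.
posterior = rerank_with_click(scores, part_models, click=(102, 78))
best = max(posterior, key=posterior.get)
```

The click adds evidence the image classifier alone didn’t have, which is how a few seconds of human effort can close the gap from 80 to 90 percent.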
Though Belongie says his group approached the project with a computer science perspective, he considers Visipedia more of a “CSx” concept, meaning it is computer science with an extra angle of community engagement.
Visipedia’s platform and software can be used for many scenarios other than birds. Belongie’s group is currently in talks with a network of Lepidoptera experts to develop an app for identifying moth and butterfly species. Ultimately, Belongie wants Visipedia to augment Wikipedia with “search-by-image” functionality for everything from animals to plants to clothing to architecture.
Teaching a Computer to Understand Visual Scenes
While Visipedia provides ultra-specific results in fine-grained categories, the aim of COCO, Common Objects in Context, is to distinguish general object categories from one another. COCO is led by Belongie’s graduate student Tsung-Yi Lin and involves collaborators across industry and academia, including Facebook, Microsoft, Caltech, UC Irvine, and Brown University. It is an image recognition, segmentation, and captioning data set that includes more than 300,000 images with more than 2.5 million annotations covering 91 common object categories (e.g., backpack, chair, sheep).
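On disk, COCO’s annotations take the form of a published JSON schema linking images, object categories, and per-instance annotations. The toy record below follows that format, but the file name and coordinates are invented for illustration.

```python
# A toy, hand-written record in COCO's JSON annotation format.
# The image file name and bounding-box coordinates are invented.
import json

coco_json = json.dumps({
    "images": [
        {"id": 1, "file_name": "000001.jpg", "width": 640, "height": 480}
    ],
    "categories": [
        {"id": 20, "name": "sheep", "supercategory": "animal"},
        {"id": 27, "name": "backpack", "supercategory": "accessory"},
    ],
    "annotations": [
        # bbox is [x, y, width, height] in pixels; iscrowd=0 marks one instance
        {"id": 101, "image_id": 1, "category_id": 20,
         "bbox": [120.0, 200.0, 90.0, 60.0], "area": 5400.0, "iscrowd": 0},
    ],
})

# Reading the record back: map category ids to names, then label each annotation.
data = json.loads(coco_json)
id_to_name = {c["id"]: c["name"] for c in data["categories"]}
labels = [id_to_name[a["category_id"]] for a in data["annotations"]]
```

Multiplied across hundreds of thousands of images and millions of instances, records like this are what object detection and segmentation systems train on.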
The data set is less concerned with fine-grained specificity, like identifying a particular type of bird, and more concerned with telling a bird apart from a car or a dog. Like Visipedia, though, COCO grows out of Belongie’s core research interest: teaching computers to see.
Unlike Visipedia, COCO is made with basic research in mind. “It is now the data set that nearly every group that is working on image recognition uses,” says Belongie. “A system trained on COCO won’t tell you what you didn’t already know, but it gets at what really motivates object recognition researchers,” which is teaching a computer to understand visual scenes.
COCO addresses three core research problems in computer scene understanding: detecting unusual perspectives of objects, contextual reasoning, and precise localization of objects. Built with more than 70,000 hours of worker and researcher time, the collection of images is extremely thorough and well annotated.
Companies like Microsoft and Facebook are interested because they want it to help them generate context-sensitive ads: once a computer knows what is in an image, it can place ads accordingly. For researchers like Belongie, however, it’s an ideal data set for advancing object detection and segmentation algorithms.
Why Cornell Tech?
Belongie came to Cornell Tech from the University of California, San Diego in 2014. He says he learned about the plans that Dan Huttenlocher, Dean of Cornell Tech, and Greg Pass, Chief Entrepreneurial Officer, had for the campus and realized he wanted to be a part of it.
Belongie says, “The effort Greg is leading to create a Dev Team—a team of developers-in-residence who work with faculty and their research groups to bring their research prototypes to life—is something unique that makes me excited about being at Cornell Tech.”
Much like his current projects, Belongie says that his future research will explore object and scene recognition, along with the potential of human computation. He points out that the community is currently enthralled with Big Data. But he aims to combine such efforts with human-in-the-loop computation, through marketplaces such as Amazon Mechanical Turk, to help machines “unlock the remaining secrets of images, video, speech, and language.”