New algorithm unlocks high-resolution insights for computer vision

Imagine yourself glancing at a busy street for a few moments, then trying to sketch the scene you saw from memory. Most people could draw the rough positions of the major objects like cars, people, and crosswalks, but almost no one can draw every detail with pixel-perfect accuracy. The same is true for most modern computer vision algorithms: They are fantastic at capturing high-level details of a scene, but they lose fine-grained details as they process information.

Now, MIT researchers have created a system called “FeatUp” that lets algorithms capture all of the high- and low-level details of a scene at the same time — almost like Lasik eye surgery for computer vision.

When computers learn to “see” from images and videos, they build up “ideas” of what’s in a scene through something called “features.” To create these features, deep networks and visual foundation models break down images into a grid of tiny squares and process these squares as a group to determine what’s going on in a photo. Each tiny square is usually made up of anywhere from 16 to 32 pixels, so the resolution of these algorithms is dramatically smaller than the images they work with. In trying to summarize and understand photos, algorithms lose a ton of pixel clarity.
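To make that resolution gap concrete, here is a minimal PyTorch sketch (not FeatUp’s code) of how a ViT-style backbone turns pixels into a coarse grid of features. The 16-pixel patch size matches the figure in the article; the 384-channel feature width is an illustrative assumption:

```python
# Illustrative only: a patch embedding reduces a 224x224 image to a 14x14
# grid of feature vectors -- a 16x loss of spatial resolution on each side.
import torch
import torch.nn as nn

patch_size = 16                       # each "tiny square" is 16x16 pixels
image = torch.randn(1, 3, 224, 224)   # one RGB image (batch, channels, H, W)

# A patch embedding is just a strided convolution: one feature vector per patch.
patch_embed = nn.Conv2d(in_channels=3, out_channels=384,
                        kernel_size=patch_size, stride=patch_size)

features = patch_embed(image)
print(image.shape)     # torch.Size([1, 3, 224, 224])
print(features.shape)  # torch.Size([1, 384, 14, 14]) -- the coarse "summary" grid
```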

The FeatUp algorithm can stop this loss of information and boost the resolution of any deep network without compromising on speed or quality. This allows researchers to quickly and easily improve the resolution of any new or existing algorithm. For example, imagine trying to interpret the predictions of a lung cancer detection algorithm with the goal of localizing the tumor. Applying FeatUp before interpreting the algorithm using a method like class activation maps (CAM) can yield a dramatically more detailed (16-32x) view of where the tumor might be located according to the model.
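As a rough illustration of that workflow, the hypothetical sketch below reads out a CAM heatmap from coarse features and from 16x-upsampled ones. The random features, the classifier weights, and the plain bilinear upsample standing in for FeatUp’s learned upsampler are all assumptions for shape-level intuition, not the authors’ code:

```python
import torch
import torch.nn.functional as F

features = torch.randn(1, 384, 14, 14)   # stand-in for coarse backbone features
class_weights = torch.randn(384)          # stand-in classifier weights ("tumor")

def cam(feats: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # CAM: a weighted sum over feature channels at each spatial location.
    return torch.einsum("bchw,c->bhw", feats, w)

coarse_map = cam(features, class_weights)   # 14x14 heatmap: blocky localization
# Upsample the features first (FeatUp would use a learned, image-guided
# upsampler; plain bilinear is shown here only to illustrate the 16x shapes),
# then read out the same CAM at full resolution:
hires_feats = F.interpolate(features, scale_factor=16,
                            mode="bilinear", align_corners=False)
fine_map = cam(hires_feats, class_weights)  # 224x224 heatmap
print(coarse_map.shape, fine_map.shape)
```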

FeatUp not only helps practitioners understand their models, but can also improve a panoply of different tasks like object detection, semantic segmentation (assigning object labels to the pixels in an image), and depth estimation. It achieves this by providing more accurate, high-resolution features, which are crucial for building vision applications ranging from autonomous driving to medical imaging.

“The essence of all computer vision lies in these deep, intelligent features that emerge from the depths of deep learning architectures. The big challenge of modern algorithms is that they reduce large images to very small grids of ‘smart’ features, gaining intelligent insights but losing the finer details,” says Mark Hamilton, an MIT PhD student in electrical engineering and computer science, MIT Computer Science and Artificial Intelligence Laboratory (CSAIL) affiliate, and a co-lead author of a paper about the project. “FeatUp helps enable the best of both worlds: very smart representations with the original image’s resolution. These high-resolution features significantly boost performance across a spectrum of computer vision tasks, from enhancing object detection and improving depth prediction to providing a deeper understanding of your network’s decision-making process through high-resolution analysis.”

Resolution renaissance 

As these big AI models become increasingly prevalent, there’s an increasing need to explain what they’re doing, what they’re looking at, and what they’re thinking.

But how exactly can FeatUp uncover these fine-grained details? Curiously, the secret lies in wiggling and jiggling images.

In particular, FeatUp applies minor adjustments (like moving the image a few pixels to the left or right) and watches how an algorithm responds to these slight movements of the image. This results in hundreds of deep-feature maps that are all slightly different, which can be combined into a single crisp, high-resolution set of deep features. “We imagine that some high-resolution features exist, and that when we wiggle and blur them, they will match all of the original, lower-resolution features from the wiggled images. Our goal is to learn how to refine the low-resolution features into high-resolution features using this ‘game’ that lets us know how well we’re doing,” says Hamilton. This approach is analogous to how algorithms can create a 3D model from multiple 2D images by ensuring that the predicted 3D object matches all of the 2D photos used to create it. In FeatUp’s case, they predict a high-resolution feature map that’s consistent with all of the low-resolution feature maps formed by jittering the original image.
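Here is a minimal sketch of that consistency “game,” under toy assumptions: the “backbone” is just a 16x average pool over RGB, and the jitter is a wrap-around pixel shift. In the real method, the backbone is a deep network with many feature channels and the downsampling is learned, so every name below is illustrative:

```python
import torch
import torch.nn.functional as F

def jitter(img: torch.Tensor, dx: int, dy: int) -> torch.Tensor:
    # Shift the image a few pixels (a crude stand-in for the paper's small
    # transformations); torch.roll wraps around at the borders.
    return torch.roll(img, shifts=(dy, dx), dims=(2, 3))

backbone = lambda img: F.avg_pool2d(img, 16)  # toy "deep network": 16x downsample
image = torch.randn(1, 3, 224, 224)

# Learnable high-resolution feature map: one feature vector per pixel.
hr = torch.randn(1, 3, 224, 224, requires_grad=True)
opt = torch.optim.Adam([hr], lr=1e-2)

for step in range(200):
    dx, dy = torch.randint(-4, 5, (2,)).tolist()
    # Low-res features of the wiggled image...
    target = backbone(jitter(image, dx, dy))
    # ...should match the wiggled-then-blurred high-res features.
    pred = F.avg_pool2d(jitter(hr, dx, dy), 16)
    loss = F.mse_loss(pred, target)   # "how well we're doing" in the game
    opt.zero_grad()
    loss.backward()
    opt.step()
```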

The team notes that standard tools available in PyTorch were insufficient for their needs, and they introduced a new type of deep network layer in their quest for a speedy and efficient solution. Their custom layer, a special joint bilateral upsampling operation, was over 100 times more efficient than a naive implementation in PyTorch. The team also showed that this new layer could improve a variety of different algorithms, including semantic segmentation and depth prediction. This layer improved the network’s ability to process and understand high-resolution details, giving any algorithm that used it a substantial performance boost.
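For intuition, below is a deliberately naive joint bilateral upsampling sketch; the team’s custom layer implements this idea far more efficiently (reportedly over 100 times faster than a naive PyTorch version like this one). Each high-resolution output is a weighted average of its neighborhood, where weights combine spatial distance with color similarity in the full-resolution guidance image, so feature edges snap to image edges. Parameter names and values are illustrative assumptions:

```python
import math
import torch
import torch.nn.functional as F

def jbu_naive(lowres: torch.Tensor, guide: torch.Tensor,
              radius: int = 2, sigma_space: float = 1.0,
              sigma_range: float = 0.1) -> torch.Tensor:
    H, W = guide.shape[2], guide.shape[3]
    # Start from a plain bilinear upsample, then re-weight each neighborhood.
    up = F.interpolate(lowres, size=(H, W), mode="bilinear", align_corners=False)
    out = torch.zeros_like(up)
    norm = torch.zeros_like(up[:, :1])
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # torch.roll wraps at borders; a real layer would pad instead.
            feat_n = torch.roll(up, shifts=(dy, dx), dims=(2, 3))
            guide_n = torch.roll(guide, shifts=(dy, dx), dims=(2, 3))
            # Spatial term: nearby offsets count more.
            spatial = math.exp(-(dx * dx + dy * dy) / (2 * sigma_space ** 2))
            # Range term: neighbors with similar guide-image colors count more.
            rng = torch.exp(-((guide - guide_n) ** 2).sum(1, keepdim=True)
                            / (2 * sigma_range ** 2))
            w = spatial * rng
            out = out + w * feat_n
            norm = norm + w
    return out / norm

feats = torch.randn(1, 384, 14, 14)   # low-res features (illustrative)
image = torch.rand(1, 3, 224, 224)    # high-res guidance image
hires = jbu_naive(feats, image)       # -> torch.Size([1, 384, 224, 224])
```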

“Another application is something called small object retrieval, where our algorithm allows for precise localization of objects. For example, even in cluttered road scenes, algorithms enriched with FeatUp can see tiny objects like traffic cones, reflectors, lights, and potholes where their low-resolution cousins fail. This demonstrates its capability to enhance coarse features into finely detailed signals,” says Stephanie Fu ’22, MNG ’23, a PhD student at the University of California at Berkeley and another co-lead author of the new FeatUp paper. “This is especially critical for time-sensitive tasks, like pinpointing a traffic sign on a cluttered expressway in a driverless car. This can not only improve the accuracy of such tasks by turning broad guesses into exact localizations, but might also make these systems more reliable, interpretable, and trustworthy.”

What next?

Regarding future aspirations, the team emphasizes FeatUp’s potential for widespread adoption within the research community and beyond, akin to data augmentation practices. “The goal is to make this method a fundamental tool in deep learning, enriching models to perceive the world in greater detail without the computational inefficiency of traditional high-resolution processing,” says Fu.

“FeatUp represents a wonderful advance towards making visual representations really useful, by producing them at full image resolutions,” says Cornell University computer science professor Noah Snavely, who was not involved in the research. “Learned visual representations have become really good in the last few years, but they are almost always produced at very low resolution — you can put in a nice full-resolution photo and get back a tiny, postage stamp-sized grid of features. That’s a problem if you want to use those features in applications that produce full-resolution outputs. FeatUp solves this problem in a creative way by combining classic ideas in super-resolution with modern learning approaches, leading to beautiful, high-resolution feature maps.”

“We hope this simple idea can have broad application. It provides high-resolution versions of image analytics that we’d previously thought could only be low-resolution,” says senior author William T. Freeman, an MIT professor of electrical engineering and computer science and CSAIL member.

Lead authors Fu and Hamilton are joined by MIT PhD students Laura Brandt SM ’21 and Axel Feldmann SM ’21, as well as Zhoutong Zhang SM ’21, PhD ’22, all current or former affiliates of MIT CSAIL. Their research is supported, in part, by a National Science Foundation Graduate Research Fellowship, the National Science Foundation and Office of the Director of National Intelligence, the U.S. Air Force Research Laboratory, and the U.S. Air Force Artificial Intelligence Accelerator. The group will present their work in May at the International Conference on Learning Representations.