In this work we introduce a convolutional neural network (CNN) that jointly handles low-, mid-, and high-level vision tasks in a unified architecture trained end-to-end. Such a universal network can act like a ‘Swiss knife’ for vision tasks; we call this architecture an UberNet to indicate its overarching nature.
We address two main technical challenges that emerge when broadening the range of tasks handled by a single CNN: (i) training a deep architecture while relying on diverse training sets and (ii) training many (potentially unlimited) tasks with a limited memory budget. Properly addressing these two problems allows us to train predictors for a host of tasks without compromising accuracy.
Through these advances we train, end-to-end, a CNN that simultaneously addresses (a) boundary detection, (b) normal estimation, (c) saliency estimation, (d) semantic segmentation, (e) human part segmentation, (f) semantic boundary detection, and (g) region proposal generation and object detection. We obtain competitive performance while jointly addressing all of these tasks in 0.7 seconds per frame on a single GPU. A demonstration of this system can be found at cvn.ecp.fr/ubernet/.
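
To make challenge (i) concrete: when training sets are diverse, a given image typically carries annotations for only some of the tasks. The text above does not specify how this is handled, so the following is only a minimal sketch of one common approach, assuming that unannotated tasks are simply skipped in the per-sample loss. The function name `multitask_loss`, the task keys, and the optional weights are hypothetical and chosen for illustration.

```python
from typing import Dict, Optional


def multitask_loss(
    per_task_loss: Dict[str, float],
    has_labels: Dict[str, bool],
    weights: Optional[Dict[str, float]] = None,
) -> float:
    """Sum the losses of only those tasks that are annotated for this sample.

    per_task_loss: loss value computed for each task head (hypothetical keys).
    has_labels:    whether ground truth exists for each task on this sample.
    weights:       optional per-task weights; a missing entry defaults to 1.0.
    """
    total = 0.0
    for task, loss in per_task_loss.items():
        if not has_labels.get(task, False):
            # No ground truth for this task on this sample (assumption:
            # such tasks contribute nothing to the loss or its gradient).
            continue
        w = 1.0 if weights is None else weights.get(task, 1.0)
        total += w * loss
    return total


if __name__ == "__main__":
    # A sample annotated for boundaries and saliency, but not for normals.
    losses = {"boundaries": 0.5, "normals": 2.0, "saliency": 1.0}
    labels = {"boundaries": True, "normals": False, "saliency": True}
    print(multitask_loss(losses, labels))
```

Under this assumption, a sample drawn from a boundary-detection dataset updates only the shared layers and the boundary head, while the other task heads are left untouched for that sample.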