Invited Speakers


Devi Parikh
Virginia Center for Autonomous Systems, Virginia Tech, USA.

Title: Learning Common Sense Through Visual Abstraction

Abstract: Common sense is a key ingredient in building intelligent machines that make "human-like" decisions when performing tasks -- be it automatically answering natural language questions or understanding images and videos. How can machines learn this common sense? While some of this knowledge is explicitly stated in human-generated text (books, articles, blogs, etc.), much of it is unwritten. While unwritten, it is not unseen! The visual world around us is full of structure bound by commonsense laws. But machines today cannot learn common sense directly by observing our visual world, because they cannot accurately perform detailed visual recognition in images and videos. This leads to a chicken-and-egg problem: we would like to learn common sense to allow machines to understand images accurately, but in order to learn common sense, we need accurate image parsing. We argue that the solution is to give up on photorealism. In this talk, I will describe our efforts in leveraging abstract scenes -- cartoon scenes made from clip art by crowdsourced workers -- to teach our machines common sense.


Andrea Vedaldi
Visual Geometry Group, Oxford.

Title: Synthetically-augmented data for deep text spotting

Abstract: In this talk I will discuss synthetic data augmentation as a strategy for generating large quantities of supervised training data for deep learning. This approach combines two common methods: data augmentation, which generates new training images by transforming existing ones, and synthetic data generation, which creates training images using computer graphics. Synthetic data augmentation transforms real images by inserting virtual objects obtained using computer graphics.

I will discuss the importance of realism in synthetic data augmentation, and show how computer vision techniques such as monocular depth estimation can be used to automatically insert virtual objects in a way which is geometrically consistent with a given scene geometry. I will show that by using such techniques it is possible to construct datasets that are orders of magnitude larger than manually collected ones while being sufficiently realistic for the purpose of machine learning for image understanding.
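The geometry-aware insertion described above can be sketched in a few lines. The following is a minimal numpy toy, not the speaker's pipeline: the function names, the 1 m reference-distance scaling rule, and the per-pixel z-test are all simplifying assumptions of mine. It captures the two consistency cues the talk mentions: perspective scaling of the virtual object with distance, and occlusion wherever the (estimated) scene depth is closer than the object.

```python
import numpy as np

def resize_nearest(img, new_h, new_w):
    """Nearest-neighbour resize for an H x W x C array (no extra deps)."""
    h, w = img.shape[:2]
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return img[rows][:, cols]

def composite_virtual_object(image, depth, obj_rgba, anchor, obj_depth):
    """Paste a virtual RGBA patch into `image` so that it is geometrically
    consistent with the (estimated) scene depth map:
      * perspective scaling: the patch, defined at a 1 m reference
        distance, shrinks in proportion to 1 / obj_depth;
      * occlusion: the patch is hidden wherever the real scene is closer
        to the camera than the virtual object."""
    out = image.copy()
    h, w = obj_rgba.shape[:2]
    nh = max(1, int(round(h / obj_depth)))
    nw = max(1, int(round(w / obj_depth)))
    patch = resize_nearest(obj_rgba, nh, nw)
    r0, c0 = anchor
    r1 = min(r0 + nh, image.shape[0])
    c1 = min(c0 + nw, image.shape[1])
    patch = patch[: r1 - r0, : c1 - c0]
    alpha = patch[..., 3:4] / 255.0
    # Z-test against the scene: only draw where the object is in front.
    alpha = alpha * (depth[r0:r1, c0:c1, None] > obj_depth)
    blended = alpha * patch[..., :3] + (1.0 - alpha) * out[r0:r1, c0:c1]
    out[r0:r1, c0:c1] = blended.astype(image.dtype)
    return out
```

In the talk's setting the depth map would come from monocular depth estimation rather than being given, and the inserted content would be rendered text rather than a flat RGBA patch.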

I will demonstrate these ideas in the context of text spotting. First, I will introduce a synthetic dataset, Synth Text, and show how this can be used to train deep state-of-the-art neural networks for text recognition in natural scenes without using any real images. Then, I will introduce a synthetically-augmented dataset, Synth Scene Text, and use the latter to train deep networks for text detection in natural scenes.


Ankur Handa
Dyson Robotics Lab, Imperial College London

Title: Understanding Real World Indoor Scenes: Geometry and Semantics

Abstract: Scene understanding for indoor scenes has in the past dealt predominantly with rich (real-time) reconstructions. Only lately have CNNs brought a fresh wave of understanding that goes beyond geometry alone, allowing us to reason at the level of objects and their identities within the 3D reconstructed map. We propose SceneNet, a library of labelled synthetic 3D scenes for collecting the large-scale data often required for training CNNs, and show how computer graphics can be leveraged to aid scene understanding at the level of objects. We show improved results when compared to standard real-world datasets such as NYUv2 and SUN RGB-D. We will then look into our recently proposed library gvnn -- built in torch -- which aims to bridge the gap between deep learning and geometry: it provides various geometric computer vision modules as layers that can be inserted into a CNN to enable end-to-end learning of place recognition, depth estimation, and pose estimation, in both supervised and unsupervised settings. We will then briefly touch upon the role of geometry in unsupervised learning and physical scene understanding.
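To make the "geometry as a layer" idea concrete: a typical module of this kind maps a small parameter vector (which a CNN could regress) to a differentiable geometric operation. The sketch below is a plain-numpy analogue of one such layer -- an SE(3) rigid-body transform built from Rodrigues' formula -- and is only illustrative; gvnn itself is a torch library and its actual layer names and interfaces differ.

```python
import numpy as np

def so3_exp(w):
    """Rodrigues' formula: axis-angle vector -> 3x3 rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-8:
        return np.eye(3)
    k = w / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def se3_transform_layer(points, xi):
    """'Forward pass' of a rigid-body transform layer: `xi` is a 6-vector
    (axis-angle rotation, translation) such as a pose-regression CNN might
    output; `points` is an N x 3 cloud moved into the new frame."""
    R = so3_exp(np.asarray(xi[:3], dtype=float))
    t = np.asarray(xi[3:], dtype=float)
    return points @ R.T + t
```

Because every step is a smooth function of `xi`, an autodiff framework can backpropagate a loss on the transformed points all the way to the pose parameters, which is what enables end-to-end pose and depth learning.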


Thomas Brox
University of Freiburg, Germany

Title: Training Deep Networks using Rendered Scenes

Abstract: Computer vision research is currently dominated by deep networks within the supervised learning paradigm. Only with large amounts of data can deep learning show its full potential. Thus, many resources have been spent on collecting datasets and complementing them with human annotations of class labels, bounding boxes, or even segmentation masks. Outside the field of recognition, collecting such datasets is not just tedious but simply impossible: humans cannot provide a ground-truth optical flow field or an accurate 3D reconstruction of a scene. Rendered datasets are a general alternative to real images with added human annotation, and for some learning tasks they are the only possible way to train a deep network. Rather than rendering just the input images, one can render almost every desired output. A common critique of this procedure is the missing realism of such data. We show that, at least for low-level computer vision tasks such as optical flow or disparity estimation, realism is not the major component that makes network training successful. I will also show current results of the brand-new FlowNet 2.0.


Bryan Russell
Adobe Research, USA

Title: Bridging the real-rendered view appearance gap

Abstract: In this talk I will discuss two approaches to retrieving and aligning 3D models for objects depicted in 2D still images.

The first leverages surface normal predictions along with appearance cues. Critical to its success is the ability to recover accurate surface normals for objects in the depicted scene. Our method achieves state-of-the-art accuracy on the NYUv2 RGB-D dataset for surface normal prediction, and recovers fine object detail compared to previous methods. Furthermore, we develop a two-stream network over the input image and predicted surface normals that jointly learns pose and style for CAD model retrieval. When using the predicted surface normals, our two-stream network matches prior work using surface normals computed from RGB-D images on the task of pose prediction, and achieves state-of-the-art accuracy when using RGB-D input. Finally, our two-stream network allows us to retrieve CAD models that better match the style and pose of a depicted object compared with baseline approaches.

The second approach is an end-to-end CNN for 2D-3D exemplar detection. We demonstrate that the ability to adapt the features of natural images to better align with those of CAD rendered views is critical to the success of our technique. We show that this adaptation can be learned by compositing rendered views of textured object models onto natural images. Our approach can be naturally incorporated into a CNN detection pipeline, and extends the accuracy and speed benefits of recent advances in deep learning to 2D-3D exemplar detection.
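The two-stream fusion pattern mentioned in the abstract -- one encoder per modality, concatenated features, separate heads for pose and style -- can be sketched as follows. This is a toy numpy forward pass with placeholder weight names of my own invention, not the talk's actual architecture, which uses deep convolutional streams.

```python
import numpy as np

def relu_layer(x, W, b):
    """One fully connected layer followed by ReLU."""
    return np.maximum(0.0, x @ W + b)

def two_stream_forward(rgb_feat, normal_feat, params):
    """Toy forward pass of a two-stream model: one encoder for image
    features, one for predicted-surface-normal features, fused by
    concatenation and read out by two heads (pose logits and a style
    embedding). All weight names are illustrative placeholders."""
    h_rgb = relu_layer(rgb_feat, params["W_rgb"], params["b_rgb"])
    h_nrm = relu_layer(normal_feat, params["W_nrm"], params["b_nrm"])
    h = np.concatenate([h_rgb, h_nrm], axis=-1)  # fuse the two streams
    pose_logits = h @ params["W_pose"] + params["b_pose"]
    style_embed = h @ params["W_style"] + params["b_style"]
    return pose_logits, style_embed
```

Training both heads on the shared fused representation is what lets pose and style be learned jointly rather than by two disjoint models.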


Vladlen Koltun
Intel Visual Computing Lab, Santa Clara, USA

Title: Playing for Data: Ground Truth from Computer Games

Abstract: Recent progress in computer vision has been driven by high-capacity models trained on large datasets. Unfortunately, creating large datasets with pixel-level labels has been extremely costly due to the amount of human effort required. In this talk, I will present an approach to rapidly creating pixel-accurate semantic label maps for images extracted from modern computer games. Although the source code and the internal operation of commercial games are inaccessible, we show that associations between image patches can be reconstructed from the communication between the game and the graphics hardware. This enables rapid propagation of semantic labels within and across images synthesized by the game, with no access to the source code or the content. We validate the presented approach by producing dense pixel-level semantic annotations for 25 thousand images synthesized by a photorealistic open-world computer game. Experiments on semantic segmentation datasets show that using the acquired data to supplement real-world images significantly increases accuracy, and that the acquired data enables reducing the amount of hand-labeled real-world data: models trained with game data and just 1/3 of the CamVid training set outperform models trained on the complete CamVid training set.
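The propagation step can be illustrated with a small sketch. Assume (as a simplification of mine, not the talk's actual system) that intercepting the game-to-GPU traffic yields a resource signature -- say a (mesh, texture, shader) triple -- for every image patch; then a single hand-assigned label spreads to every patch, in any frame, that shares that signature.

```python
def propagate_labels(patch_resources, sparse_labels):
    """Spread a handful of hand-assigned class labels to every patch that
    shares the same rendering-resource signature.

    patch_resources: {(frame_id, patch_id): (mesh, texture, shader)}
        signatures recovered from the game <-> graphics-hardware traffic.
    sparse_labels:   {(frame_id, patch_id): class_name} for a few patches.
    Returns {(frame_id, patch_id): class_name or None} for all patches."""
    sig_to_class = {patch_resources[k]: c for k, c in sparse_labels.items()}
    return {k: sig_to_class.get(sig) for k, sig in patch_resources.items()}
```

Because one signature typically recurs across thousands of frames, labeling effort is amortized over the whole synthesized dataset rather than paid per image.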


Adrien Treuille
VP of Simulation at Zoox / Assistant Professor at CMU, San Francisco, USA

Title: Automatic Scenario Generation for Training and Testing Autonomous Vehicles

Abstract: 3D simulation offers the opportunity to automate the training and testing of autonomous robots. At scale, however, this approach raises the question: how can we enumerate the dramatically diverse range of scenarios faced by a consumer-grade autonomous robot? This talk presents our work at Zoox Inc. designing a domain-specific language to describe scenarios for autonomous driving. Formalizing this language enables us to approach safety testing and training from a mathematical angle, and offers perhaps the most concrete approach to comprehensive performance evaluation for self-driving cars.
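To give a flavour of what formalizing scenarios buys: once a scenario is a structured value rather than prose, a test suite can be enumerated mechanically from the parameter space. The toy below is entirely my own illustration -- the field names and the idea of Cartesian-product expansion are assumptions, not Zoox's actual DSL.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class Scenario:
    actor: str        # e.g. "pedestrian", "cyclist"
    maneuver: str     # e.g. "jaywalk", "merge"
    speed_mps: float  # actor speed in metres per second
    weather: str      # e.g. "clear", "rain"

def enumerate_scenarios(actors, maneuvers, speeds, weathers):
    """Expand the parameter space of the toy language into concrete test
    cases, the way a scenario-DSL compiler might generate a test suite."""
    return [Scenario(a, m, s, w)
            for a, m, s, w in product(actors, maneuvers, speeds, weathers)]
```

The point of the formal language is exactly this kind of mechanical coverage: the suite's size and contents are a computable function of the grammar, which makes "comprehensive evaluation" a well-defined claim.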