Procedural Human Action Videos

Procedural Generation of Videos to Train Deep Action Recognition Networks

César Roberto de Souza, Adrien Gaidon, Yohann Cabon, Antonio Manuel López Peña
Deep learning for human action recognition in videos is making significant progress, but is slowed down by its dependency on expensive manual labeling of large video collections. In this work, we investigate the generation of synthetic training data for action recognition, as it has recently shown promising results for a variety of other computer vision tasks. We propose an interpretable parametric generative model of human action videos that relies on procedural generation and other computer graphics techniques of modern game engines. We generate a diverse, realistic, and physically plausible dataset of human action videos, called PHAV for “Procedural Human Action Videos”. It contains a total of 39,982 videos, with more than 1,000 examples for each action of 35 categories. Our approach is not limited to existing motion capture sequences, and we procedurally define 14 synthetic actions. We introduce a deep multi-task representation learning architecture to mix synthetic and real videos, even if the action categories differ. Our experiments on the UCF101 and HMDB51 benchmarks suggest that combining our large set of synthetic videos with small real-world datasets can boost recognition performance, significantly outperforming fine-tuning state-of-the-art unsupervised generative models of videos.

Download.  The dataset is available for download through BitTorrent:

The dataset can be downloaded from these links using a torrent client such as Free Download Manager or Deluge. Please note that, since the dataset has been divided into independent torrent files for each modality, it is possible to download only the data modalities you are interested in.

After you have downloaded the videos, you can use the following bash script to uncompress all files under a common folder hierarchy that you can then use in your experiments. Create a file named “” in the directory that contains the sub-directories for each of the data modalities, then copy and paste the following code:

mkdir -p videos
for file in */*.tar.bz2; do
    tar xvkjf "${file}" --strip-components 1 -C videos
done

And then execute it using


Online generator. You can also interact with our generator online! We have configured a demo website where users can select the parameters of the videos they would like to see. The videos on the website are updated every night as they become available from our generator. The website also exposes a REST API that you can use to access videos with particular parameters on demand.

Both this website and the Unity-based video generation software that runs behind it will be shown during our demo session at CVPR’17. Please look for us on Sunday, July 24th, or at any other time at the NAVER LABS booth!

Attribution. When using or referring to this dataset in your research, please cite our CVPR 2017 paper [arxiv]:

Procedural Generation of Videos to Train Deep Action Recognition Networks
Cesar Roberto de Souza, Adrien Gaidon, Yohann Cabon, Antonio Manuel Lopez Pena
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017

We provide the following BibTeX entry for convenience:

@inproceedings{desouza2017procedural,
    author    = {de Souza, C{\'e}sar Roberto and Gaidon, Adrien and Cabon, Yohann and L{\'o}pez Pe{\~n}a, Antonio Manuel},
    title     = {Procedural Generation of Videos to Train Deep Action Recognition Networks},
    booktitle = {CVPR},
    year      = {2017}
}

Data modalities. The PHAV dataset includes multiple data modalities for the same video. These are:

Post-processed RGB Frames. These are the RGB frames that constitute the action video. They are rendered at 340×256 resolution and 30 FPS so that they can be fed directly to Two-Stream-style networks. These frames have been post-processed with 2x Supersampling Anti-Aliasing (SSAA), motion blur, bloom, ambient occlusion, screen-space reflection, color grading, and vignetting.
Ground-Truth Optical Flow. These are the ground-truth (forward) optical flow fields computed from the current frame to the next frame. We provide separate sequences of frames for the horizontal and vertical directions of optical flow, represented as sequences of 16-bpp JPEG images with the same resolution as the RGB frames.
Depth Maps. These are the depth-map ground truths for each frame. They are represented as sequences of 16-bit grayscale PNG images with a fixed far plane of 655.35 meters. This encoding ensures that a pixel intensity of 1 corresponds to a distance of 1 cm from the camera plane.
Semantic Segmentation. These are the per-pixel semantic segmentation ground truths containing the object class label for every pixel in the RGB frame. They are encoded as sequences of 24-bpp PNG files with the same resolution as the RGB frames. We provide 63 pixel classes, including the same 14 classes used in Virtual KITTI, classes specific to indoor scenarios, classes for the dynamic objects used in every action, and 27 classes depicting body joints and limbs.
Instance Segmentation. These are the per-pixel instance segmentation ground truths, in which each person identifier is encoded as a distinct color in a sequence of frames. They are encoded in exactly the same way as the semantic segmentation ground truth explained above.
Raw Frames. These are the raw RGB frames before any of the post-processing effects mentioned above are applied. This modality is included mostly for completeness and was not used in the experiments shown in the paper.
Textual Annotations. Although not an image modality, our generator can also produce textual annotations for every frame. Annotations include camera parameters, 3D and 2D bounding boxes, joint locations in screen coordinates (pose), and muscle information (including muscular strength, body limits, and other physics-based annotations) for every person in a frame. You will likely always want to download this modality, since it also includes the class labels and instance labels used in the other modalities of PHAV.
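To illustrate how the 340×256 post-processed frames might be prepared for a Two-Stream-style network, here is a minimal sketch of a center crop to a 224×224 input; the `center_crop` helper and the use of NumPy arrays are our own assumptions, not part of the dataset tooling:

```python
import numpy as np

def center_crop(frame: np.ndarray, size: int = 224) -> np.ndarray:
    """Center-crop an H x W x C frame to size x size, a common
    preprocessing step before feeding frames to a network."""
    h, w = frame.shape[:2]
    top = (h - size) // 2
    left = (w - size) // 2
    return frame[top:top + size, left:left + size]

# A dummy frame at the PHAV resolution (height 256, width 340).
frame = np.zeros((256, 340, 3), dtype=np.uint8)
print(center_crop(frame).shape)  # (224, 224, 3)
```

Random rather than center crops are typical at training time; the geometry is the same.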
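As a worked example of the depth encoding described above: with a 655.35 m far plane stored in 16 bits, one intensity unit equals 1 cm, so dividing raw intensities by 100 yields metric depth. A minimal sketch (the `depth_to_meters` helper is ours, not a dataset utility):

```python
import numpy as np

def depth_to_meters(depth_png: np.ndarray) -> np.ndarray:
    """Convert a 16-bit PHAV depth frame to metric depth.

    Depth is stored as 16-bit grayscale with a 655.35 m far plane,
    so intensity 1 = 1 cm and intensity 65535 = 655.35 m.
    """
    return depth_png.astype(np.float32) / 100.0

raw = np.array([[0, 100, 65535]], dtype=np.uint16)
depth = depth_to_meters(raw)  # 0 m, 1 m, and the 655.35 m far plane
```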
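Since the instance segmentation encodes each person as a distinct color, the set of people visible in a frame can be recovered by collecting the unique colors. A sketch, assuming the PNG has been loaded as an H×W×3 NumPy array (the `unique_colors` helper is hypothetical):

```python
import numpy as np

def unique_colors(seg: np.ndarray) -> set:
    """Return the set of distinct RGB colors in a segmentation frame.
    In the instance modality, each person is one distinct color."""
    flat = seg.reshape(-1, seg.shape[-1])
    return {tuple(c) for c in np.unique(flat, axis=0)}

seg = np.zeros((4, 4, 3), dtype=np.uint8)  # black background
seg[0, 0] = (255, 0, 0)                    # e.g. one person instance
seg[1, 1] = (0, 255, 0)                    # a second instance
print(len(unique_colors(seg)))  # 3 (background + 2 instances)
```

The color-to-identifier mapping itself is provided by the textual annotations, so in practice the two modalities are used together.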