In this talk, we show how the open-source movie Sintel can provide benchmarks for different computer vision scenarios. To compare different algorithms, the computer vision community needs common benchmarks, i.e. datasets of images or videos with additional ground-truth information, such as segmentations or per-pixel motion. Creating such datasets is hard and time-consuming. Therefore, they are often small, simplistic (e.g. they contain only static scenes), or, in the case of synthetic datasets, lack visual realism.
In contrast, the Sintel production data contains large amounts of 3D data that is more complex and visually more realistic than existing synthetic datasets. However, the structure of this data is determined by the artists and does not necessarily correspond to the representations used in computer vision to reconstruct the structure of a 3D scene. For example, "objects" in Blender scenes often do not correspond to real-world objects.
This talk shows how we modified Blender and Sintel to extract data and convert it to appropriate representations, making it usable for computer vision benchmarks. We present three scenarios: optical flow (per-pixel motion estimation), segmentation, and depth estimation, all of which are of great interest to the computer vision community.
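As a concrete illustration of what such benchmark data looks like on the consumer side, here is a minimal sketch of a reader for the Middlebury `.flo` file format, which is commonly used to distribute per-pixel optical flow ground truth (including by the MPI-Sintel benchmark). The function name `read_flo` is our own; the format itself is a fixed magic number, the image dimensions, and interleaved (u, v) displacement values.

```python
import struct
import numpy as np

def read_flo(path):
    """Read a Middlebury .flo optical flow file into an (H, W, 2) float array."""
    with open(path, "rb") as f:
        # The file starts with the float 202021.25 ("PIEH") as a sanity check.
        magic = struct.unpack("<f", f.read(4))[0]
        if abs(magic - 202021.25) > 1e-3:
            raise ValueError("invalid .flo magic number")
        width, height = struct.unpack("<ii", f.read(8))
        # Interleaved (u, v) per pixel, row-major, little-endian float32.
        data = np.frombuffer(f.read(width * height * 2 * 4), dtype="<f4")
    return data.reshape(height, width, 2)
```

A flow field read this way gives, for every pixel of frame t, the displacement to its corresponding position in frame t+1, which is exactly the quantity an optical flow algorithm is evaluated against.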