Representing human performance with high fidelity is an essential building block in diverse applications such as
film production, computer games, and videoconferencing. To close the gap to production-level quality, we introduce
HumanRF, a 4D dynamic neural scene
representation that captures full-body appearance in motion from multi-view video input, and enables playback
from novel, unseen viewpoints. Our novel representation acts as a dynamic video encoding that captures fine
details at high compression rates by factorizing space-time into a temporal matrix-vector decomposition. This
allows us to obtain temporally coherent reconstructions of human actors for long sequences, while representing
high-resolution details even in the context of challenging motion.
While most research focuses on synthesizing at resolutions of 4MP or lower, we address the challenge of
operating at 12MP. To this end, we introduce ActorsHQ, a
novel multi-view dataset that provides 12MP footage from 160 cameras for 16 sequences with high-fidelity,
per-frame mesh reconstructions. We demonstrate the challenges that emerge from using such high-resolution data and
show that our newly introduced HumanRF effectively leverages this data, taking a significant step towards
novel view synthesis at production-level quality.
Given a set of input videos of a human actor in motion, captured in a multi-view camera setting, our goal is to enable temporally consistent, high-fidelity novel view synthesis. To that end, we learn a 4D scene representation using differentiable volumetric rendering, supervised via multi-view 2D photometric and mask losses that minimize the discrepancy between the rendered images and the set of input RGB images and foreground masks. To enable efficient photo-realistic neural rendering of arbitrarily long multi-view data, we use sparse feature hash-grids in combination with shallow multilayer perceptrons (MLPs).
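To make this supervision concrete, the sketch below composites per-ray density and color samples with standard emission-absorption volume rendering and applies an L2 photometric loss together with a mask loss on the accumulated ray opacity. It is a minimal illustration in PyTorch; the function names, loss weighting, and random stand-in tensors are assumptions for exposition and are not taken from our implementation.

```python
# Minimal sketch of the photometric + mask supervision (illustrative names only).
import torch

def composite(sigma, rgb, deltas):
    """Emission-absorption volume rendering along each ray.

    sigma:  (R, S)    densities at S samples on R rays
    rgb:    (R, S, 3) colors at those samples
    deltas: (R, S)    distances between consecutive samples
    """
    alpha = 1.0 - torch.exp(-sigma * deltas)              # per-sample opacity
    transmittance = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1)[:, :-1]
    weights = alpha * transmittance                       # (R, S)
    color = (weights[..., None] * rgb).sum(dim=1)         # (R, 3) rendered pixel colors
    opacity = weights.sum(dim=1)                          # (R,)   rendered foreground mask
    return color, opacity

def rendering_loss(sigma, rgb, deltas, gt_rgb, gt_mask, lambda_mask=0.1):
    """L2 photometric loss plus a mask loss on the accumulated opacity."""
    pred_rgb, pred_opacity = composite(sigma, rgb, deltas)
    loss_rgb = ((pred_rgb - gt_rgb) ** 2).mean()
    loss_mask = ((pred_opacity - gt_mask) ** 2).mean()
    return loss_rgb + lambda_mask * loss_mask

# Toy usage with random tensors standing in for network outputs and ground truth:
R, S = 1024, 64
loss = rendering_loss(
    sigma=torch.rand(R, S), rgb=torch.rand(R, S, 3), deltas=torch.full((R, S), 0.01),
    gt_rgb=torch.rand(R, 3), gt_mask=(torch.rand(R) > 0.5).float())
```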
As illustrated in the figure above, the core idea of HumanRF is to partition the time domain into optimally
distributed temporal segments, and to represent each segment by a compact 4D feature grid. For this purpose, we
extend the TensoRF vector-matrix decomposition (designed for static 3D scenes) to support time-varying 4D
feature grids.
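As a rough illustration of what such a factorization can look like, the toy module below is a 4D analogue of the vector-matrix idea: features at a space-time point are a sum of four pairings, each combining a 3D feature volume over three axes with a 1D feature vector over the remaining axis. The class name, grid resolutions, rank, and the nearest-neighbor lookup are simplifying assumptions; as described above, the full method additionally relies on sparse feature hash grids, per-segment grids, and shallow MLP decoders.

```python
# Toy 4D analogue of a vector-matrix factorization (illustrative names and sizes).
import torch

class Factorized4DGrid(torch.nn.Module):
    """Features at (x, y, z, t) are a sum over four pairings, each combining a 3D
    feature volume over three axes with a 1D feature vector over the remaining axis."""

    def __init__(self, res_xyz=32, res_t=16, channels=8):
        super().__init__()
        self.axes = ("x", "y", "z", "t")
        sizes = {"x": res_xyz, "y": res_xyz, "z": res_xyz, "t": res_t}
        self.volumes = torch.nn.ParameterDict()
        self.vectors = torch.nn.ParameterDict()
        for a in self.axes:
            others = [b for b in self.axes if b != a]
            self.volumes[a] = torch.nn.Parameter(
                0.1 * torch.randn(channels, *(sizes[b] for b in others)))
            self.vectors[a] = torch.nn.Parameter(0.1 * torch.randn(channels, sizes[a]))

    def forward(self, coords):
        """coords: (N, 4) space-time points in [0, 1]^4. Returns (N, channels) features."""
        # Nearest-neighbor indices per axis (a real implementation interpolates).
        idx = {}
        for i, a in enumerate(self.axes):
            size = self.vectors[a].shape[1]
            idx[a] = (coords[:, i].clamp(0, 1) * (size - 1)).round().long()
        feats = 0.0
        for a in self.axes:
            o = [b for b in self.axes if b != a]
            vol = self.volumes[a][:, idx[o[0]], idx[o[1]], idx[o[2]]]  # (C, N)
            vec = self.vectors[a][:, idx[a]]                           # (C, N)
            feats = feats + (vol * vec).t()                            # (N, C)
        return feats

# Query features for a batch of space-time samples; a shallow MLP would then
# decode these features into density and color.
grid = Factorized4DGrid()
features = grid(torch.rand(2048, 4))   # -> (2048, 8)
```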
Our dataset, ActorsHQ, consists of 39,765 frames of dynamic human motion captured using multi-view video. We used a proprietary multi-camera capture system combined with an LED array for global illumination. The camera system comprises 160 12MP Ximea cameras operating at 25 fps. Close-up details captured at this resolution are highlighted in the figures below. The lighting system is a programmable array of 420 LEDs that are time-synchronized to the camera shutter. All cameras were set to a shutter speed of 650 µs to minimize motion blur during fast actions.
To request the dataset, please ask your group leader/PI to sign the downloaded release form, scan it, and email it to dataset@actors-hq.com along with their name, title, and organisation. Note that both the form and the email should come from a group leader who holds a permanent position at the corresponding institution.
The following model viewer shows examples of the per-frame mesh reconstructions for the actors in the dataset. It also visualizes the camera locations; please note that you can adjust the viewpoint with the cursor.
@article{isik2023humanrf,
  title     = {HumanRF: High-Fidelity Neural Radiance Fields for Humans in Motion},
  author    = {I\c{s}{\i}k, Mustafa and Rünz, Martin and Georgopoulos, Markos and Khakhulin, Taras
               and Starck, Jonathan and Agapito, Lourdes and Nießner, Matthias},
  journal   = {ACM Transactions on Graphics (TOG)},
  volume    = {42},
  number    = {4},
  pages     = {1--12},
  year      = {2023},
  publisher = {ACM New York, NY, USA},
  doi       = {10.1145/3592415},
  url       = {https://doi.org/10.1145/3592415}
}