4D Gaussian Splatting with iPhones by Guillaume LARA

4D Gaussian Splatting — iPhone Multi-Camera Volumetric Capture
A fully custom pipeline for capturing real humans in 4D — meaning reconstructed in three dimensions, across time — using nothing but four synchronized iPhones.
No dedicated volumetric studio. No expensive rig. Just synchronized mobile capture, geometry-consistent reconstruction, and a pipeline built from the ground up.
How it works
Four iPhones capture the subject simultaneously from different angles. The footage is processed through a pose estimation stage, masked per-camera for clean subject isolation, and fed into a multi-view synthesis model that generates the additional viewpoints needed for full 3D reconstruction. The final output is rendered as a 4D Gaussian Splat — a real-time, view-consistent volumetric representation of a moving human.
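The stage order described above can be sketched as a small runner. This is only an illustration of the data flow, not the project's actual code: every name here (`CaptureState`, `run_pipeline`, the stage labels) is hypothetical, and each stage is a placeholder that records which artifacts it would produce rather than calling a real model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class CaptureState:
    """Tracks which artifacts exist after each stage (hypothetical structure)."""
    camera_ids: List[str]
    artifacts: Dict[str, List[str]] = field(default_factory=dict)
    completed: List[str] = field(default_factory=list)


Stage = Callable[[CaptureState], CaptureState]


def per_camera_stage(name: str) -> Stage:
    """Placeholder stage that emits one artifact label per real camera."""
    def stage(state: CaptureState) -> CaptureState:
        state.artifacts[name] = [f"{cam}/{name}" for cam in state.camera_ids]
        state.completed.append(name)
        return state
    return stage


def global_stage(name: str, requires: List[str]) -> Stage:
    """Placeholder stage that consumes earlier artifacts from all cameras."""
    def stage(state: CaptureState) -> CaptureState:
        missing = [r for r in requires if r not in state.artifacts]
        if missing:
            raise RuntimeError(f"{name} needs {missing} first")
        state.artifacts[name] = [name]
        state.completed.append(name)
        return state
    return stage


# Order follows the description: pose, masking, view synthesis, 4D splat.
STAGES: List[Stage] = [
    per_camera_stage("pose"),
    per_camera_stage("mask"),
    global_stage("novel_views", requires=["pose", "mask"]),
    global_stage("gaussian_splat_4d", requires=["novel_views"]),
]


def run_pipeline(camera_ids: List[str]) -> CaptureState:
    """Run every placeholder stage in order and return the final state."""
    state = CaptureState(camera_ids=list(camera_ids))
    for stage in STAGES:
        state = stage(state)
    return state
```

Calling `run_pipeline(["cam0", "cam1", "cam2", "cam3"])` returns a state whose `completed` list mirrors the stage order. In a real implementation, each placeholder would be replaced by the corresponding model or tool.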
OUTPUT FROM IPHONE CAMERAS
The pipeline extracts a full-body skeleton from each camera frame using Sapiens — 133 keypoints, head to toe. That pose data then guides Diffuman4D to generate intermediate viewpoints that were never filmed, but stay geometrically consistent with the real subject's movement.
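As a rough illustration of what a 133-keypoint skeleton contains, the sketch below splits one frame's keypoints into body-part groups. It assumes the 133 points follow the COCO-WholeBody ordering (body, feet, face, left hand, right hand). The page does not state the ordering, so treat the group boundaries as an assumption, and the helper names as hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed COCO-WholeBody group sizes, listed in index order.
GROUP_SIZES: List[Tuple[str, int]] = [
    ("body", 17),
    ("feet", 6),
    ("face", 68),
    ("left_hand", 21),
    ("right_hand", 21),
]

TOTAL_KEYPOINTS = sum(count for _, count in GROUP_SIZES)


def group_slices(sizes: List[Tuple[str, int]]) -> Dict[str, slice]:
    """Turn consecutive group sizes into index slices."""
    slices: Dict[str, slice] = {}
    start = 0
    for name, count in sizes:
        slices[name] = slice(start, start + count)
        start += count
    return slices


SLICES = group_slices(GROUP_SIZES)


def split_keypoints(keypoints: Sequence) -> Dict[str, Sequence]:
    """Split one frame's keypoints (one entry per joint) by body part."""
    if len(keypoints) != TOTAL_KEYPOINTS:
        raise ValueError(
            f"expected {TOTAL_KEYPOINTS} keypoints, got {len(keypoints)}"
        )
    return {name: keypoints[s] for name, s in SLICES.items()}
```

Each entry in `keypoints` could be an `(x, y, confidence)` tuple. Grouping them this way makes it easy to, for example, inspect hand or face joints separately when checking pose quality.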
4 real cameras → 44 synthesized views
From 4 synchronized iPhones, Diffuman4D generates 44 geometrically consistent viewpoints around the subject. The INPUT frames (yellow) are real. Everything else is synthesized — but grounded in actual captured geometry, not diffusion hallucination.
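To make the 4-to-44 relationship concrete, here is a minimal sketch of one possible view layout. It assumes the 44 viewpoints sit on an evenly spaced horizontal ring around the subject and that the 4 real cameras coincide with 4 of those positions. The page states neither of these things, so both are illustrative assumptions, as are the function names.

```python
import math
from typing import List, Tuple


def ring_azimuths(n_views: int) -> List[float]:
    """Evenly spaced azimuth angles in degrees (assumed layout)."""
    return [360.0 * i / n_views for i in range(n_views)]


def ring_positions(n_views: int, radius: float) -> List[Tuple[float, float]]:
    """Camera (x, y) positions on a circle centered on the subject."""
    positions = []
    for azimuth in ring_azimuths(n_views):
        theta = math.radians(azimuth)
        positions.append((radius * math.cos(theta), radius * math.sin(theta)))
    return positions


def input_view_indices(n_views: int, n_real: int) -> List[int]:
    """Ring indices occupied by the real cameras, if they are evenly spread."""
    return [round(k * n_views / n_real) % n_views for k in range(n_real)]
```

Under these assumptions, `input_view_indices(44, 4)` places the real cameras 90 degrees apart, and every other index on the ring is a synthesized view.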
Why it matters
Most AI-generated volumetric content hallucinates: it fills in what it doesn't know. This pipeline doesn't. Every frame is anchored in geometry captured from real, synchronized footage. The result is faithful to the subject, which is the only thing that matters when you're capturing a real person.
Output
Volumetric sequences suitable for immersive content, music videos, fashion, and experimental creative projects. 4K capable. Real subjects. The goal is a geometry-consistent volumetric capture workflow that doesn't require a studio budget. Everything here is reproducible with hardware most people already have.
Still active — if you have questions, suggestions, or want to discuss any part of the stack, feel free to reach out.

Posted May 18, 2026

4 iPhones. No studio. Full volumetric human capture — geometry-consistent, no diffusion hallucination. Built from scratch, runs on consumer hardware.