"Animate anyone in any video to say (or sing) anything you want in any language."
sync.labs is building audio-visual models to generate, modify, and synthesize humans in video.
Founded by Prady Modukuru, Prajwal K R, and Rudrabha Mukhopadhyay
About
They’ve built a state-of-the-art lip-sync model – and they’re building towards real-time, face-to-face conversations w/ AI that are indistinguishable from talking to another human 🦾
Try Sync's playground here: https://app.synclabs.so/playground
How does it work?
Theoretically, their models can support any language: they learn phoneme-to-viseme mappings, i.e. how the most basic units ("tokens") of the sounds we make map to the shapes our mouths form to create them. It’s simple, but it’s a start towards learning a foundational understanding of humans from video.
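To make the phoneme / viseme idea concrete, here’s a toy lookup table, a minimal sketch rather than sync.labs’ method: their models learn this sound-to-mouth-shape correspondence implicitly from audio + video, and the phoneme labels and viseme groupings below are illustrative assumptions.

```python
# Toy illustration of a phoneme -> viseme mapping (NOT sync.labs' model:
# their networks learn this correspondence end-to-end from audio + video).
# Phoneme labels are ARPAbet-style; viseme classes are a coarse, assumed grouping.

PHONEME_TO_VISEME = {
    # bilabials: lips pressed together
    "P": "closed_lips", "B": "closed_lips", "M": "closed_lips",
    # labiodentals: lower lip against upper teeth
    "F": "lip_teeth", "V": "lip_teeth",
    # rounded vowels
    "UW": "rounded", "OW": "rounded",
    # open vowels
    "AA": "open", "AE": "open",
}

def phonemes_to_visemes(phonemes: list[str]) -> list[str]:
    """Map a phoneme sequence to viseme classes, falling back to 'neutral'."""
    return [PHONEME_TO_VISEME.get(p, "neutral") for p in phonemes]

if __name__ == "__main__":
    # "mama" -> M AA M AA : alternating closed lips and open mouth
    print(phonemes_to_visemes(["M", "AA", "M", "AA"]))
    # -> ['closed_lips', 'open', 'closed_lips', 'open']
```

Because the mapping operates at the level of sounds and mouth shapes rather than words, it isn’t tied to any particular language.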
Why is this useful?
[1] They can dissolve language as a barrier
Check out how they used it to dub the entire 2-hour Tucker Carlson interview so that Putin speaks fluent English.
Imagine millions gaining access to knowledge, entertainment, and connection — regardless of their native tongue.
Realtime at the edge takes us further: live multilingual broadcasts + video calls, even walking around Tokyo w/ a Vision Pro 2 speaking English while everyone else speaks Japanese.
[2] They can move the human-computer interface beyond text-based chat
Keyboards / mice are lossy + low-bandwidth. Human communication is rich and goes beyond just the words we say. What if we could compute through face-to-face interaction?
Maybe embedding context around expressions + body language in inputs / outputs would help us interact w/ computers in a more human way. This thread of research is exciting.
[3] and more
Powerful models small enough to run at the edge could unlock a lot:
e.g.
Extreme compression for face-to-face video streaming (rough bandwidth math sketched after this list)
Enhanced, spatially-aware transcription w/ lip-reading
Detecting deepfakes in the wild
On-device real-time video translation
etc.
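To give a feel for the compression point above, here’s a back-of-the-envelope sketch: instead of streaming encoded video frames, a sender could transmit one reference frame plus per-frame facial keypoints and let a generative model on the receiver re-render the face. All numbers are illustrative assumptions, not sync.labs figures.

```python
# Back-of-the-envelope: why a generative face model enables extreme compression.
# Every number below is an illustrative assumption.

# Conventional stream: a typical 720p / 30 fps video-call bitrate.
conventional_kbps = 1500

# Keypoint-driven stream: send one reference frame once, then only facial
# keypoints per frame; the receiver's generative model re-renders the face.
keypoints_per_frame = 68          # classic facial-landmark count (assumption)
bytes_per_keypoint = 4            # two 16-bit coordinates (assumption)
fps = 30
keypoint_kbps = keypoints_per_frame * bytes_per_keypoint * 8 * fps / 1000

print(f"conventional: ~{conventional_kbps} kbps")
print(f"keypoints only: ~{keypoint_kbps:.0f} kbps "
      f"(~{conventional_kbps / keypoint_kbps:.0f}x smaller)")
```

Under these assumptions the keypoint stream is roughly 20x smaller than the conventional one, and the gap grows with video resolution since keypoint count stays fixed.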