Three Routes For Embodied Models: VLA, World Models, And WAM

sky_io@outlook.com (K4i) — Thu, 18 Jun 2026 10:00:00 +0800

A language model only has to answer with text. An embodied model has to answer one extra question: what action should this sentence become?

Suppose you tell a tabletop robot: “push the red cup next to the plate.” The model must identify the cup, understand “next to,” decide how the arm should move, close or release the gripper at the right moment, and recover if the cup slips. The hard part is not multimodality alone. It is the closed loop between language, vision, physical state, and continuous action: the action changes the world, and the new world changes the next action.
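To make the closed loop concrete, here is a minimal Python sketch of the cup example reduced to one table axis. Everything in it is a hypothetical stand-in for illustration, not any real robot API: `World`, `observe`, `policy`, `step`, the 5 cm push limit, and the slip values are all assumptions. The point is only to contrast a controller that looks again after every action with one that plans once and replays the plan.

```python
from dataclasses import dataclass


@dataclass
class World:
    cup_x: float     # cup position along one table axis, in metres (assumed 1-D toy)
    target_x: float  # the spot "next to the plate"


def observe(world: World) -> float:
    # Stand-in for perception: the remaining distance from cup to target.
    return world.target_x - world.cup_x


def policy(error: float, max_step: float = 0.05) -> float:
    # Stand-in for the model: push toward the target, at most max_step per action.
    return max(-max_step, min(max_step, error))


def step(world: World, action: float, slip: float = 0.0) -> None:
    # The action changes the world; slip models the cup sliding back a little.
    world.cup_x += action - slip


def run_closed_loop(world: World, slips: list[float], tol: float = 0.005) -> int:
    # Observe, act, observe again: each action is chosen from the new world state.
    steps = 0
    for slip in slips:
        error = observe(world)
        if abs(error) <= tol:
            break
        step(world, policy(error), slip)
        steps += 1
    return steps


def run_open_loop(world: World, slips: list[float]) -> int:
    # Plan every push from the first observation, then replay without looking again.
    error = observe(world)
    plan = []
    while abs(error) > 1e-9:
        action = policy(error)
        plan.append(action)
        error -= action
    for action, slip in zip(plan, slips):
        step(world, action, slip)
    return len(plan)


# One slip of 3 cm on the third push; all other pushes land cleanly.
slips = [0.0, 0.0, 0.03, 0.0, 0.0, 0.0, 0.0, 0.0]

closed = World(cup_x=0.0, target_x=0.2)
closed_steps = run_closed_loop(closed, slips)

opened = World(cup_x=0.0, target_x=0.2)
open_steps = run_open_loop(opened, slips)
```

In this toy setup the closed-loop run spends one extra push to absorb the slip and ends on target, while the open-loop run finishes its plan and leaves the cup short. A real embodied model replaces `observe` with camera input and `policy` with a learned network, but the loop structure is the same.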
