Read on to learn why we're building world models.


Why Build Coda Robotics?

AI is now delivering on its long-promised potential, evident in the explosion of services across text, images, video, audio, and more. Yet one modality is still years behind - physical intelligence. Much like the early days of NLP, robotics remains constrained by Moravec's paradox: tasks that are trivial for humans, like grasping a cup, are still hard for machines, and robotic policies have not generalized at the pace seen in other modalities.

Yet over the last 18 months, there has been a strong push to learn from the advances in NLP and vision and apply the same breakthroughs to robotics. This push has led the best labs worldwide, in both industry and academia, to collaborate toward a generalist, physically embodied agent. Out of these efforts Vision-Language-Action (VLA) models were born, beginning with Google's release of RT-1 in 2022. Since then, leading professors have started their own companies, and billions of dollars of capital were allocated to robotics startups in 2024 alone. This confluence of talent, capital, and policy improvements suggests we are getting closer to a major breakthrough in robotics.

However, when comparing NLP and vision to robotics, many of the core ingredients needed to build a robotics-first foundation model are still missing. Coda Robotics was started on the idea that we need to make significant progress on two principles proven out in NLP and vision models:

1. Positive Transfer from Scale

The success of LLMs is largely attributable to years of humans collectively sharing text online since the inception of the internet, which produced an enormous corpus to train text models on. Scaling laws are well established for LLMs: training on more data reliably improves these models. Taking the same approach to building VLAs, however, faces different challenges. Data collection for robotics is hard, and there is nowhere near the amount of robot video that there is text. This has hindered the generalization of VLAs - today we are nowhere near even GPT-2-level generality. It has also led to a set of unique methods for scaling robot data:

01. Teleoperation data

The highest-quality data, yet very hard to scale. Pre-training data collection efforts can take up to a year to gather the roughly 10,000 hours needed, at a cost running into the hundreds of thousands of dollars. And researchers who want to post-train these foundation models with 10-20 hours of data per task, depending on task complexity, spend days collecting data for just a few tasks.

02. Physics simulations

A promising way to scale data collection without costs growing in step with data volume, and one leveraged to pre-train recent models like NVIDIA's GR00T N1. Yet these simulations still exhibit the well-known sim-to-real gap, and their biggest bottleneck is that environments are generated manually - a labor-intensive process requiring specialized expertise.

03. Human videos

A path toward unsupervised learning, and one heavily favored in academia because of the significant cost savings. But the embodiment gap between humans and robots remains, which is why industry players, with their greater resources, have not leaned on this technique as much (yet).

Given the current robotics data pyramid, we believe world models that combine human videos and robot-collected data can generate high-quality training data at significantly lower cost. This approach not only improves data collection efficiency but also enables scalable policy learning through reinforcement learning. Crucially, world models will serve as a backbone for training policies in the wild, where real-world data is limited, noisy, or expensive to collect.
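To make the reinforcement-learning angle concrete, here is a minimal sketch of how a policy could be trained entirely inside a learned world model. The `WorldModel` and `Policy` interfaces below are our own illustrative assumptions, not an existing API:

```python
# Minimal sketch - `WorldModel` and `imagine_rollout` are hypothetical
# interfaces for illustration, not a real library API.
import numpy as np


class WorldModel:
    """A learned dynamics model: given past frames and an action,
    predict the next frame and estimate a scalar reward."""

    def predict(self, frames: list, action: np.ndarray) -> np.ndarray:
        raise NotImplementedError  # stands in for a learned video model

    def reward(self, frame: np.ndarray) -> float:
        raise NotImplementedError  # e.g. a learned task-success score


def imagine_rollout(model, policy, context, horizon=50):
    """Roll a policy forward entirely inside the world model.

    `context` is a short list of real (or human-video) frames seeding
    the rollout; no robot hardware is touched after that point.
    """
    frames = list(context)
    trajectory = []
    for _ in range(horizon):
        action = policy(frames[-1])                 # act on latest imagined frame
        next_frame = model.predict(frames, action)  # imagine one step ahead
        trajectory.append((frames[-1], action, model.reward(next_frame)))
        frames.append(next_frame)
    return trajectory  # (observation, action, reward) tuples for an RL update
```

Because every step of the rollout is imagined, the policy can be exercised on far more interaction data than real hardware could produce in the same time.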

2. Robust Safety

All roads lead to generalist models, but because they can do "anything," they must be evaluated on everything. Labs pre-training robotics foundation models typically run 3,000-3,600 trials per evaluation cycle, and these cycles often stretch over a month because the evaluation environment - real or simulated - is set up manually. This severely slows iteration.

World models offer a scalable alternative. By simulating diverse environments, they allow rapid testing of robotic systems across out-of-distribution scenarios. We're particularly excited by their ability to replay failure events: instead of recreating scenes manually, roboticists can feed a sequence of recorded frames into the world model, simulate alternative trajectories, and gather data on the ones that succeed.
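As a sketch of that failure-replay idea, again under assumed interfaces: condition the world model on the frames leading up to a failure, branch many alternative continuations, and keep those that succeed. `sample_action` and the success threshold are illustrative assumptions, and `model` follows the hypothetical `WorldModel` interface sketched earlier:

```python
# Sketch of failure replay - reuses the hypothetical WorldModel interface
# above; `sample_action` is an assumed exploration policy, not a real API.
import numpy as np


def sample_action(frame: np.ndarray) -> np.ndarray:
    """Assumed exploration policy: a random 7-DoF arm action."""
    return np.random.uniform(-1.0, 1.0, size=7)


def replay_failure(model, failure_frames, n_branches=32, horizon=50,
                   success_threshold=0.9):
    """Branch alternative futures from a recorded failure.

    `failure_frames` are the real frames leading up to the failure; the
    world model imagines `n_branches` alternative continuations, and we
    keep those whose final predicted reward clears the (assumed) threshold.
    """
    recoveries = []
    for _ in range(n_branches):
        frames = list(failure_frames)  # condition on the real failure prefix
        actions = []
        for _ in range(horizon):
            action = sample_action(frames[-1])
            frames.append(model.predict(frames, action))
            actions.append(action)
        if model.reward(frames[-1]) > success_threshold:
            recoveries.append((frames, actions))  # a successful alternative
    return recoveries  # usable as additional training data
```

The point of the sketch is that recovering useful data from a logged failure requires no manual scene reconstruction - the recorded frames themselves define the starting state.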

To bridge the gap between today's labor-intensive approaches and tomorrow's general-purpose robotics, the community needs to adopt faster, cheaper, and safer ways to scale foundation models. At Coda Robotics, we're building world models that make this possible - infusing scalability and safety into the core of robotic learning.

Julian Saks

Co-Founder, CEO

Juan Vera

Co-Founder, Chief Scientist