Research paper

Turning MiniCPM-V 4.6 into a Fire Boy VLA

This page packages the training story into a judge-ready research artifact: how the MiniCPM-V backbone was frozen, how the router was added, which action-head experiments failed, which ones survived closed-loop physics, where Modal and RunPod fit, and how Codex helped scaffold the whole experimental loop.

Architecture diagram for MiniCPM-V 4.6, router head, action heads, skill dispatch, and runtime proof
  • 0/20: single-step manipulation before action chunks
  • 20/20: pick and eat after chunked action policies
  • 512/512: frozen router skill choices on held-out eval rows
  • 0.0170: frozen router mean parameter MAE

Demo story

The objective: an AI-age virtual toy with a real inner loop

This project came from an earlier attempt to build a real robot. The practical pivot was to build the thing that felt most alive first: a virtual toy, somewhere between Talking Tom, Tamagotchi, and an imaginary Pokémon-like companion, but updated for multimodal AI, physics, memory, and learned action. The goal is not just a chatbot with a mascot. The goal is a small creature that sees its room, understands commands, forms habits, chooses actions, plays with objects, reacts physically, and slowly develops a recognizable personality.

Why virtual first? Simulation lets the pet fail, recover, explore, and build behavior data without the cost and fragility of a physical robot.
Why a tiny VLA? A small MiniCPM-class model can combine vision, language, state, and action while staying cheap enough for experiments and consumer-grade deployment.
Why it matters? AI pets can become autonomous toys: they can interact with humans, other toys, and their own environment in ways that scripted games cannot.

The hackathon build is the tiny version of that thesis. Fire Boy can be commanded, routed through a learned MiniCPM-V policy layer, retargeted into a rigged body, and shown with proof videos and debug traces. The future version is a full-stack pet brain: perception, memory, personality, physics rollouts, skill learning, reflex control, and a slow deliberative model working together.

Abstract

What changed from “vision model” to “vision-language-action model”?

MiniCPM-V 4.6 already supplies image-language understanding, but Fire Boy needs actions: walk to a target, run around, pick up an object, eat a berry, and retarget proof trajectories onto a rigged Three.js body. The conversion first freezes the MiniCPM-V backbone, extracts a 1024-dimensional vision-language embedding, concatenates a 42-dimensional robot-state vector, and trains small heads on that combined representation. The public demo promotes the skill-parameter router because it proved more reliable than a single monolithic low-level controller.

Backbone and state

The router reads camera/image context, the user instruction, root pose, hand and mouth positions, target positions, task flags, stage flags, previous action, and grasp/eaten state. That state vector matters because the image alone does not expose velocities, contact phase, prior action, or whether an object is already held.

Actions and dispatch

The final router predicts one of four skills plus six parameters: target_x, target_y, target_z, radius, speed_hint, and object_is_berry. Bounds and stabilizers keep those outputs in safe scene ranges before dispatching to policy registry skills.
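The stabilizer step can be sketched in a few lines. This is a minimal illustration, not the shipped code: the numeric bounds, the `stabilize` function, and its `explicit_target`, `heuristic_skill`, and `force_neural` arguments are assumptions made for the example. Only the skill and parameter names come from the router contract.

```python
from typing import Dict, Optional

SKILLS = ["walk_to", "run_around", "pick_up", "find_and_eat_berry"]

# Hypothetical scene bounds per parameter (assumed for illustration only).
BOUNDS = {
    "target_x": (-2.0, 2.0),
    "target_y": (-2.0, 2.0),
    "target_z": (0.0, 1.5),
    "radius": (0.1, 1.5),
    "speed_hint": (0.0, 1.0),
    "object_is_berry": (0.0, 1.0),
}


def clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def stabilize(
    neural_skill: str,
    params: Dict[str, float],
    explicit_target: Optional[Dict[str, float]] = None,
    heuristic_skill: Optional[str] = None,
    force_neural: bool = False,
) -> Dict[str, object]:
    """Clip parameters, copy explicit scene targets, and pick the skill."""
    safe = {name: clamp(params[name], *BOUNDS[name]) for name in BOUNDS}
    if explicit_target is not None:
        # An explicit scene target overrides the predicted coordinates.
        for axis in ("target_x", "target_y", "target_z"):
            if axis in explicit_target:
                safe[axis] = clamp(explicit_target[axis], *BOUNDS[axis])
    # Prefer the heuristic skill for explicit commands unless forced neural.
    skill = neural_skill
    if heuristic_skill is not None and not force_neural:
        skill = heuristic_skill
    if skill not in SKILLS:
        raise ValueError(f"unknown skill: {skill}")
    return {"skill": skill, "params": safe}


# Made-up raw router output with out-of-range values.
raw = {"target_x": 3.7, "target_y": -0.4, "target_z": -0.2,
       "radius": 0.5, "speed_hint": 1.4, "object_is_berry": 0.9}
decision = stabilize("walk_to", raw, heuristic_skill="pick_up")
```

Here the out-of-range `target_x`, `target_z`, and `speed_hint` are clipped, and the heuristic skill wins because `force_neural` is left off. Keeping this layer outside the learned head means a bad prediction degrades into a bounded, dispatchable command rather than an unsafe one.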

Fire Boy modeling pipeline from concept art to SAM cleanup, rigging, MuJoCo body, and live retarget

Modeling Pipeline

Fire Boy starts as concept art, moves through SAM-style extraction and cleanup, becomes a 20-bone Blender GLB, is matched to a MuJoCo body, and finally returns to the browser as retargeted Toy Room motion.

Local, RunPod, Modal, Newton, and Toy Room artifact synchronization diagram

VM and Runtime Sync

The practical research loop was local orchestration, RunPod GPU training/eval, artifact copyback, Modal serverless inference, and a documented Newton/Warp lane for future GPU rollout throughput.

Trainable parameter counts for router head, residual action head, and LoRA adapter

Head Sizes

The promoted frozen router trains 814,090 head parameters. The residual 10x32 action-chunk head trains 1,078,912 parameters, while the LoRA adapter adds 4,743,168 trainable parameters across layers 0-23.
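The router figure can be checked by hand. The sketch below assumes the 1066 -> 512 -> 512 -> skill(4) + params(6) layout noted in the artifact's size listing, with a bias on every linear layer and no normalization layers. Those last two points are assumptions, although the total matching the reported count supports them.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


VL_DIM, STATE_DIM, HIDDEN = 1024, 42, 512
N_SKILLS, N_PARAMS = 4, 6

router_head = (
    linear_params(VL_DIM + STATE_DIM, HIDDEN)  # 1066 -> 512: 546,304
    + linear_params(HIDDEN, HIDDEN)            # 512 -> 512:  262,656
    + linear_params(HIDDEN, N_SKILLS)          # skill logits:  2,052
    + linear_params(HIDDEN, N_PARAMS)          # parameters:    3,078
)
print(router_head)  # 814090
```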

Sample rollout observations and Toy Room proof screenshots

Rollout Samples

The dataset rows are image + state + command + action labels sampled from rollout frames, not isolated stills. Those same proof paths become the browser policy gallery and retarget evidence.
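One way to picture a router row is as a small typed record that round-trips through JSONL. The class name, field names, and file path below are hypothetical. The 42-dimensional state, the four-skill vocabulary, and the six parameters follow the router description elsewhere in this paper.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class RouterRow:
    image_path: str     # rendered rollout frame (hypothetical field name)
    instruction: str    # user command
    state: List[float]  # normalized robot-state vector
    skill_id: int       # index into the four-skill vocabulary
    params: List[float] # six grounded parameters

    def validate(self) -> None:
        if not 0 <= self.skill_id < 4:
            raise ValueError("skill_id must be in [0, 4)")
        if len(self.params) != 6:
            raise ValueError("expected six parameters")


def to_jsonl_line(row: RouterRow) -> str:
    row.validate()
    return json.dumps(asdict(row))


def from_jsonl_line(line: str) -> RouterRow:
    row = RouterRow(**json.loads(line))
    row.validate()
    return row


example = RouterRow(
    image_path="frames/episode_000/frame_012.png",  # made-up path
    instruction="pick up the ball",
    state=[0.0] * 42,
    skill_id=2,
    params=[0.4, 0.1, 0.05, 0.3, 0.5, 0.0],
)
round_trip = from_jsonl_line(to_jsonl_line(example))
```

Validating on both write and read keeps malformed rows out of training and out of the browser gallery, which matters when the same files serve as proof artifacts.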

Policy training ladder from rollouts to behavior cloning, chunks, residual VLA, router, and future reinforcement learning

Policy Training Ladder

The technical path was deliberately staged: behavior cloning first, action chunks when one-step control failed, residual VLA heads for manipulation, then a safe router for the public demo.

Future single-brain virtual pet stack with MiniCPM brain, validated action contract, fast local controller, memory, Modal inference, and rollout factory

Future Pet Stack

The larger version should separate a slow MiniCPM/Omni-style pet brain from a fast local reflex controller so the creature can feel alive without waiting on every model call.

Paper-style notes

How the embodied asset and policy loop were built

The avatar pipeline is part of the model, not decoration. The SAM-cleaned Fire Boy body was repaired into a single connected mesh, bound to a humanoid skeleton, exported with motion clips, then compared against a MuJoCo body so simulator traces could drive the browser rig. The Fire Boy rig report records 3,311 vertices, zero unweighted vertices, and a clean single-component body for the shipped base.

The data pipeline is similarly explicit. The first four-skill manifest had 64 episodes and 2,368 image rows; the focused manipulation manifest expanded contact-heavy tasks to 144 episodes and 6,192 rows; the all-skill router manifest wrote 3,072 skill-parameter rows with images required.
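The manifest sizes imply a fixed number of sampled frames per episode, which is a quick sanity check on the counts. The dictionary keys below are illustrative names. The even per-skill split of the router manifest is an assumption based on the manifest being described as balanced.

```python
manifests = {
    "first_four_skill": {"episodes": 64, "rows": 2368},
    "focused_manipulation": {"episodes": 144, "rows": 6192},
}

rows_per_episode = {
    name: m["rows"] / m["episodes"] for name, m in manifests.items()
}
print(rows_per_episode)
# {'first_four_skill': 37.0, 'focused_manipulation': 43.0}

# Assumption: the 3,072 router rows are split evenly over four skills.
router_rows_per_skill = 3072 // 4
print(router_rows_per_skill)  # 768
```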

# Sketch of the promoted VLA router
vl = MiniCPM_V_4_6(image, instruction).mean_pool()  # 1024-d, frozen
state = build_robot_state(root, hands, mouth, targets, flags, previous_action)  # 42-d
h = silu(linear(concat(vl, state), 1066, 512))
h = silu(linear(h, 512, 512))
skill_logits = linear(h, 512, 4)
params = linear(h, 512, 6)  # target_xyz, radius, speed_hint, object_is_berry
loss = cross_entropy(skill_logits, skill_id) + 0.35 * mse(params, target_params)

Experiment success-rate graph

Experiment Progression

Single-step manipulation failed, chunked manipulation worked, the first mixed VLA dataset exposed data imbalance, and the final router solved the public-demo decision layer.

Router validation-loss and LoRA validation-MAE curve

Loss and Validation Curve

The frozen router kept perfect skill accuracy while validation parameter loss moved downward. LoRA preserved classification, but parameter precision was less stable.
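The two curves come from one combined objective: cross-entropy on the skill logits plus 0.35 times the mean squared error on the six parameters. A dependency-free, single-example version is sketched below. The function names are illustrative, and batch reduction (assumed to be a mean) is left out.

```python
import math
from typing import Sequence

PARAM_LOSS_WEIGHT = 0.35  # weight on the parameter term in the router sketch


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Negative log-probability of the target class under a softmax."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def router_loss(skill_logits, skill_id, params, target_params) -> float:
    return (cross_entropy(skill_logits, skill_id)
            + PARAM_LOSS_WEIGHT * mse(params, target_params))


# Uniform logits over four skills give ln(4); each parameter is off by 0.1.
uniform = router_loss([0.0] * 4, 2, [0.0] * 6, [0.1] * 6)
# A confident, correct skill prediction shrinks the classification term.
confident = router_loss([0.0, 0.0, 5.0, 0.0], 2, [0.0] * 6, [0.1] * 6)
```

Because the parameter term is down-weighted, skill accuracy can saturate early while the parameter loss keeps falling, which is consistent with the behavior described for the frozen router.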

Frozen versus LoRA router parameter mean absolute error by parameter

Router Parameter MAE

The LoRA router's biggest regression was coordinate precision, especially target_y. In physics, a small coordinate error becomes a visible miss.
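A per-parameter breakdown like this can be computed with a short helper. The sketch below is illustrative: the prediction and target rows are made-up numbers, and treating the reported mean MAE as the average over the six columns is an assumption.

```python
from typing import Dict, Sequence

PARAM_NAMES = ["target_x", "target_y", "target_z",
               "radius", "speed_hint", "object_is_berry"]


def per_parameter_mae(
    preds: Sequence[Sequence[float]],
    targets: Sequence[Sequence[float]],
) -> Dict[str, float]:
    """Mean absolute error per parameter column, plus the mean over columns."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("need equal, non-empty prediction and target lists")
    n = len(preds)
    report = {}
    for col, name in enumerate(PARAM_NAMES):
        report[name] = sum(abs(p[col] - t[col]) for p, t in zip(preds, targets)) / n
    report["mean"] = sum(report[name] for name in PARAM_NAMES) / len(PARAM_NAMES)
    return report


# Made-up rows where only target_y is wrong.
preds = [[0.5, 0.2, 0.0, 0.3, 0.5, 1.0], [0.1, 0.6, 0.0, 0.3, 0.5, 0.0]]
targets = [[0.5, 0.0, 0.0, 0.3, 0.5, 1.0], [0.1, 0.2, 0.0, 0.3, 0.5, 0.0]]
report = per_parameter_mae(preds, targets)
```

In this toy case a `target_y` error of 0.3 is diluted to a mean of 0.05, which shows why a single averaged MAE can hide the one coordinate that causes visible misses.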

Final active skill proof graph

Final Active Skills

The demo stack combines skills that each have closed-loop proof: walk and run from MuJoCo articulated policies, manipulation from the MiniCPM-V LoRA/residual proof lanes, and the frozen router for command selection.

Router internals

The router is the demo-safe VLA layer

| Part | Role in the VLA | Why it matters |
| --- | --- | --- |
| Frozen MiniCPM-V 4.6 | Produces pooled vision-language hidden-state features from the image and instruction. | Protects the pretrained visual-language representation while the action interface is still being tested. |
| Robot state constructor | Builds navigation, object, hand, mouth, previous-action, task, stage, grasp, and eaten features. | Gives the model proprioception that cannot be trusted from pixels alone. |
| MLP trunk | Fuses the 1024-d MiniCPM-V embedding with the state vector through SiLU layers. | Keeps training fast and interpretable on RunPod GPUs. |
| Skill head | Predicts walk_to, run_around, pick_up, or find_and_eat_berry. | Maps language into a bounded action vocabulary with MP4 proof. |
| Parameter head | Predicts grounded target and behavior parameters. | Lets the same skill adapt to the ball, berry, viewer camera, or marker. |
| Stabilizers | Clip output ranges, copy explicit scene targets, and prefer heuristic skill for explicit commands unless forced neural. | Makes learned routing robust enough for a live demo. |

# Trainable sizes in the artifact
router_head = 814_090            # 1066 -> 512 -> 512 -> skill(4) + params(6)
residual_action_head = 1_078_912 # VL branch + state branch + 10x32 action chunk
lora_adapter = 4_743_168         # rank 8 adapters on q/k/v/o/gate/up/down, layers 0-23

# Router output contract
skills = ["walk_to", "run_around", "pick_up", "find_and_eat_berry"]
params = ["target_x", "target_y", "target_z", "radius", "speed_hint", "object_is_berry"]

What failed

The failed paths explain the final architecture

| Experiment | Result | Lesson |
| --- | --- | --- |
| Single-step manipulation | pick_up 0/20, go_eat_berry 0/20 | One action target averaged across approach, reach, descend, close, lift, and mouth-transfer phases. |
| First mixed manifest | pick/eat 2/8 each, run 8/8, go_to 7/8 | The JSONL-to-head path worked, but contact-heavy data was too sparse. |
| All-skill direct low-level head | pick 2/2, run 2/2, eat 0/2, go_to 0/2 | A single low-level controller could not cover contact manipulation and locomotion reliably. |
| Direct go_to MiniCPM-V variants | 1-step, root-velocity, and recovery-root-velocity variants stayed at 0/5 | Closed-loop navigation exposed root drift and saturation that offline loss did not solve. |
| LoRA router promotion | perfect skill accuracy, but eval MAE 0.0629 vs frozen 0.0170 | Backbone adaptation was real, but the frozen router was numerically safer for the demo. |

Simulation and hardware

MuJoCo proof, NVIDIA Newton lane, RunPod training, Modal inference

MuJoCo is the proof source for the shipped policy gallery: eval JSON, MP4/GIF rollouts, body render checks, and Toy Room trajectory retargeting. The NVIDIA Newton lane is documented as the GPU physics scaling route: Fire Boy asset load test, CUDA rollout, qpos/action trace export, and MP4/USD/NPZ artifacts. The current public proof remains MuJoCo-backed, with Newton/Warp as the next simulation-throughput path.

  • RunPod RTX 6000 Ada: frozen residual MiniCPM-V manipulation head, 2048 rows, 3/3 pick and 3/3 eat.
  • RunPod NVIDIA A40: frozen and LoRA skill-parameter router experiments.
  • Modal L40S: live MiniCPM-o 4.5 WebSocket runtime with Volume cache and Hugging Face Secret.
  • Local app: FastAPI/Gradio-style server, Three.js rig, CANNON interactions, policy gallery, browser screenshots, and PDF delivery.

Sampling and parameters

The model was trained around visible state, not hidden magic

The router rows pair a rendered observation with the instruction, a normalized robot-state vector, a skill id, and six grounded parameters. Contact tasks needed more data than navigation because pickup and eating pass through several phases: approach, reach above, descend, close, lift, transfer to mouth, and verify object state.

That is why direct one-step manipulation failed at 0/20, while chunked action policies reached 20/20. The final presentation keeps the router as the demo-safe VLA layer and treats low-level action heads as skill-specific research lanes until they pass broader randomized closed-loop tests.
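Chunked labels can be built from a recorded rollout by pairing every frame with the next ten actuator targets. The sketch below assumes one label per frame and pads the tail of the episode by repeating the final action. Both choices, and the function name, are assumptions; the 10 x 32 shape comes from the action-chunk head.

```python
from typing import List

CHUNK_LEN = 10      # steps per chunk
NUM_ACTUATORS = 32  # actuator targets per step


def make_chunk_labels(
    actions: List[List[float]], chunk_len: int = CHUNK_LEN
) -> List[List[List[float]]]:
    """For each timestep t, return actions[t : t + chunk_len].

    Windows that run past the end of the episode are padded by repeating
    the final action, so every label has the same shape.
    """
    if not actions:
        return []
    last = actions[-1]
    chunks = []
    for t in range(len(actions)):
        window = actions[t:t + chunk_len]
        window = window + [last] * (chunk_len - len(window))
        chunks.append([list(step) for step in window])
    return chunks


# Toy 12-step trajectory: step t commands the value t on every actuator.
trajectory = [[float(t)] * NUM_ACTUATORS for t in range(12)]
chunks = make_chunk_labels(trajectory)
```

Each label now carries the upcoming phase transition (for example, descend followed by close) instead of a single target, which is the temporal context the one-step head lacked.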

| Dataset / head | Size | What it taught us |
| --- | --- | --- |
| First four-skill manifest | 64 episodes, 2,368 image rows | The JSONL and image pipeline worked, but manipulation was underrepresented. |
| Focused manipulation manifest | 144 episodes, 6,192 rows | More contact-phase examples turned pick/eat from brittle into reliable eval behaviors. |
| Skill-parameter router manifest | 3,072 rows | Balanced high-level routing over pick_up, go_eat_berry, run_around, and go_to_point. |
| Frozen router head | 814,090 trainable parameters | Small enough to inspect and stable enough to promote for the live demo. |
| LoRA router adapter | 4,743,168 trainable parameters | Proved adaptation works, but coordinate MAE was worse than the frozen head in this run. |

Policy training

How the policy was trained and why the design changed

The training loop started with behavior cloning because it gives a stable first signal: image, instruction, state, and the demonstrated action that solved the rollout. That worked for route selection and simple movement, but contact manipulation exposed the hard part of embodiment. A single next-action label averaged across approach, contact, lift, and mouth-transfer phases, so the controller looked reasonable offline but failed in closed-loop physics. The fix was to train short action chunks and then promote a skill-parameter router for the live demo.
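The averaging effect is easy to reproduce in a toy setting. Suppose a phase-blind policy sees the same input during approach and grasp, while the demonstrated gripper command is +1.0 (open) in one phase and -1.0 (close) in the other. The encoding and numbers are invented for illustration. The point is only that the squared-error-optimal single prediction is the mean of the labels, which matches neither phase.

```python
from statistics import mean
from typing import List


def mse(pred: float, labels: List[float]) -> float:
    return mean([(pred - y) ** 2 for y in labels])


# Made-up labels: five approach frames (open) and five grasp frames (close)
# that a phase-blind policy cannot tell apart.
labels = [1.0] * 5 + [-1.0] * 5

candidates = [x / 10 for x in range(-10, 11)]  # -1.0, -0.9, ..., 1.0
best = min(candidates, key=lambda c: mse(c, labels))

print(best)               # 0.0
print(mse(best, labels))  # 1.0
print(mse(1.0, labels))   # 2.0
```

The lowest-loss answer is a half-open gripper that never grasps. The offline loss looks acceptable while the closed-loop behavior fails, which is the pattern the single-step manipulation runs showed.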

| Stage | Training signal | Reason for keeping or changing it |
| --- | --- | --- |
| Behavior cloning baseline | Rendered rollout image, language command, robot state, next action. | Fast, interpretable, and enough to prove the data path, but too phase-blind for manipulation. |
| Action chunks | 10-step x 32-actuator targets for contact-heavy pick/eat rollouts. | Preserved temporal phase, turning manipulation from 0/20 into 20/20 in the focused eval. |
| Residual VLA head | MiniCPM-V embedding plus state controller plus residual action branch. | Let vision-language features adjust the state policy without letting the whole controller drift. |
| Skill-parameter router | Four skill labels plus six continuous parameters. | Most robust public-demo layer: it routes to proved skills instead of asking one head to solve every motor regime. |
| Future RL layer | Task success, contact, smoothness, energy, time, curiosity, and personality rewards. | Useful only after the sim lane can generate enough rollouts and failures; otherwise RL optimizes brittle shortcuts. |

VLA design space

Other ways to convert a multimodal model into a VLA

The current build uses a frozen MiniCPM-V encoder plus small action heads because that was the safest route for a hackathon demo. It is not the only way to build a VLA, nor is it the most optimized one. A larger version should test multiple conversion strategies against the same closed-loop physics suite.

| Conversion method | What changes | When it is better |
| --- | --- | --- |
| Frozen VLM + action head | Freeze MiniCPM-V, train only router/action MLPs. | Best first pass: cheap, stable, debuggable, and easy to run on modest GPUs. |
| LoRA / QLoRA adapter | Train low-rank adapters in the language/vision-language stack plus action heads. | Useful when the model must internalize action semantics, but it needs better numeric grounding checks. |
| Action-token fine-tuning | Represent skills, coordinates, or joint chunks as tokens and fine-tune the model to emit them. | Good for a single brain/controller interface, especially if the runtime can validate structured tokens. |
| Diffusion or flow action head | Generate action chunks as trajectories rather than single deterministic vectors. | Better for multimodal behavior where several actions are valid, such as playful motion or grasp approach. |
| World model + planner | Learn latent dynamics, predict future states, and plan before acting. | Better for long-horizon pet behavior, habits, curiosity, and multi-toy interactions. |
| RL after imitation | Start from behavior cloning, then improve in simulation using rewards. | Best once Newton/MuJoCo rollouts are fast enough to find many failures and recoveries. |

Single brain roadmap

From tiny demo to full virtual-pet controller

A future MiniCPM-V or MiniCPM-o style system could become the single high-level pet brain: see the toy room, understand speech, remember prior interactions, choose goals, select motor skills, and explain its own behavior. The fully fine-tuned route would train the model on multimodal episodes with observation frames, state summaries, action traces, rewards, and personality labels, then ask it to emit a structured action plan or compact action tokens.
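Compact action tokens could be produced by binning each coordinate into a fixed vocabulary. The sketch below is one possible encoding, not a format used in the current build: the 256-bin resolution, the [-2, 2] coordinate range, and the `<skill:...>` / `<x:...>` token syntax are all assumptions.

```python
from typing import List, Tuple

NUM_BINS = 256  # assumed vocabulary size per coordinate


def encode(value: float, low: float, high: float, bins: int = NUM_BINS) -> int:
    """Map a value in [low, high] to a bin index in [0, bins - 1]."""
    clipped = max(low, min(high, value))
    return min(bins - 1, int((clipped - low) / (high - low) * bins))


def decode(token: int, low: float, high: float, bins: int = NUM_BINS) -> float:
    """Return the center of the bin as the reconstructed value."""
    return low + (token + 0.5) * (high - low) / bins


def to_action_tokens(
    skill: str,
    target: Tuple[float, float, float],
    low: float = -2.0,
    high: float = 2.0,
) -> List[str]:
    coords = [f"<{axis}:{encode(v, low, high)}>" for axis, v in zip("xyz", target)]
    return [f"<skill:{skill}>"] + coords


tokens = to_action_tokens("pick_up", (0.5, -2.0, 3.0))
print(tokens)  # ['<skill:pick_up>', '<x:160>', '<y:0>', '<z:255>']

recovered_x = decode(160, -2.0, 2.0)
print(recovered_x)  # 0.5078125
```

Decoding to the bin center bounds the round-trip error at half a bin width (here 4 / 256 / 2, about 0.0078), so the bin count sets a floor on coordinate precision that any token-emitting model inherits.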

The most practical architecture is still hierarchical. A slow brain handles language, personality, memory, and goal choice. A fast local policy handles millisecond reflexes: balance, hand motion, contact correction, and object attachment. Modal or another serverless GPU lane can run the slow multimodal brain, while a distilled local controller keeps the toy responsive.
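The split can be illustrated with a toy loop in which an expensive goal-setting call runs only every few ticks while a cheap reflex step runs on every tick. Everything here is a simplification made for illustration: the one-dimensional body, the proportional reflex, the `Goal` record, and the tick counts are assumptions, not the planned controller.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Goal:
    skill: str
    target_x: float


def run_hierarchy(
    slow_brain: Callable[[float], Goal],
    ticks: int,
    slow_every: int,
    gain: float = 0.2,
) -> List[float]:
    """Step a 1-D toy body.

    The slow brain refreshes the goal every `slow_every` ticks; the reflex
    step runs on every tick against the most recent cached goal.
    """
    x = 0.0
    goal = slow_brain(x)  # initial deliberate decision
    trace = []
    for tick in range(1, ticks + 1):
        if tick % slow_every == 0:
            goal = slow_brain(x)         # infrequent, expensive call
        x += gain * (goal.target_x - x)  # cheap proportional reflex
        trace.append(x)
    return trace


calls: List[float] = []


def toy_brain(x: float) -> Goal:
    """Stand-in for the slow model; records when it was consulted."""
    calls.append(x)
    return Goal(skill="walk_to", target_x=1.0)


trace = run_hierarchy(toy_brain, ticks=50, slow_every=10)
```

Over 50 ticks the body converges on the target while the slow brain is consulted only six times (once at start, then every tenth tick). The reflex never blocks on a model call, which is the property the hierarchy is meant to protect.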

| Layer | Latency target | Optimization path |
| --- | --- | --- |
| Fast reflex policy | milliseconds | Distilled MLP/transformer head, quantized ONNX/WebGPU/WASM, cached state, no image model in the inner loop. |
| Skill router | tens of milliseconds when warm | Frozen MiniCPM features cached per frame, small head local or on a warm GPU worker. |
| Pet brain | sub-second to seconds | Modal warm containers, model weight cache, prompt/state compression, streaming responses, tool-call validation. |
| Training factory | offline | Run many MuJoCo/Newton rollouts, mine failures, relabel with success/failure, retrain adapters and skill heads. |

Codex throughout

OpenAI Codex helped turn the research loop into a presentable system

The build history shows a steady chain of OpenAI Codex-assisted work: shipping Toy Room v3, adding the Fire Boy command loop, wiring MiniCPM-V action paths, routing Toy Room v3 through Modal MiniCPM, adding brain trace diagnostics, hardening Modal WebSocket timeouts, keeping the MiniCPM-V loop live, making locomotion and pickup physical, grounding pickup targets, adding gestures, generating screenshots, and packaging this paper. The value was not a single magic patch; it was iteration speed with evidence discipline.

| Codex role | Concrete contribution |
| --- | --- |
| Scaffolding | Routes, frontend pages, training scripts, policy-gallery wiring, static assets, and local verification loops. |
| Experiment hygiene | JSON summaries, runbooks, artifact paths, screenshots, proof bundle validation, and paper/page generation. |
| Debugging | Modal timeout hardening, action contract validation, grounding fixes, retarget bridge checks, and browser QA. |
| Demo polish | Page directory, unique screenshots, PDF download/open controls, and a coherent narrative tying Modal, VLA, Newton, RunPod, and MiniCPM together. |

Future scope

Where this goes beyond the current virtual-pet level

The current result is a practical VLA demo: MiniCPM-V sees the scene, the router grounds the command, and proof-backed policies make Fire Boy move, pick up, and eat in the Toy Room. The next research step is not a bigger page; it is broader closed-loop evidence with randomized objects, randomized rooms, longer tasks, and a Newton/Warp GPU rollout lane that can generate enough failures to train against.

References

Primary sources reflected in the paper

| Reference | Why it is in the artifact |
| --- | --- |
| MiniCPM-V 4.6 model card | Backbone for frozen vision-language features and LoRA experiments. |
| Segment Anything | SAM-style segmentation is part of the avatar extraction and cleanup story. |
| MuJoCo | Simulator used for articulated body proof, closed-loop eval, and retarget traces. |
| NVIDIA Newton | Future GPU physics lane for scalable rollout generation and USD/MJCF-compatible traces. |
| Modal | Serverless GPU/runtime lane for the live MiniCPM-o action gateway. |
| OpenAI Codex | Agentic coding partner used throughout scaffolding, debugging, screenshot generation, and packaging. |