Punching Through NVIDIA NemoClaw's Sandbox to Hit Local vLLM on RTX 5090
Disclaimer: This is an experimental build, not a production setup. NemoClaw is early-stage, the network hacks are ugly, and everything is volatile. I'm documenting this because I couldn't find anyone else doing it — not because I recommend it.
What Is NemoClaw?
NVIDIA NemoClaw (also known as OpenShell) is a sandboxed execution environment for AI agents. It runs a k3s cluster inside Docker, creates isolated sandbox namespaces, and lets AI agents execute code safely — file operations, shell commands, network requests — all inside a locked-down container.
The intended workflow: your agent talks to NVIDIA's cloud inference API. The sandbox allows outbound HTTPS to integrate.api.nvidia.com and blocks almost everything else.
But what if you have an RTX 5090 sitting right there on the host, running vLLM with a perfectly good Nemotron 9B? Paying for cloud inference when your local GPU is idle feels wrong.
So I made the sandbox talk to my local vLLM instead. It took three layers of hacking.
The Network Topology
```
WSL2 Host (172.21.73.126)
  vLLM on 0.0.0.0:8000
  Docker bridge: 172.18.0.1
        |
        v
openshell-cluster (172.18.0.2)   [k3s cluster]
  Pod main namespace (10.200.0.1)
    TCP relay -> 172.18.0.1:8000
        |
        v
  Sandbox namespace (10.200.0.2)
    ~/ask, ~/review, opencode -> 10.200.0.1:8000
```
The sandbox lives inside a network namespace inside a pod inside a k3s cluster inside a Docker container. Four levels of isolation between it and the host's vLLM port.
Layer 1: Host iptables
Docker's DOCKER-USER chain blocks cross-bridge traffic by default. The k3s container can't reach the host's vLLM port without explicit permission:
```shell
sudo iptables -I DOCKER-USER 1 -i br-dfb8ee9c1f32 -p tcp --dport 8000 -j ACCEPT
sudo iptables -I FORWARD 1 -i br-dfb8ee9c1f32 -o eth0 -p tcp --dport 8000 -j ACCEPT
```
This opens the door from the Docker bridge network to vLLM. It's volatile: the rules reset on every WSL2 restart.
Layer 2: Network Policy + TCP Relay
The sandbox can only reach endpoints explicitly listed in its network policy. By default, that's just `integrate.api.nvidia.com:443`.
First, update the policy to allow local addresses:
```yaml
nvidia_inference:
  endpoints:
    - { host: integrate.api.nvidia.com, port: 443 }
    - { host: 10.200.0.1, port: 8000 }
    - { host: 172.18.0.1, port: 8000 }
```

```shell
openshell policy set master-impala -g nemoclaw --policy policy.yaml --wait --timeout 30
```
But the sandbox namespace (10.200.0.2) can't reach the Docker bridge (172.18.0.1) directly — they're in different network namespaces. So I run a TCP relay in the pod's main namespace, bridging the two:
```python
# relay.py - runs in pod main namespace (10.200.0.1)
server.bind(("10.200.0.1", 8000))
# Forwards to Docker bridge -> host vLLM
backend.connect(("172.18.0.1", 8000))
```
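The snippet above is only the two interesting lines. For completeness, here is a minimal sketch of what a relay like this looks like end to end. The function and variable names are mine, not from the actual relay.py; addresses are parameters so the same code can be exercised anywhere.

```python
# Minimal threaded TCP relay sketch: accept on one address, pipe bytes
# both ways to a backend. Blocking sockets, one thread per direction.
import socket
import threading


def pipe(src: socket.socket, dst: socket.socket) -> None:
    """Copy bytes one direction until EOF, then half-close the peer."""
    try:
        while chunk := src.recv(65536):
            dst.sendall(chunk)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass


def serve(listen: tuple[str, int], backend: tuple[str, int]) -> None:
    """Accept clients on `listen` and relay each to `backend`."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(listen)
    server.listen(64)
    while True:
        client, _ = server.accept()
        upstream = socket.create_connection(backend)
        threading.Thread(target=pipe, args=(client, upstream), daemon=True).start()
        threading.Thread(target=pipe, args=(upstream, client), daemon=True).start()


# In the pod's main namespace this would run as:
#   serve(("10.200.0.1", 8000), ("172.18.0.1", 8000))
```

One thread pair per connection is plenty here: the sandbox issues a handful of concurrent inference requests at most.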
Layer 3: Sandbox iptables Injection
Even with the policy updated and relay running, the sandbox's own iptables OUTPUT chain has a blanket REJECT rule. We need to inject an ACCEPT rule before it:
```shell
SANDBOX_PID=$(docker exec openshell-cluster-nemoclaw kubectl exec master-impala -n openshell -- cat /var/run/sandbox.pid)
docker exec openshell-cluster-nemoclaw kubectl exec master-impala -n openshell -- nsenter -t $SANDBOX_PID -n iptables -I OUTPUT 1 -d 10.200.0.1 -p tcp --dport 8000 -j ACCEPT
```
Now the full path works: sandbox -> relay -> Docker bridge -> host vLLM.
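A quick smoke test from inside the sandbox confirms the path. vLLM serves an OpenAI-compatible API, so a plain chat completion request suffices (the model name here is a placeholder; use whatever name your vLLM instance was launched with):

```shell
curl -s http://10.200.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "nemotron", "messages": [{"role": "user", "content": "ping"}]}'
```

If this returns a JSON completion instead of hanging or a connection refused, all three layers are working.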
The Hard Part: Making Tool Calls Actually Work
Getting inference responses from the sandbox was only half the battle. The real challenge was making the AI agent (opencode) execute tools — file read/write, shell commands — through local inference.
The Problem
Nemotron 9B emits tool calls as raw text inside its response:
```
<TOOLCALL>[{"name":"read_file","arguments":{"path":"app.py"}}]</TOOLCALL>
```
But AI coding agents like opencode expect OpenAI-compatible structured `tool_calls` objects in the API response. There's a mismatch at two levels:
- **With a `tools` parameter:** When a client sends a `tools` parameter in the API request, vLLM can use a custom tool parser plugin to convert the text. I wrote a parser registered via `@ToolParserManager.register_module(name="nemotron_toolcall")` that extracts `<TOOLCALL>` blocks and returns structured tool call objects. This works for direct API calls (e.g. `curl` with `tools` in the request body).
- **Without a `tools` parameter:** opencode doesn't send `tools` as an API parameter; it embeds tool definitions in the system prompt instead. This means vLLM's parser never activates, and the `<TOOLCALL>` text comes back as plain `content`.
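The conversion itself is the same in both cases: find `<TOOLCALL>` blocks, parse the JSON array inside, and emit OpenAI-style tool call objects. Here's a hedged sketch of that core step; `make_tool_calls` is my name for illustration, not vLLM's plugin interface.

```python
# Convert Nemotron's <TOOLCALL>[...]</TOOLCALL> text into OpenAI-style
# structured tool_calls, returning the cleaned content alongside them.
import json
import re
import uuid

TOOLCALL_RE = re.compile(r"<TOOLCALL>(.*?)</TOOLCALL>", re.DOTALL)


def make_tool_calls(text: str) -> tuple[str, list[dict]]:
    """Return (content stripped of TOOLCALL blocks, structured tool calls)."""
    calls = []
    for match in TOOLCALL_RE.finditer(text):
        for call in json.loads(match.group(1)):
            calls.append({
                "id": f"call_{uuid.uuid4().hex[:8]}",
                "type": "function",
                "function": {
                    "name": call["name"],
                    # OpenAI clients expect `arguments` as a JSON *string*
                    "arguments": json.dumps(call.get("arguments", {})),
                },
            })
    content = TOOLCALL_RE.sub("", text).strip()
    return content, calls
```

The `arguments`-as-string detail matters: clients like opencode will `JSON.parse` that field themselves and choke on a raw object.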
The Solution: A Gateway That Rewrites SSE Streams
A gateway server sits between the agent and vLLM:
```
opencode -> Gateway (:8000) -> vLLM (:8100)
```
The gateway buffers the streaming SSE response, accumulates the `content` field across chunks, and checks for `<TOOLCALL>` patterns. When detected, it:

- Strips the `<TOOLCALL>` text from `content`
- Parses the JSON inside
- Injects structured `tool_calls` into the final SSE response
This means tool execution works regardless of whether the client sends tools in the request. The gateway also manages on-demand vLLM startup/shutdown to free VRAM when idle.
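The rewriting step can be sketched as a single function over the buffered content. This is a simplified standalone version, not the actual gateway code; field names follow the OpenAI streaming chunk shape, and `finish_reason` is switched to `tool_calls` so the client knows to execute tools rather than print text.

```python
# Build the final streaming chunk from the accumulated content: if a
# <TOOLCALL> block is present, strip it and surface structured tool_calls.
import json
import re

TOOLCALL_RE = re.compile(r"<TOOLCALL>(.*?)</TOOLCALL>", re.DOTALL)


def rewrite_final_chunk(accumulated: str, model: str = "nemotron") -> dict:
    match = TOOLCALL_RE.search(accumulated)
    if not match:
        # No tool call detected: pass the text through unchanged
        return {"model": model, "choices": [{
            "index": 0,
            "delta": {"content": accumulated},
            "finish_reason": "stop",
        }]}
    tool_calls = [
        {"index": i, "id": f"call_{i}", "type": "function",
         "function": {"name": call["name"],
                      "arguments": json.dumps(call.get("arguments", {}))}}
        for i, call in enumerate(json.loads(match.group(1)))
    ]
    remaining = TOOLCALL_RE.sub("", accumulated).strip()
    return {"model": model, "choices": [{
        "index": 0,
        "delta": {"content": remaining or None, "tool_calls": tool_calls},
        "finish_reason": "tool_calls",
    }]}
```

In the real gateway this sits behind the SSE buffering loop: intermediate chunks are held back until enough content has arrived to rule a `<TOOLCALL>` in or out.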
The Result
With the network hacks and the gateway in place, the opencode agent inside the sandbox can:
- Read and write files via tool calls
- Execute shell commands
- Iterate on code with multi-turn tool use
All powered by local Nemotron 9B on the RTX 5090, with zero cloud API calls.
What You Get
Once all three layers are in place:
```shell
# Inside the sandbox
~/ask "Explain PagedAttention in 3 sentences"
# -> local RTX 5090 inference

~/review app.py
# -> code review via local Nemotron

opencode
# -> full AI coding agent with tool execution, local GPU
```
Zero cloud API calls. The sandbox still isolates code execution — the agent can't escape to the host filesystem — but inference stays local.
The Catch: Everything Is Volatile
| Component | Survives restart? |
|---|---|
| Host iptables | No |
| TCP relay | No |
| Sandbox iptables | No |
| Network policy | Yes |
| Sandbox files | No |
A startup script that re-applies all three layers is essential.
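Mine looks roughly like this. The bridge interface, container, and pod names are from my setup, and the relay path (`/opt/relay.py`) is a placeholder; adjust all of them before use.

```shell
#!/usr/bin/env bash
# restore-local-inference.sh - re-applies the three volatile layers
# after a WSL2 restart. Assumes vLLM is already running on the host.
set -euo pipefail

BRIDGE=br-dfb8ee9c1f32
CLUSTER=openshell-cluster-nemoclaw
POD=master-impala

# Layer 1: host iptables (wiped on every WSL2 restart)
sudo iptables -I DOCKER-USER 1 -i "$BRIDGE" -p tcp --dport 8000 -j ACCEPT
sudo iptables -I FORWARD 1 -i "$BRIDGE" -o eth0 -p tcp --dport 8000 -j ACCEPT

# Layer 2: restart the TCP relay in the pod's main namespace
docker exec "$CLUSTER" kubectl exec "$POD" -n openshell -- \
  sh -c 'nohup python3 /opt/relay.py >/tmp/relay.log 2>&1 &'

# Layer 3: re-inject the sandbox ACCEPT rule
PID=$(docker exec "$CLUSTER" kubectl exec "$POD" -n openshell -- cat /var/run/sandbox.pid)
docker exec "$CLUSTER" kubectl exec "$POD" -n openshell -- \
  nsenter -t "$PID" -n iptables -I OUTPUT 1 -d 10.200.0.1 -p tcp --dport 8000 -j ACCEPT
```

The network policy survives restarts, so it stays out of the script; sandbox files (`~/ask`, `~/review`) still need to be re-provisioned separately.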