PatentLLM Blog

GPU Inference

Punching Through NVIDIA NemoClaw's Sandbox to Hit Local vLLM on RTX 5090

Disclaimer: This is an experimental build, not a production setup. NemoClaw is early-stage, the network hacks are ugly, and everything is volatile. I'm documenting this because I couldn't find anyone else doing it — not because I recommend it.

What Is NemoClaw?

NVIDIA NemoClaw (also known as OpenShell) is a sandboxed execution environment for AI agents. It runs a k3s cluster inside Docker, creates isolated sandbox namespaces, and lets AI agents execute code safely — file operations, shell commands, network requests — all inside a locked-down container.

The intended workflow: your agent talks to NVIDIA's cloud inference API. The sandbox allows outbound HTTPS to integrate.api.nvidia.com and blocks almost everything else.

But what if you have an RTX 5090 sitting right there on the host, running vLLM with a perfectly good Nemotron 9B? Paying for cloud inference when your local GPU is idle feels wrong.

So I made the sandbox talk to my local vLLM instead. It took three layers of hacking.

The Network Topology

WSL2 Host (172.21.73.126)
  vLLM on 0.0.0.0:8000
  Docker bridge: 172.18.0.1
      |
      v
  openshell-cluster (172.18.0.2)
    k3s cluster
      Pod main namespace (10.200.0.1)
        TCP relay -> 172.18.0.1:8000
          |
          v
      Sandbox namespace (10.200.0.2)
        ~/ask, ~/review, opencode
        -> 10.200.0.1:8000

The sandbox lives inside a network namespace inside a pod inside a k3s cluster inside a Docker container. Four levels of isolation between it and the host's vLLM port.

Layer 1: Host iptables

Docker's DOCKER-USER chain blocks cross-bridge traffic by default. The k3s container can't reach the host's vLLM port without explicit permission:

sudo iptables -I DOCKER-USER 1 \
  -i br-dfb8ee9c1f32 -p tcp --dport 8000 -j ACCEPT

sudo iptables -I FORWARD 1 \
  -i br-dfb8ee9c1f32 -o eth0 -p tcp --dport 8000 -j ACCEPT

This opens the door from the Docker bridge network to vLLM. Volatile — resets on WSL2 restart.
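
To verify the rule took effect, a plain TCP connect from inside the openshell-cluster container is enough. A minimal check in Python (stdlib only; the IP is the bridge gateway from the topology above):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run from inside the k3s container:
#   can_connect("172.18.0.1", 8000)  -> True once the rule is in place
```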

Layer 2: Network Policy + TCP Relay

The sandbox can only reach endpoints explicitly listed in its network policy. By default, that's just integrate.api.nvidia.com:443.

First, update the policy to allow local addresses:

nvidia_inference:
  endpoints:
    - { host: integrate.api.nvidia.com, port: 443 }
    - { host: 10.200.0.1, port: 8000 }
    - { host: 172.18.0.1, port: 8000 }

Then apply it:

openshell policy set master-impala -g nemoclaw \
  --policy policy.yaml --wait --timeout 30

But the sandbox namespace (10.200.0.2) can't reach the Docker bridge (172.18.0.1) directly — they're in different network namespaces. So I run a TCP relay in the pod's main namespace, bridging the two:

# relay.py — runs in pod main namespace (10.200.0.1)
server.bind(("10.200.0.1", 8000))
# Forwards to Docker bridge -> host vLLM
backend.connect(("172.18.0.1", 8000))
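
The snippet above elides the accept/forward loop. A fuller stdlib-only sketch of the relay (threaded, unhardened; the IPs are the ones from the topology diagram):

```python
# relay.py — minimal threaded TCP relay (sketch, not production-grade)
import socket
import threading

LISTEN = ("10.200.0.1", 8000)   # pod main namespace, reachable from the sandbox
BACKEND = ("172.18.0.1", 8000)  # Docker bridge gateway -> host vLLM

def pipe(src: socket.socket, dst: socket.socket) -> None:
    """Copy bytes one way until EOF, then close the write side."""
    try:
        while data := src.recv(65536):
            dst.sendall(data)
    except OSError:
        pass
    finally:
        try:
            dst.shutdown(socket.SHUT_WR)
        except OSError:
            pass

def handle(client: socket.socket) -> None:
    """Bridge one client connection to the backend, both directions."""
    with socket.create_connection(BACKEND) as backend, client:
        threading.Thread(target=pipe, args=(client, backend), daemon=True).start()
        pipe(backend, client)

def serve() -> None:
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind(LISTEN)
    server.listen(64)
    while True:
        client, _ = server.accept()
        threading.Thread(target=handle, args=(client,), daemon=True).start()

# In the pod's main namespace: serve()
```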

Layer 3: Sandbox iptables Injection

Even with the policy updated and relay running, the sandbox's own iptables OUTPUT chain has a blanket REJECT rule. We need to inject an ACCEPT rule before it:

SANDBOX_PID=$(docker exec openshell-cluster-nemoclaw \
  kubectl exec master-impala -n openshell -- \
  cat /var/run/sandbox.pid)

docker exec openshell-cluster-nemoclaw \
  kubectl exec master-impala -n openshell -- \
  nsenter -t $SANDBOX_PID -n \
  iptables -I OUTPUT 1 -d 10.200.0.1 -p tcp --dport 8000 -j ACCEPT

Now the full path works: sandbox -> relay -> Docker bridge -> host vLLM.

The Hard Part: Making Tool Calls Actually Work

Getting inference responses from the sandbox was only half the battle. The real challenge was making the AI agent (opencode) execute tools — file read/write, shell commands — through local inference.

The Problem

Nemotron 9B outputs tool calls as raw text in its response:

<TOOLCALL>[{"name":"read_file","arguments":{"path":"app.py"}}]</TOOLCALL>

But AI coding agents like opencode expect OpenAI-compatible structured tool_calls objects in the API response. There's a mismatch at two levels:

  1. With tools parameter: When a client sends a tools parameter in the API request, vLLM can use a custom tool parser plugin to convert the text. I wrote a parser registered via @ToolParserManager.register_module(name="nemotron_toolcall") that extracts <TOOLCALL> blocks and returns structured tool call objects. This works for direct API calls (e.g. curl with tools in the request body).

  2. Without tools parameter: opencode doesn't send tools as an API parameter — it embeds tool definitions in the system prompt instead. This means vLLM's parser never activates, and the <TOOLCALL> text comes back as plain content.
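
Both paths hinge on the same extraction step: pull the <TOOLCALL> block out of the text and parse its JSON payload. Stripped of the vLLM plugin scaffolding, the core logic looks roughly like this (the function name is mine, not vLLM's API):

```python
import json
import re

TOOLCALL_RE = re.compile(r"<TOOLCALL>(.*?)</TOOLCALL>", re.DOTALL)

def extract_tool_calls(text: str) -> tuple[str, list[dict]]:
    """Split model output into (plain content, parsed tool calls).

    Follows the Nemotron convention shown above:
    <TOOLCALL>[{"name": ..., "arguments": {...}}]</TOOLCALL>
    """
    calls: list[dict] = []
    for match in TOOLCALL_RE.finditer(text):
        calls.extend(json.loads(match.group(1)))
    content = TOOLCALL_RE.sub("", text).strip()
    return content, calls
```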

The Solution: A Gateway That Rewrites SSE Streams

A gateway server sits between the agent and vLLM:

opencode -> Gateway (:8000) -> vLLM (:8100)

The gateway buffers the streaming SSE response, accumulates the content field across chunks, and checks for <TOOLCALL> patterns. When detected, it:

  1. Strips the <TOOLCALL> block from the text content.

  2. Parses the embedded JSON into OpenAI-style tool_calls objects.

  3. Re-emits the stream as structured tool-call chunks, finishing with finish_reason: "tool_calls".

This means tool execution works regardless of whether the client sends tools in the request. The gateway also manages on-demand vLLM startup/shutdown to free VRAM when idle.
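
A toy version of that buffering step, operating on already-parsed content deltas rather than raw SSE bytes (names and shapes are illustrative, not the actual gateway code):

```python
import json
import re

TOOLCALL_RE = re.compile(r"<TOOLCALL>(.*?)</TOOLCALL>", re.DOTALL)

def rewrite_stream(content_deltas: list[str]) -> dict:
    """Buffer streamed text; if a <TOOLCALL> block appears (possibly split
    across chunks), emit an OpenAI-style tool_calls delta instead."""
    buffered = "".join(content_deltas)
    match = TOOLCALL_RE.search(buffered)
    if not match:
        # No tool call anywhere in the stream: pass the text through.
        return {"delta": {"content": buffered}, "finish_reason": "stop"}
    tool_calls = [
        {
            "index": i,
            "type": "function",
            "function": {
                "name": call["name"],
                "arguments": json.dumps(call.get("arguments", {})),
            },
        }
        for i, call in enumerate(json.loads(match.group(1)))
    ]
    return {"delta": {"tool_calls": tool_calls}, "finish_reason": "tool_calls"}
```

Buffering is what makes this work: the <TOOLCALL> tag itself can arrive split across SSE chunks, so pattern-matching chunk-by-chunk would miss it.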

The Result

With the network hacks and the gateway in place, the opencode agent inside the sandbox can read and write files, run shell commands, and work through multi-step coding tasks. All powered by local Nemotron 9B on the RTX 5090, with zero cloud API calls.

What You Get

Once all three layers are in place:

# Inside the sandbox
~/ask "Explain PagedAttention in 3 sentences"
# -> local RTX 5090 inference

~/review app.py
# -> code review via local Nemotron

opencode
# -> full AI coding agent with tool execution, local GPU

Zero cloud API calls. The sandbox still isolates code execution — the agent can't escape to the host filesystem — but inference stays local.

The Catch: Everything Is Volatile

Component          Survives restart?
-----------------  -----------------
Host iptables      No
TCP relay          No
Sandbox iptables   No
Network policy     Yes
Sandbox files      No

A startup script that re-applies all three layers is essential.
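
Mine is a small Python wrapper around the commands from the three layers. A sketch, dry-run by default (the /relay.py path inside the pod is my own convention; all other names are the ones used in this post):

```python
# restore.py — re-apply the three volatile layers after a WSL2 restart
import shlex
import subprocess

BRIDGE = "br-dfb8ee9c1f32"
CONTAINER = "openshell-cluster-nemoclaw"
POD = "master-impala"
IN_POD = ["docker", "exec", CONTAINER, "kubectl", "exec", POD, "-n", "openshell", "--"]

def restore_commands(sandbox_pid: str) -> list[list[str]]:
    """All commands needed to re-apply the three volatile layers, in order."""
    return [
        # Layer 1: host iptables (Docker bridge -> host vLLM)
        ["sudo", "iptables", "-I", "DOCKER-USER", "1", "-i", BRIDGE,
         "-p", "tcp", "--dport", "8000", "-j", "ACCEPT"],
        ["sudo", "iptables", "-I", "FORWARD", "1", "-i", BRIDGE, "-o", "eth0",
         "-p", "tcp", "--dport", "8000", "-j", "ACCEPT"],
        # Layer 2: restart the TCP relay in the pod's main namespace
        # (assumes relay.py was copied to /relay.py in the pod)
        ["docker", "exec", "-d", CONTAINER, "kubectl", "exec", POD,
         "-n", "openshell", "--", "python3", "/relay.py"],
        # Layer 3: re-inject the sandbox iptables ACCEPT rule
        IN_POD + ["nsenter", "-t", sandbox_pid, "-n", "iptables",
                  "-I", "OUTPUT", "1", "-d", "10.200.0.1",
                  "-p", "tcp", "--dport", "8000", "-j", "ACCEPT"],
    ]

def restore(sandbox_pid: str, dry_run: bool = True) -> None:
    """Print each command; execute only when dry_run=False."""
    for cmd in restore_commands(sandbox_pid):
        print("+", shlex.join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)
```

Fetch the sandbox PID first (the cat /var/run/sandbox.pid command from Layer 3), then call restore(pid, dry_run=False).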