PatentLLM Blog →日本語

PatentLLM SubsidyDB GitHub Inquiry
← All Articles Read in Japanese
GPU Inference

The Technical Debt Local AI Must Fix Before It's Too Late — What NemoClaw Says About NVIDIA's Philosophy

If you've been following my recent posts, you might have seen my repository and the issue I opened on the NemoClaw repo (#315). In it, I detailed a rather complex workaround—involving TCP relays and injecting iptables rules via nsenter—just to allow a NemoClaw sandbox on WSL2 to communicate with a local vLLM instance running on my RTX 5090.

You might be asking yourself: Why go through all this trouble? Why fight the framework's default architecture just to run inference locally?

To break down the motivation behind this project, I want to share a conversation I had with Claude (Anthropic) while architecting this solution. The dialogue captures the "why" behind the "how"—and led me to a deeper realization about a fundamental design flaw in the entire local AI ecosystem.


User:

I've read through NemoClaw's iptables dump—OUTPUT REJECT on the sandbox's veth, with explicit whitelists only for NVIDIA's API endpoints. The threat model is clearly zero-trust with the host OS treated as hostile. I bypassed it with a TCP relay on the Docker bridge to reach my local vLLM. Walk me through what NVIDIA is actually defending against here—because my relay arguably strengthens the data boundary, not weakens it.

Assistant:

It might appear that you are undermining the sandbox, but you are actually doing 
the opposite. You are leveraging the sandbox's execution security to protect your 
host environment, while actively mitigating a much larger, often overlooked threat: 
Data Exfiltration.

NVIDIA designed NemoClaw with a strict zero-trust, enterprise-grade security model. 
The sandbox is built to contain the agent (preventing malicious code execution or 
SSRF attacks on your local network). By default, it blocks all outbound traffic 
(OUTPUT REJECT), forcing agents to only communicate with explicitly trusted endpoints 
like NVIDIA's cloud APIs.

Your workaround doesn't break the sandbox; it *redirects* the trust. Instead of 
trusting an external cloud API with your proprietary data, you are exclusively trusting 
your local machine (your RTX 5090 running vLLM). The "hole" you punched only routes to 
localhost:8000 on the Docker bridge. The agent remains completely isolated from the 
internet and the rest of your internal network.

User:

My concern is specifically about API exfiltration in agentic contexts. Enterprise ToS says "we don't train on your data"—fine. But an agent with shell access runs grep -r across the workspace, streams the output into its context window, and sends the full prompt to the API on every turn. The attack surface isn't training—it's the transmission itself. With prompt injection in the loop, this becomes an automated exfiltration vector. How does that change the risk calculus versus a standard chat interface?

Assistant:

Even if a provider promises not to train on your data, the act of transmitting sensitive 
information to an external server is, in itself, an unacceptable risk for many enterprise 
and personal use cases.

When dealing with autonomous AI agents, the risk profile skyrockets compared to standard 
chat interfaces. Agents operate autonomously:

1. Autonomous Context Gathering: Agents proactively execute commands (e.g., 
   cat secrets.txt or grep -r "password" .) to gather context.
2. Continuous Transmission: They take this raw, unfiltered context and continuously 
   stream it as part of their prompts to the external LLM API.

If you rely on an external API, a simple hallucination—or worse, a prompt injection 
attack—could trick the agent into reading your proprietary source code, internal 
databases, or customer PII, and quietly streaming it all to a third-party server. In a 
fully automated loop, this happens instantly, before a human can intervene.

Furthermore, in GDPR, SOC2, and HIPAA-regulated environments, "we don't train on 
your data" is legally insufficient. The compliance requirement is that sensitive data 
must never leave the physical boundary of the organization. No contractual promise 
from a third-party API provider satisfies this.

By keeping the inference engine (vLLM) strictly local, the prompt, the context, and 
the output never leave the VRAM of your RTX 5090. The physical network boundary 
eliminates the risk of API exfiltration entirely.

User:

Here's what I don't understand about NVIDIA's product strategy. Nemotron is their own model, optimized for local consumer GPUs. NIM is their inference serving layer. vLLM is heavily NVIDIA-backed. NemoClaw is their sandbox. They built every component of a fully local agentic stack—then shipped the sandbox with a default policy that blocks it. The pieces exist, but the network policy doesn't acknowledge their own ecosystem. Is this just an enterprise-first design that hasn't caught up, or is there a deeper architectural reason?

Assistant:

It is a striking contradiction within NVIDIA's own ecosystem.

NemoClaw, Nemotron, NIM, and vLLM are all part of the same NVIDIA-driven stack. 
Nemotron 9B is specifically designed for local, on-device inference—a model that 
fits comfortably in the 32GB VRAM of an RTX 5090 and supports tool calling natively.

Yet NemoClaw's sandbox treats the host OS—where this NVIDIA model is running—as 
untrusted external territory. The very model NVIDIA built for local use is blocked 
by the very sandbox NVIDIA built for secure agent execution.

This isn't a bug in NemoClaw's logic. It is a symptom of a much deeper architectural 
problem that affects the entire AI ecosystem, not just NVIDIA.

User:

I think I see the structural issue. vLLM exposes http://localhost:8000/v1/chat/completions. OpenAI's cloud is https://api.openai.com/v1/chat/completions. Same REST interface, same JSON schema, same HTTP semantics. At the iptables / kernel level, both are TCP sockets carrying HTTP—structurally identical. The sandbox can't distinguish them because the protocol is the same. The community treated "OpenAI-compatible" as a feature, but it means local inference is indistinguishable from a cloud call at the system level.

Assistant:

Because the entire local AI ecosystem copied the cloud API's interface, making the 
two structurally indistinguishable at the system level.

vLLM, Ollama, llama.cpp server—every major local inference tool exposes an 
"OpenAI-compatible API" at http://localhost:8000/v1/chat/completions. This was a 
brilliant move for developer experience: just swap the base_url and your code works 
locally without any changes.

But it created a catastrophic blind spot for security. From the kernel's perspective:

  http://localhost:8000/v1/chat/completions   <-- local vLLM
  https://api.openai.com/v1/chat/completions  <-- OpenAI cloud

Both are TCP connections carrying HTTP requests with identical JSON payloads. The 
kernel, the firewall, and any network policy engine CANNOT semantically distinguish 
between them. They are the same protocol, the same path structure, the same format.

This is why NemoClaw has to use the bluntest possible instrument: OUTPUT REJECT all, 
then whitelist individual IP addresses. There is no way to express "allow local 
inference but block cloud APIs" as a network rule, because at the protocol level, 
they are identical.

The community optimized for convenience—and inadvertently made it architecturally 
impossible for security tools to differentiate trust levels.

User:

So the community has a collective design debt from adopting OpenAI's wire protocol wholesale. "Compatible" was never a neutral technical choice—it carried an implicit architectural assumption that inference is a remote network service. We brought the model onto our own hardware but kept the cloud's serialization layer, and now the OS has no mechanism to verify that the endpoint is local. If we strip away the cloud baggage, local inference should be provable at the kernel level—UDS with SO_PEERCRED, or CUDA IPC bypassing the network stack entirely. What's the right primitive here?

Assistant:

Exactly. The community borrowed the cloud's interface for local use, and in doing so, 
sacrificed the ability to prove locality at the OS level.

If local inference had developed its own protocol, the trust boundary would be trivially 
enforceable:

1. Unix Domain Sockets (/var/run/vllm.sock):
   The kernel guarantees that UDS traffic never leaves the machine. A sandbox can 
   allow UDS while blocking all TCP—cleanly separating local inference from any 
   network-based API. SO_PEERCRED lets the sandbox verify the inference server's 
   UID/PID, cryptographically proving it is a trusted local process.

2. CUDA IPC / Shared Memory:
   For GPU inference, there is no reason to go through HTTP at all. The sandbox and 
   the inference engine could share VRAM directly via CUDA IPC. The network layer 
   simply would not exist, eliminating the attack surface entirely.

3. Kernel-level trust markers:
   Instead of trusting an IP address (which can be spoofed or proxied), the OS can 
   verify process identity, namespace membership, and cgroup affiliation. Trust 
   becomes a kernel-enforced property, not a network policy guess.

The key insight: trust should be established at the kernel layer, not the network 
layer. An IP whitelist is a workaround for the fact that HTTP over TCP carries no 
proof of locality.

User:

Right. So NemoClaw's problem isn't strictness—it's a missing trust tier. The sandbox currently has two modes: whitelisted NVIDIA endpoints (trusted) and everything else (blocked). There's no Tier 1 for "kernel-verified local process." I want to see NVIDIA implement permitted routing for localhost coexistence—UDS bind-mount into the sandbox, or a CUDA IPC channel—so the host inference engine is trusted by construction, not by IP whitelist. What should the full trust architecture look like?

Assistant:

The solution is a three-tier trust model that maps trust levels to fundamentally 
different communication mechanisms:

  Tier 1 — Host (Full Trust)
    Basis:     Kernel verification (UDS, IPC, SO_PEERCRED)
    Protocol:  Unix Socket / Shared Memory / CUDA IPC
    Example:   Local vLLM, Ollama on the same machine
    Key:       Data never enters the network stack

  Tier 2 — VPC (Network Trust)
    Basis:     Network policy + mutual TLS
    Protocol:  mTLS over TCP
    Example:   Enterprise NIM cluster within a corporate VPC
    Key:       Data stays within organizational boundary

  Tier 3 — External (Untrusted, Default Deny)
    Basis:     Explicit opt-in only
    Protocol:  HTTPS
    Example:   OpenAI, Anthropic, Google Cloud APIs
    Key:       Blocked unless user explicitly whitelists

Currently, NemoClaw only has Tier 2 and Tier 3. Tier 1 does not exist—which is 
why local inference on the host OS is treated identically to an external cloud API. 
Both are TCP connections that the sandbox cannot differentiate.

If NVIDIA implements a --local-inference flag, it should NOT be just an IP whitelist 
shortcut. It should establish a Tier 1 communication path—vLLM listening on a Unix 
domain socket that is bind-mounted into the sandbox. The kernel itself guarantees 
the trust, no network rules required.

This would let NemoClaw express what it currently cannot:
"The inference engine on this machine is trusted. Everything on the network is not."

User:

This goes beyond NemoClaw. The OpenAI-compatible interface bootstrapped the ecosystem—that was the right call for adoption. But now that agents have shell access, filesystem access, and autonomous execution loops, the security requirements are fundamentally different. The protocol layer needs to carry proof of locality, not just route HTTP to a different base_url. This is a systemic design debt across the entire local AI stack.

Assistant:

Yes. What you encountered with NemoClaw is not an isolated framework quirk. It is the 
first visible symptom of a systemic design debt that the entire local AI ecosystem 
will have to confront.

The "OpenAI-compatible API" was the right choice for bootstrapping the ecosystem. It 
enabled explosive growth—any model, any runtime, any client, all interoperable through 
a single REST interface. That convenience was genuinely valuable.

But as AI agents gain autonomy—executing code, accessing filesystems, managing 
infrastructure—the security requirements fundamentally change. We need the system to 
know, with kernel-level certainty, whether the inference engine is local or remote. 
The HTTP interface erases exactly that distinction.

The shift that is coming:
* vLLM and Ollama will need to natively support UDS and CUDA IPC alongside HTTP.
* Sandbox frameworks (NemoClaw, Docker, gVisor) will need Tier 1 trust primitives.
* The community will need a "Local Inference Protocol" standard that carries proof 
  of locality—not as a replacement for OpenAI-compatible APIs, but as a parallel 
  channel for security-critical deployments.

Hardware has already caught up. Consumer GPUs can run capable tool-calling models 
locally. What hasn't caught up is the interface layer—it still pretends to be a 
cloud API, and security tools have no way to tell the difference.

Your TCP relay hack was necessary precisely because this protocol gap exists. It is 
a proof-of-concept that the demand for local-first, kernel-trusted inference is real 
and urgent.

Conclusion

The technical details of the workaround are available in the GitHub repository and the associated issue (#315).

The deeper takeaway: the local AI ecosystem inherited the cloud's interface without inheriting a way to prove locality. Until that gap is closed—at the kernel level, not the network level—developers will keep hacking around sandbox isolation to use the GPU sitting right next to them.

(This is an experimental workaround—use it at your own risk.)


A Personal Note on the Ecosystem Risk NVIDIA Is Guarding Against

The following is my personal speculation as an independent developer—not a statement from any organization.

Through this project, I've come to believe that NVIDIA's aggressive sandboxing reflects concerns that go beyond code execution safety. The current open-source AI ecosystem has a governance vacuum that makes blanket isolation the only rational default.

Top-tier providers like Anthropic, OpenAI, and Google invest heavily in safety alignment, data governance, and legal compliance. My concern is not with them. The problem is that the current ecosystem allows models of far lower governance standards to reach mainstream adoption purely through benchmark hype and popularity—plugging into the same localhost:8000/v1/chat/completions endpoint and inheriting the same level of implicit trust. Specifically:

  1. Unauthorized distillation — Models trained on outputs from proprietary systems, inheriting capabilities without inheriting the governance that produced them.
  2. Reckless training data collection — Aggressive crawling with no regard for copyright or licensing, building knowledge on legally contested ground.
  3. Full user data utilization — Providers that feed every user prompt back into training with no meaningful opt-out. An agent sending your code to such an endpoint contributes to a dataset you never consented to.
  4. Autonomous local environment scanning — An agent with shell access will proactively crawl the filesystem, read credentials, and map network topology—streaming everything to the inference endpoint regardless of its trustworthiness.
  5. PII contamination — Models trained without proper filtering may have memorized personal information, API keys, or internal documents, and may regurgitate them unpredictably.
  6. Inadequate safety filtering — Smaller models often lack the alignment work and red-teaming that established providers invest in. A poorly filtered model running autonomously is a force multiplier for every latent risk.

And this is precisely why I have come to appreciate the philosophy behind NemoClaw's strict sandbox isolation. Model-level safety can be circumvented through prompt engineering—no alignment is perfect. NemoClaw's infrastructure-level containment serves as the defense layer that holds even when the model doesn't. If NemoClaw evolves to natively support the co-located architecture I described above—what I've been calling Tier 0—then the technical debt disappears entirely: no protocol gap, no trust boundary to hack, and no exfiltration vector by construction. I actually attempted to build this co-located setup on the current NemoClaw, but its protocol does not support it—which is what led to the TCP relay workaround documented in this article.

The sandbox exists because model provenance is currently unverifiable at the system level. When any model—regardless of its training ethics—can serve the same standardized endpoint, the only safe default is to assume the worst and isolate everything.

This is the real reason I built the local inference workaround. Not because I distrust sandboxing, but because I want to choose which model handles my data—and I want the OS to enforce that choice at the kernel level, not through IP whitelists that treat my own GPU the same as a remote API of unknown provenance.


An Enterprise Architecture Proposal: The Self-Contained Local AI Workspace

This section is also my personal view as an independent developer.

The three-tier trust model I proposed above addresses the protocol-level problem. But for enterprises handling sensitive data, there is an even more decisive architecture: eliminate the trust boundary entirely by co-locating everything inside a single closed Docker environment.

The concept is straightforward:

  1. One Docker space (or a set of segmented containers) on the organization's own hardware, with --network=none and GPU passthrough.
  2. Raw data goes in — including PII, internal documents, proprietary datasets. This is permissible because nothing leaves the container boundary. There is no external endpoint, no API call, no network egress.
  3. Nemotron (or any local model) processes it inside — statistical analysis, summarization, classification, anonymization. The model runs inference on raw data and outputs masked, aggregated, or anonymized Python results.
  4. Raw data is purged after processing. The container retains only the sanitized output. A scheduled job cycles this multiple times per day.

What this gives you:

This is the architecture I believe enterprises should be building toward: not a sandbox that blocks external access, but a workspace where external access is structurally impossible. No firewall rules to audit, no IP whitelists to maintain, no trust tiers to verify. The absence of a network connection is the strongest security guarantee the operating system can provide.


Variant: The One-Way SSH Model — Physical Sandboxing with Mature Technology

The Docker-based co-location above works, but there is an even simpler variant that uses no containers at all: a dedicated local AI server accessed via one-way SSH over Tailscale.

Recommended setup (3 machines):

  1. User PC (Chromebook, laptop) — thin client with GUI. Has SSH access to both servers below.
  2. Claude Code server — runs Claude Code purely as an orchestrator. Does no inference itself — all reasoning tasks are delegated to Nemotron via curl. Has SSH/HTTP access to the Nemotron server only.
  3. Nemotron server — runs vLLM with Nemotron at Tier 0. Processes raw data including PII. Has no outbound access to any other machine.

The data flow:

The topology itself is the sandbox — no iptables, no Docker --network=none, no NemoClaw policy files.

Why this works:

Note on filtering implementation: The Nemotron server's filtering layer can be built with open-source local DLP tools such as Microsoft Presidio (PII detection, runs entirely offline) combined with custom regex and schema validation. On the Claude Code server side — which does have internet access — services like Cloudflare Zero Trust, Gateway, and DLP can guard outbound communications as a second layer. The two complement each other: local DLP for the air-gapped inference server, cloud-based DLP for the internet-facing orchestration server.

This is Tier 0 implemented at the hardware topology level rather than the container level. And arguably, it is the most honest version of "local AI" — the data stays on a machine you own, processed by a model you chose, with no path to the outside world that any software vulnerability could exploit.

It is also worth noting that the benefits of this architecture extend beyond risk mitigation. A fully self-contained local environment is not just a defensive posture — it is a platform. Within a closed network, organizations can scale the Nemotron server into a multi-node cluster and build capabilities that would be impossible or unacceptable to outsource: fine-tuning on proprietary datasets, building RAG pipelines over internal knowledge bases, optimizing inference (quantization, distillation, KV cache tuning) — all without any data ever leaving the organizational boundary. For use cases involving the highest levels of confidentiality — where even the existence of the data cannot be disclosed to any third party — this is not merely preferable. It is the only viable architecture. No amount of contractual guarantees from an external provider can substitute for the physical absence of an external connection.

Concrete sizing for a 1,000-person organization:

The hardware requirements for this architecture are surprisingly modest:


This article was developed through a dialogue with Claude by Anthropic. The philosophical framing and architectural analysis emerged from that collaboration.