GPU Inference

The Technical Debt Local AI Must Fix Before It's Too Late — What NemoClaw Says About NVIDIA's Philosophy

If you've been following my recent posts, you might have seen my repository and the issue I opened on the NemoClaw repo (#315). In it, I detailed a rather complex workaround—involving TCP relays and injecting iptables rules via nsenter—just to allow a NemoClaw sandbox on WSL2 to communicate with a local vLLM instance running on my RTX 5090.

You might be asking yourself: Why go through all this trouble? Why fight the framework's default architecture just to run inference locally?

To break down the motivation behind this project, I want to share a conversation I had with Claude (Anthropic) while architecting this solution. The dialogue captures the "why" behind the "how"—and led me to a deeper realization about a fundamental design flaw in the entire local AI ecosystem.

User:

I've read through NemoClaw's iptables dump—OUTPUT REJECT on the sandbox's veth, with explicit whitelists only for NVIDIA's API endpoints. The threat model is clearly zero-trust with the host OS treated as hostile. I bypassed it with a TCP relay on the Docker bridge to reach my local vLLM. Walk me through what NVIDIA is actually defending against here—because my relay arguably strengthens the data boundary, not weakens it.

Assistant:

It might appear that you are undermining the sandbox, but you are actually doing 
the opposite. You are leveraging the sandbox's execution security to protect your 
host environment, while actively mitigating a much larger, often overlooked threat: 
Data Exfiltration.

NVIDIA designed NemoClaw with a strict zero-trust, enterprise-grade security model. 
The sandbox is built to contain the agent (preventing malicious code execution or 
SSRF attacks on your local network). By default, it blocks all outbound traffic 
(OUTPUT REJECT), forcing agents to only communicate with explicitly trusted endpoints 
like NVIDIA's cloud APIs.

Your workaround doesn't break the sandbox; it *redirects* the trust. Instead of 
trusting an external cloud API with your proprietary data, you are exclusively trusting 
your local machine (your RTX 5090 running vLLM). The "hole" you punched only routes to 
localhost:8000 on the Docker bridge. The agent remains completely isolated from the 
internet and the rest of your internal network.

User:

My concern is specifically about API exfiltration in agentic contexts. Enterprise ToS says "we don't train on your data"—fine. But an agent with shell access runs grep -r across the workspace, streams the output into its context window, and sends the full prompt to the API on every turn. The attack surface isn't training—it's the transmission itself. With prompt injection in the loop, this becomes an automated exfiltration vector. How does that change the risk calculus versus a standard chat interface?

Assistant:

Even if a provider promises not to train on your data, the act of transmitting sensitive 
information to an external server is, in itself, an unacceptable risk for many enterprise 
and personal use cases.

When dealing with autonomous AI agents, the risk profile skyrockets compared to standard 
chat interfaces. Agents operate autonomously:

1. Autonomous Context Gathering: Agents proactively execute commands (e.g., 
   cat secrets.txt or grep -r "password" .) to gather context.
2. Continuous Transmission: They take this raw, unfiltered context and continuously 
   stream it as part of their prompts to the external LLM API.

If you rely on an external API, a simple hallucination—or worse, a prompt injection 
attack—could trick the agent into reading your proprietary source code, internal 
databases, or customer PII, and quietly streaming it all to a third-party server. In a 
fully automated loop, this happens instantly, before a human can intervene.

Furthermore, in GDPR, SOC2, and HIPAA-regulated environments, "we don't train on 
your data" is legally insufficient. The compliance requirement is that sensitive data 
must never leave the physical boundary of the organization. No contractual promise 
from a third-party API provider satisfies this.

By keeping the inference engine (vLLM) strictly local, the prompt, the context, and 
the output never leave the VRAM of your RTX 5090. The physical network boundary 
eliminates the risk of API exfiltration entirely.

User:

Here's what I don't understand about NVIDIA's product strategy. Nemotron is their own model, optimized for local consumer GPUs. NIM is their inference serving layer. vLLM is heavily NVIDIA-backed. NemoClaw is their sandbox. They built every component of a fully local agentic stack—then shipped the sandbox with a default policy that blocks it. The pieces exist, but the network policy doesn't acknowledge their own ecosystem. Is this just an enterprise-first design that hasn't caught up, or is there a deeper architectural reason?

Assistant:

It is a striking contradiction within NVIDIA's own ecosystem.

NemoClaw, Nemotron, NIM, and vLLM are all part of the same NVIDIA-driven stack. 
Nemotron 9B is specifically designed for local, on-device inference—a model that 
fits comfortably in the 32GB VRAM of an RTX 5090 and supports tool calling natively.

Yet NemoClaw's sandbox treats the host OS—where this NVIDIA model is running—as 
untrusted external territory. The very model NVIDIA built for local use is blocked 
by the very sandbox NVIDIA built for secure agent execution.

This isn't a bug in NemoClaw's logic. It is a symptom of a much deeper architectural 
problem that affects the entire AI ecosystem, not just NVIDIA.

User:

I think I see the structural issue. vLLM exposes http://localhost:8000/v1/chat/completions. OpenAI's cloud is https://api.openai.com/v1/chat/completions. Same REST interface, same JSON schema, same HTTP semantics. At the iptables / kernel level, both are TCP sockets carrying HTTP—structurally identical. The sandbox can't distinguish them because the protocol is the same. The community treated "OpenAI-compatible" as a feature, but it means local inference is indistinguishable from a cloud call at the system level.

Assistant:

Because the entire local AI ecosystem copied the cloud API's interface, making the 
two structurally indistinguishable at the system level.

vLLM, Ollama, llama.cpp server—every major local inference tool exposes an 
"OpenAI-compatible API" at http://localhost:8000/v1/chat/completions. This was a 
brilliant move for developer experience: just swap the base_url and your code works 
locally without any changes.

But it created a catastrophic blind spot for security. From the kernel's perspective:

  http://localhost:8000/v1/chat/completions   <-- local vLLM
  https://api.openai.com/v1/chat/completions  <-- OpenAI cloud

Both are TCP connections carrying HTTP requests with identical JSON payloads. The 
kernel, the firewall, and any network policy engine CANNOT semantically distinguish 
between them. They are the same protocol, the same path structure, the same format.

This is why NemoClaw has to use the bluntest possible instrument: OUTPUT REJECT all, 
then whitelist individual IP addresses. There is no way to express "allow local 
inference but block cloud APIs" as a network rule, because at the protocol level, 
they are identical.

The community optimized for convenience—and inadvertently made it architecturally 
impossible for security tools to differentiate trust levels.

User:

So the community has a collective design debt from adopting OpenAI's wire protocol wholesale. "Compatible" was never a neutral technical choice—it carried an implicit architectural assumption that inference is a remote network service. We brought the model onto our own hardware but kept the cloud's serialization layer, and now the OS has no mechanism to verify that the endpoint is local. If we strip away the cloud baggage, local inference should be provable at the kernel level—UDS with SO_PEERCRED, or CUDA IPC bypassing the network stack entirely. What's the right primitive here?

Assistant:

Exactly. The community borrowed the cloud's interface for local use, and in doing so, 
sacrificed the ability to prove locality at the OS level.

If local inference had developed its own protocol, the trust boundary would be trivially 
enforceable:

1. Unix Domain Sockets (/var/run/vllm.sock):
   The kernel guarantees that UDS traffic never leaves the machine. A sandbox can 
   allow UDS while blocking all TCP—cleanly separating local inference from any 
   network-based API. SO_PEERCRED lets the sandbox verify the inference server's 
   UID/PID, cryptographically proving it is a trusted local process.

2. CUDA IPC / Shared Memory:
   For GPU inference, there is no reason to go through HTTP at all. The sandbox and 
   the inference engine could share VRAM directly via CUDA IPC. The network layer 
   simply would not exist, eliminating the attack surface entirely.

3. Kernel-level trust markers:
   Instead of trusting an IP address (which can be spoofed or proxied), the OS can 
   verify process identity, namespace membership, and cgroup affiliation. Trust 
   becomes a kernel-enforced property, not a network policy guess.

The key insight: trust should be established at the kernel layer, not the network 
layer. An IP whitelist is a workaround for the fact that HTTP over TCP carries no 
proof of locality.

User:

Right. So NemoClaw's problem isn't strictness—it's a missing trust tier. The sandbox currently has two modes: whitelisted NVIDIA endpoints (trusted) and everything else (blocked). There's no Tier 1 for "kernel-verified local process." I want to see NVIDIA implement permitted routing for localhost coexistence—UDS bind-mount into the sandbox, or a CUDA IPC channel—so the host inference engine is trusted by construction, not by IP whitelist. What should the full trust architecture look like?

Assistant:

The solution is a three-tier trust model that maps trust levels to fundamentally 
different communication mechanisms:

  Tier 1 — Host (Full Trust)
    Basis:     Kernel verification (UDS, IPC, SO_PEERCRED)
    Protocol:  Unix Socket / Shared Memory / CUDA IPC
    Example:   Local vLLM, Ollama on the same machine
    Key:       Data never enters the network stack

  Tier 2 — VPC (Network Trust)
    Basis:     Network policy + mutual TLS
    Protocol:  mTLS over TCP
    Example:   Enterprise NIM cluster within a corporate VPC
    Key:       Data stays within organizational boundary

  Tier 3 — External (Untrusted, Default Deny)
    Basis:     Explicit opt-in only
    Protocol:  HTTPS
    Example:   OpenAI, Anthropic, Google Cloud APIs
    Key:       Blocked unless user explicitly whitelists

Currently, NemoClaw only has Tier 2 and Tier 3. Tier 1 does not exist—which is 
why local inference on the host OS is treated identically to an external cloud API. 
Both are TCP connections that the sandbox cannot differentiate.

If NVIDIA implements a --local-inference flag, it should NOT be just an IP whitelist 
shortcut. It should establish a Tier 1 communication path—vLLM listening on a Unix 
domain socket that is bind-mounted into the sandbox. The kernel itself guarantees 
the trust, no network rules required.

This would let NemoClaw express what it currently cannot:
"The inference engine on this machine is trusted. Everything on the network is not."

User:

This goes beyond NemoClaw. The OpenAI-compatible interface bootstrapped the ecosystem—that was the right call for adoption. But now that agents have shell access, filesystem access, and autonomous execution loops, the security requirements are fundamentally different. The protocol layer needs to carry proof of locality, not just route HTTP to a different base_url. This is a systemic design debt across the entire local AI stack.

Assistant:

Yes. What you encountered with NemoClaw is not an isolated framework quirk. It is the 
first visible symptom of a systemic design debt that the entire local AI ecosystem 
will have to confront.

The "OpenAI-compatible API" was the right choice for bootstrapping the ecosystem. It 
enabled explosive growth—any model, any runtime, any client, all interoperable through 
a single REST interface. That convenience was genuinely valuable.

But as AI agents gain autonomy—executing code, accessing filesystems, managing 
infrastructure—the security requirements fundamentally change. We need the system to 
know, with kernel-level certainty, whether the inference engine is local or remote. 
The HTTP interface erases exactly that distinction.

The shift that is coming:
* vLLM and Ollama will need to natively support UDS and CUDA IPC alongside HTTP.
* Sandbox frameworks (NemoClaw, Docker, gVisor) will need Tier 1 trust primitives.
* The community will need a "Local Inference Protocol" standard that carries proof 
  of locality—not as a replacement for OpenAI-compatible APIs, but as a parallel 
  channel for security-critical deployments.

Hardware has already caught up. Consumer GPUs can run capable tool-calling models 
locally. What hasn't caught up is the interface layer—it still pretends to be a 
cloud API, and security tools have no way to tell the difference.

Your TCP relay hack was necessary precisely because this protocol gap exists. It is 
a proof-of-concept that the demand for local-first, kernel-trusted inference is real 
and urgent.

Conclusion

The technical details of the workaround are available in the GitHub repository and the associated issue (#315).

The deeper takeaway: the local AI ecosystem inherited the cloud's interface without inheriting a way to prove locality. Until that gap is closed—at the kernel level, not the network level—developers will keep hacking around sandbox isolation to use the GPU sitting right next to them.

(This is an experimental workaround—use it at your own risk.)

A Personal Note on the Ecosystem Risk NVIDIA Is Guarding Against

The following is my personal speculation as an independent developer—not a statement from any organization.

Through this project, I've come to believe that NVIDIA's aggressive sandboxing reflects concerns that go beyond code execution safety. The current open-source AI ecosystem has a governance vacuum that makes blanket isolation the only rational default.

Top-tier providers like Anthropic, OpenAI, and Google invest heavily in safety alignment, data governance, and legal compliance. My concern is not with them. The problem is that the current ecosystem allows models of far lower governance standards to reach mainstream adoption purely through benchmark hype and popularity—plugging into the same localhost:8000/v1/chat/completions endpoint and inheriting the same level of implicit trust. Specifically:

Unauthorized distillation — Models trained on outputs from proprietary systems, inheriting capabilities without inheriting the governance that produced them.
Reckless training data collection — Aggressive crawling with no regard for copyright or licensing, building knowledge on legally contested ground.
Full user data utilization — Providers that feed every user prompt back into training with no meaningful opt-out. An agent sending your code to such an endpoint contributes to a dataset you never consented to.
Autonomous local environment scanning — An agent with shell access will proactively crawl the filesystem, read credentials, and map network topology—streaming everything to the inference endpoint regardless of its trustworthiness.
PII contamination — Models trained without proper filtering may have memorized personal information, API keys, or internal documents, and may regurgitate them unpredictably.
Inadequate safety filtering — Smaller models often lack the alignment work and red-teaming that established providers invest in. A poorly filtered model running autonomously is a force multiplier for every latent risk.

And this is precisely why I have come to appreciate the philosophy behind NemoClaw's strict sandbox isolation. Model-level safety can be circumvented through prompt engineering—no alignment is perfect. NemoClaw's infrastructure-level containment serves as the defense layer that holds even when the model doesn't. If NemoClaw evolves to natively support the co-located architecture I described above—what I've been calling Tier 0—then the technical debt disappears entirely: no protocol gap, no trust boundary to hack, and no exfiltration vector by construction. I actually attempted to build this co-located setup on the current NemoClaw, but its protocol does not support it—which is what led to the TCP relay workaround documented in this article.

The sandbox exists because model provenance is currently unverifiable at the system level. When any model—regardless of its training ethics—can serve the same standardized endpoint, the only safe default is to assume the worst and isolate everything.

This is the real reason I built the local inference workaround. Not because I distrust sandboxing, but because I want to choose which model handles my data—and I want the OS to enforce that choice at the kernel level, not through IP whitelists that treat my own GPU the same as a remote API of unknown provenance.

An Enterprise Architecture Proposal: The Self-Contained Local AI Workspace

This section is also my personal view as an independent developer.

The three-tier trust model I proposed above addresses the protocol-level problem. But for enterprises handling sensitive data, there is an even more decisive architecture: eliminate the trust boundary entirely by co-locating everything inside a single closed Docker environment.

The concept is straightforward:

One Docker space (or a set of segmented containers) on the organization's own hardware, with --network=none and GPU passthrough.
Raw data goes in — including PII, internal documents, proprietary datasets. This is permissible because nothing leaves the container boundary. There is no external endpoint, no API call, no network egress.
Nemotron (or any local model) processes it inside — statistical analysis, summarization, classification, anonymization. The model runs inference on raw data and outputs masked, aggregated, or anonymized Python results.
Raw data is purged after processing. The container retains only the sanitized output. A scheduled job cycles this multiple times per day.

What this gives you:

localhost, but used like a cloud — Teams access the Docker environment as if it were a managed AI service. Same developer experience as a cloud API, but running on the organization's own closed network. The data never touches an external server.
No system integrator dependency — The typical enterprise path is to hire a vendor to build an "AI platform." But that vendor's solution likely calls cloud APIs behind the scenes — and you have no way to monitor or verify whether your data is being forwarded externally. With a self-contained Docker workspace, the --network=none flag is your audit. There is nothing to monitor because there is nowhere for data to go.
No protocol debt — Inside the container, there is no distinction between "local" and "remote" inference. The model and the agent share the same process space. The entire HTTP/TCP debate from earlier in this article becomes irrelevant — you have removed the network layer, not fixed it.
Compliance by construction — GDPR, SOC2, HIPAA all require demonstrating that sensitive data stays within organizational boundaries. "The container has no network interface" is the simplest possible compliance proof.

This is the architecture I believe enterprises should be building toward: not a sandbox that blocks external access, but a workspace where external access is structurally impossible. No firewall rules to audit, no IP whitelists to maintain, no trust tiers to verify. The absence of a network connection is the strongest security guarantee the operating system can provide.

Variant: The One-Way SSH Model — Physical Sandboxing with Mature Technology

The Docker-based co-location above works, but there is an even simpler variant that uses no containers at all: a dedicated local AI server accessed via one-way SSH over Tailscale.

Recommended setup (3 machines):

User PC (Chromebook, laptop) — thin client with GUI. Has SSH access to both servers below.
Claude Code server — runs Claude Code purely as an orchestrator. Does no inference itself — all reasoning tasks are delegated to Nemotron via curl. Has SSH/HTTP access to the Nemotron server only.
Nemotron server — runs vLLM with Nemotron at Tier 0. Processes raw data including PII. Has no outbound access to any other machine.

The data flow:

Raw data ingestion: Users push raw data (including PII, internal documents, proprietary datasets) directly into the Nemotron server via SSH/SCP from their PCs. This is the only inbound path for sensitive data.
Orchestration: Claude Code sends tasks to Nemotron via curl (e.g., "analyze this dataset," "summarize these documents"). Claude Code never touches the raw data — it only issues instructions and receives filtered results.
Filtered output is shared: Nemotron processes the raw data and produces sanitized output — column structures, statistical summaries, anonymized results, .env-ignored code. This filtered output is made available to both the human user and Claude Code simultaneously. Both see the same results, enabling the human to review what Claude Code is working with.
One-way data boundary: Raw data enters the Nemotron server but never leaves it. The Nemotron server accepts inbound SSH connections (for data ingestion and filtered output retrieval) but initiates no outbound connections — it has no route to user PCs, the Claude Code server, or the internet. Only the filtered output pipeline crosses the boundary, and only when pulled by an authorized client.

The topology itself is the sandbox — no iptables, no Docker --network=none, no NemoClaw policy files.

Why this works:

The sandbox is physical, not software. There is no container escape to worry about, no firewall rule to misconfigure, no prompt engineering that can open a network route that doesn't exist.
Output filtering is a structural chokepoint. Nemotron sees the raw data and can report on it ("this dataset has 12 columns, 3 contain PII"), but every output passes through sanitization before it reaches anyone. Raw data never traverses the boundary.
Claude Code orchestrates, Nemotron reasons. Claude Code focuses purely on task decomposition and workflow — all inference, analysis, and data processing is performed by Nemotron. The separation is clean: Claude Code never sees raw data, and Nemotron never reaches the outside world.
Humans and AI see the same output. There is no hidden intermediate state. The filtered results are a shared view, which means the human can audit exactly what Claude Code is basing its decisions on.
Internet access is separated by design. The Nemotron server has no connection to the public internet — by design, not by firewall rule. Web searches, URL fetching, package downloads, and any external API calls are handled exclusively by the Claude Code server. The Nemotron server is an air-gapped inference engine that only receives tasks and returns filtered results.
Why Claude Code as orchestrator: The reason for choosing Claude Code over other agentic frameworks' built-in capabilities is not just about sandboxing philosophy. Claude Code has a significantly superior ability to decompose natural-language instructions into structured, multi-step task execution — orchestrating file operations, shell commands, API calls, and iterative refinement in a way that current open-source agent frameworks do not match. This orchestration capability is, in my view, underappreciated even among engineers who are closely following the agentic AI space. From hands-on experience with multiple agentic systems, Claude Code is in a different class when it comes to task execution reliability and operational safety. This is not a criticism of any single project; it is an observation that the orchestration layer deserves the same scrutiny as the inference layer, and currently Claude Code sets the bar. In this architecture, Claude Code handles what it does best (task orchestration, search, external communication) while Nemotron handles what it does best (private inference on sensitive data). The separation is not a compromise — it is an optimization.
Output delivery is pull-based, not push. The Nemotron server never initiates outbound connections — not even to deliver results. Instead, the server maintains a three-stage filesystem:
/data/raw/ — raw input data, readable only by the Nemotron process.
/data/staging/ — Nemotron writes inference results here.
/data/filtered/ — a deterministic filter process validates staging output and writes sanitized results here. Claude Code and users SSH in to pull from this directory only. Nemotron has no write access to /data/filtered/. The filter runs as a separate OS user with exclusive write permission. This means even if the model is compromised via prompt injection, it cannot bypass the filter — the bypass is prevented by Unix file permissions, not by AI judgment.
Filtering is deterministic, not AI-based. This is critical. If the filter itself were an LLM, it could be prompt-injected into passing through raw data. Instead, the filter is implemented as rule-based code: regex patterns for PII (email, phone, national ID), strict output schema validation (only predefined JSON/CSV structures pass through), and an allowlist approach where free-text passthrough is blocked by default. The filter's correctness is auditable because it is deterministic — the same input always produces the same output.
Every component is mature technology. SSH, Tailscale, curl, Python, vLLM. Nothing experimental. Nothing that requires a workaround. The entire protocol debt discussion from earlier in this article becomes irrelevant — not because we fixed the protocol, but because we removed the network boundary where the protocol mattered.

Note on filtering implementation: The Nemotron server's filtering layer can be built with open-source local DLP tools such as Microsoft Presidio (PII detection, runs entirely offline) combined with custom regex and schema validation. On the Claude Code server side — which does have internet access — services like Cloudflare Zero Trust, Gateway, and DLP can guard outbound communications as a second layer. The two complement each other: local DLP for the air-gapped inference server, cloud-based DLP for the internet-facing orchestration server.

This is Tier 0 implemented at the hardware topology level rather than the container level. And arguably, it is the most honest version of "local AI" — the data stays on a machine you own, processed by a model you chose, with no path to the outside world that any software vulnerability could exploit.

It is also worth noting that the benefits of this architecture extend beyond risk mitigation. A fully self-contained local environment is not just a defensive posture — it is a platform. Within a closed network, organizations can scale the Nemotron server into a multi-node cluster and build capabilities that would be impossible or unacceptable to outsource: fine-tuning on proprietary datasets, building RAG pipelines over internal knowledge bases, optimizing inference (quantization, distillation, KV cache tuning) — all without any data ever leaving the organizational boundary. For use cases involving the highest levels of confidentiality — where even the existence of the data cannot be disclosed to any third party — this is not merely preferable. It is the only viable architecture. No amount of contractual guarantees from an external provider can substitute for the physical absence of an external connection.

Concrete sizing for a 1,000-person organization:

The hardware requirements for this architecture are surprisingly modest:

Nemotron server: ~20x RTX 5090 (32GB VRAM each) in 1–2 racks. With Nemotron 9B quantized to INT4 (~5GB), each GPU retains ~27GB for KV cache. At 4K context length, each GPU handles 5–10 simultaneous inference streams. In typical enterprise usage, only 5–10% of users are actively generating at any moment, so one GPU comfortably serves 50–100 users. 20 GPUs cover 1,000 users with comfortable headroom for peak loads. vLLM's continuous batching and PagedAttention maximize throughput across concurrent requests.
Claude Code orchestration: ~10 Chromebook Plus devices (~$700 each). These are stateless orchestrators running Claude Code — they issue curl commands and relay filtered results. They carry no sensitive data and are trivially replaceable. Scale the count with the number of concurrent active users, not total headcount.
Networking: Tailscale. VPN mesh setup takes minutes, not weeks. Zero infrastructure overhead.
Total initial cost: around $50,000 USD. RTX 5090 at ~$2,000 each ($40,000), Chromebook Plus at ~$700 each ($7,000), Tailscale is free or minimal cost. That is the full price of a secure, private AI infrastructure for 1,000 people. Compare this to a typical enterprise AI platform engagement with a system integrator — which would cost 100x more, take months to deploy, and likely route your data through cloud APIs you cannot audit.

This article was developed through a dialogue with Claude by Anthropic. The philosophical framing and architectural analysis emerged from that collaboration.