Running Karpathy's autoresearch with Local LLM — Zero API Cost Autonomous AI Research

Introduction

Andrej Karpathy (OpenAI co-founder) released autoresearch — an experiment where an LLM autonomously modifies a GPT training script, runs 5-minute experiments, keeps what improves val_bpb, and discards what doesn't. The original uses Claude Code (cloud API) as the researcher.

SohniSwatantra's fork replaces Claude Code with Qwen 3.5 9B running locally via ollama. Single GPU, zero API cost, fully autonomous.

Architecture: LLM + Training on One GPU

The key innovation is running both the LLM agent and GPT training on the same GPU:

GPU (48GB VRAM)
├── Qwen 3.5 9B via ollama (~12GB)
└── GPT training via train.py (~35GB)

To fit within VRAM constraints, hyperparameters are adjusted from the original:

| Component | Original | This Fork |
|---|---|---|
| Depth | 8 layers | 4 layers |
| Device batch size | 128 | 64 |
| Total batch tokens | 524K | 65K |
| Window pattern | SSSL | L |

The model is smaller, but the agent compensates by running more experiments.
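The effect of the downsizing can be sketched as a quick footprint check (the variable names below are illustrative, not the fork's actual config identifiers):

```python
# Illustrative sketch (not the fork's actual config names): reduced depth
# and batch size roughly quarter the per-step training footprint, which
# is what frees VRAM for the ollama-hosted agent on the same GPU.
original = {"depth": 8, "device_batch_size": 128,
            "total_batch_tokens": 524_288, "window_pattern": "SSSL"}
fork = {"depth": 4, "device_batch_size": 64,
        "total_batch_tokens": 65_536, "window_pattern": "L"}

def footprint_ratio(a, b):
    """Crude layers-times-batch ratio between two configs."""
    return (b["depth"] * b["device_batch_size"]) / (a["depth"] * a["device_batch_size"])
```

With these numbers the fork runs each step at roughly a quarter of the original's layers-times-batch cost, trading model capacity for headroom.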

The Autonomous Research Loop

Step 1: LLM Proposes a Modification

agent.py sends the current train.py code and experiment history (results.tsv) to Qwen 3.5. The LLM proposes specific code modifications to lower val_bpb.

The prompt includes clear constraints:

- Only train.py can be modified (prepare.py is read-only)
- No new package installations
- Fixed 5-minute time budget
- ~35GB VRAM available for training
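A minimal sketch of how such a prompt might be assembled; the constraint wording below paraphrases the article and is not the fork's exact prompt text:

```python
def build_prompt(train_py: str, results_tsv: str) -> str:
    """Assemble the researcher prompt from the current script and history.
    Hypothetical reconstruction; wording paraphrases the article's rules."""
    return "\n".join([
        "You are improving a GPT training script. Propose one concrete",
        "modification to train.py that lowers val_bpb.",
        "Constraints:",
        "- Only train.py may be modified (prepare.py is read-only).",
        "- No new package installations.",
        "- Fixed 5-minute training budget.",
        "- ~35GB VRAM available for training.",
        "",
        "Current train.py:",
        train_py,
        "",
        "Experiment history (results.tsv):",
        results_tsv,
    ])
```

Feeding the full script plus the accumulated results.tsv gives the model both the code to mutate and the evidence of what has already worked.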

Step 2: Syntax Validation + Git Commit

The proposed code is validated with ast.parse(). If valid, train.py is overwritten and git committed.
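This step can be sketched as follows (a simplified reconstruction, assuming the commit message format; the real agent.py may differ):

```python
import ast
import subprocess

def validate_and_commit(code: str, path: str = "train.py") -> bool:
    """Write and commit the proposal only if it parses as valid Python."""
    try:
        ast.parse(code)  # cheap syntax gate before spending GPU time
    except SyntaxError:
        return False
    with open(path, "w") as f:
        f.write(code)
    subprocess.run(["git", "add", path], check=True)
    subprocess.run(["git", "commit", "-m", "agent: proposed modification"],
                   check=True)
    return True
```

Committing every accepted proposal makes Step 4's discard path a simple git revert.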

Step 3: Run 5-Minute Experiment

uv run train.py executes with a 10-minute timeout (normally completes in 5 minutes).
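A sketch of the timed run, assuming a plain subprocess wrapper (the command is the one named above; the function name is illustrative):

```python
import subprocess

def run_experiment(cmd=("uv", "run", "train.py"), timeout_s=600):
    """Run one training experiment.
    Returns the exit code, or None if the 10-minute timeout fires."""
    try:
        proc = subprocess.run(list(cmd), capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return None  # treated as a failed experiment
    return proc.returncode
```

The timeout is deliberately double the expected runtime, so only genuinely hung runs get killed.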

Step 4: Keep or Discard

If the run lowers val_bpb, the change is kept; otherwise the commit is reverted. A failsafe resets train.py to the baseline after 3 consecutive crashes.
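The decision logic can be modeled like this (a sketch of the keep/discard rule and the crash failsafe as described in this article; git operations are left as comments):

```python
def keep_or_discard(new_bpb, best_bpb, crashes, max_crashes=3):
    """Decide an experiment's fate; returns (best_bpb, crashes, action).
    Only models the decision -- git side effects are noted in comments."""
    if new_bpb is None:  # crash or timeout
        crashes += 1
        if crashes >= max_crashes:
            # failsafe: check out the baseline train.py, reset the counter
            return best_bpb, 0, "reset_to_baseline"
        return best_bpb, crashes, "discard"
    if new_bpb < best_bpb:  # lower val_bpb is better
        return new_bpb, 0, "keep"
    # otherwise revert the commit made in Step 2
    return best_bpb, 0, "discard"
```

Resetting the crash counter after a successful run keeps one flaky experiment from eventually triggering the baseline reset.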

agent.py Design

The entire agent is ~250 lines in a single file.

The code extraction pipeline is elegant — regex finds Python code blocks, ast.parse() validates syntax, only valid code proceeds to experimentation:

import re

def extract_code_from_response(response):
    # Find fenced Python code blocks in the LLM's reply
    blocks = re.findall(r"```(?:python)?\s*\n(.*?)```", response, re.DOTALL)
    if blocks:
        return max(blocks, key=len)  # take the longest code block
    return None  # no code found; skip this proposal

Cost Comparison

| Setup | Cost per experiment | 100 experiments |
|---|---|---|
| Original (Claude Code API) | ~$0.05-0.20 | $5-20 |
| This fork (Nosana Pro 6000) | $0.08 | ~$8 |
| This fork (own GPU) | $0 | $0 |

program.md — The Research Philosophy

Karpathy's original program.md spells out the design philosophy behind the loop.

This is the essence of autoresearch: let AI do research while you sleep.

Why Local LLM Matters

This fork demonstrates that:

- Qwen 3.5 9B (a 9B parameter model) can sustain autonomous ML research loops
- There are no rate limits or API costs, so the loop can run indefinitely
- Anyone with a 24GB+ GPU can automate their own research
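The agent's call to the local model can be sketched against ollama's standard /api/generate endpoint (the model tag matches the one pulled in the Setup section; error handling and streaming are omitted):

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # ollama's default port

def build_payload(prompt, model="qwen3.5:9b"):
    """Non-streaming request body for ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt,
                       "stream": False}).encode()

def ask_ollama(prompt, model="qwen3.5:9b"):
    """Send the prompt to the local ollama server, return the reply text."""
    req = urllib.request.Request(
        OLLAMA_URL,
        data=build_payload(prompt, model),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

Because the server runs on localhost, every iteration of the research loop is just a local HTTP round-trip rather than a metered API call.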

Setup

# Install ollama and pull the model
curl -fsSL https://ollama.com/install.sh | sh
ollama serve &
ollama pull qwen3.5:9b

# Clone and setup
git clone https://github.com/SohniSwatantra/autoresearch-local-llm.git
cd autoresearch-local-llm
pip install uv && uv sync

# Run
bash run_pipeline.sh

Requires 24GB+ VRAM (48GB recommended).
