INT8 Q/DQ Calibration on Blackwell: 1.8× the TRT 10 + FP16 Baseline

TL;DR

On Blackwell (RTX 5090) + TensorRT 11, proper INT8 post-training quantization delivers 71k NPS — 1.8× the previous TRT 10 + auto-FP16 baseline on the same model, with no measurable strength loss.

The recipe is short:

1. Build a calibration set that matches your deployment distribution (here: 1,500 stratified positions from one year of public game records).
2. Hand it to NVIDIA ModelOpt with quantize_mode='int8'. Wall time: 56 seconds.
3. Hand the resulting Q/DQ ONNX to TensorRT 11. Done.

The case study is a commercial shogi evaluation network (dlshogi + the paid Kanade model). The same workflow applies to any FP32 ONNX you cannot retrain.

Headline numbers, same engine binary, same hardware:

| Build | Median NPS |
|---|---|
| TRT 10 + BuilderFlag::kFP16 (the previous baseline) | 39.5k |
| TRT 11 + INT8 Q/DQ ONNX (this post) | 71k sustained, 384k cache-hit bursts |

This post is the engineering writeup of that second row.

---

1. Why this is the right path on Blackwell, not a workaround

TRT 11 ships with strongly typed networks as the default. From the official docs:

> Use explicit quantization with Strong Typing. Precision-control build flags are not required and should not be specified.

BuilderFlag::kFP16, kINT8, kBF16, kFP8, and the IInt8EntropyCalibrator2 C++ API are gone. NVIDIA's deliberate position: the precision decision belongs in the model graph, not in a builder flag picked at engine-build time.

On Blackwell specifically, this matters because the 5th-generation Tensor Core has dedicated INT8 and FP8 paths that simply do not get selected from a flag-tagged FP32 network on sm_120. The kernels TRT 11 picks for an INT8 Q/DQ ONNX are the kernels that map to the actual silicon. An auto-converted FP32-with-FP16-flag path doesn't see them.

So the engineering plan on Blackwell, for any FP32 ONNX you need to deploy, is:

- Generate a Q/DQ ONNX with explicit quantize/dequantize nodes around the layers you want quantized.
- Calibrate it against representative data.
- Let TRT 11 build it.

That's what the rest of this post implements.

---

2. The setup

- Engine: dlshogi (upstream, May 2026 snapshot)
- Model: Kanade — a paid, competition-grade ResNet-style policy/value network (115 conv layers, 188 MB FP32 ONNX, opset 17)
- Track record: WCSC35 6th, Denryu-sen TSEC 3rd, Kaiochi-sen 5th
- Hardware: RTX 5090 (Blackwell, sm_120, 32 GB GDDR7), Core Ultra 9 285K, 64 GB DDR5
- OS: Windows 11 + WSL2 Ubuntu 24.04 (55 GB allocated to WSL)
- Stack: CUDA 13.2, TensorRT 11.0.0.114, driver 595.79

Previous baseline on the same machine: an internal build (shogi4) using TensorRT 10.15 + BuilderFlag::kFP16, hitting ~39.5k NPS with a confirmed 78% winrate against the previous-generation Suisho on a small private CSA server. That's the bar to clear.

---

3. The calibration set

INT8 quality is decided here, not in the quantization call. The quantizer picks per-layer INT8 scales from histograms of the activation values it observes. Garbage histograms (a narrow set of opening positions, no endgame coverage, only one engine's playing style) produce a quantized model that's fast but plays poorly on the long tail.

The asset we used: ~/shogi_book/extracted/2026/, 37,897 CSA-format game records from this year's Floodgate runs. Multi-engine, multi-style, exactly the distribution our deployed engine will see in production.

Target: 1,500 unique positions, stratified by game phase. Specifically:

- 25% from the opening (plies 10–30, skipping the first 10 which are always book-standard),
- 50% from the middlegame (plies 30–80, where activation ranges are widest),
- 25% from the endgame (plies 80+, skipping the last 5 to avoid near-mate positions whose policy logits are degenerate).

import glob
import os
import random

import cshogi
import numpy as np
from cshogi import dlshogi as dl

TARGET = 1500  # total unique positions to collect

# 250 random games. glob does not expand '~', so expand it explicitly.
pattern = os.path.expanduser(
    '~/shogi_book/extracted/2026/wdoor+floodgate-300-10F+*.csa')
csa_files = random.sample(glob.glob(pattern), 250)

def pick(lo, hi, k):
    """Sample up to k distinct plies from [lo, hi); fewer if the window is short."""
    window = range(lo, hi)
    return random.sample(window, min(k, len(window)))

def stratified_plies(n):
    """Per-game quota: 2 opening, 4 middlegame, 2 endgame plies (25/50/25)."""
    plies = []
    if n > 35:
        plies += pick(10, min(30, n - 5), 2)
        plies += pick(30, min(80, n - 5), 4)
    if n > 82:
        plies += pick(80, n - 5, 2)   # skip the last 5 plies
    return set(plies)

samples_f1, samples_f2, seen = [], [], set()

for csa in csa_files:
    if len(samples_f1) >= TARGET:
        break                         # stop scanning games once the set is full
    parser = cshogi.Parser()
    parser.parse_csa_file(csa)
    moves = parser.moves
    if len(moves) < 30:
        continue
    targets = stratified_plies(len(moves))
    board = cshogi.Board()
    for i, m in enumerate(moves):
        sfen = board.sfen()
        if i in targets and sfen not in seen:   # deduplicate by position
            seen.add(sfen)
            f1 = np.zeros((dl.FEATURES1_NUM, 9, 9), dtype=np.float32)
            f2 = np.zeros((dl.FEATURES2_NUM, 9, 9), dtype=np.float32)
            dl.make_input_features(board, f1, f2)
            samples_f1.append(f1)
            samples_f2.append(f2)
        board.push(m)
        if len(samples_f1) >= TARGET:
            break

np.savez('kanade_calib.npz',
         input1=np.stack(samples_f1),
         input2=np.stack(samples_f2))

Output: 1,500 unique positions from 215 different games, 58 MB .npz. Generation: under a second on this machine.

Two design notes worth pulling out:

- Uniqueness matters more than count. Many games share the first 8–10 moves; deduplicating by SFEN strips out the bias toward textbook openings.
- Phase stratification beats pure random. Random sampling from a large game corpus over-weights opening positions, because every game has an opening but not every game reaches a 100-ply endgame. Stratifying explicitly forces the histograms to cover the activation ranges the model actually sees mid- and endgame.
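As a rough sanity check on the stratification, the phase mix of a set of sampled ply indices can be tallied. The sketch below is a minimal, standalone illustration: `phase_of` and `phase_mix` are hypothetical helper names, the boundaries (10, 30, 80) come from the phase definitions in this section, and the example ply numbers are made up.

```python
from collections import Counter

def phase_of(ply):
    """Map a ply index to a game phase using this section's boundaries."""
    if ply < 10:
        return "book"        # skipped by the sampler
    if ply < 30:
        return "opening"
    if ply < 80:
        return "middlegame"
    return "endgame"

def phase_mix(plies):
    """Return the fraction of sampled plies that falls in each phase."""
    counts = Counter(phase_of(p) for p in plies)
    total = sum(counts.values())
    return {phase: counts[phase] / total for phase in counts}

# One long game with the per-game 2 / 4 / 2 quota (ply numbers are illustrative)
example = [12, 27, 31, 44, 58, 79, 85, 101]
print(phase_mix(example))
```

Games of 82 plies or fewer contribute no endgame samples, so the realized mix across a whole corpus can drift away from 25/50/25. Running a tally like this over the final set is a cheap way to see by how much.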

---

4. The quantization step

NVIDIA's TensorRT Model Optimizer is the supported way to do this. It's the replacement for the C++ IInt8EntropyCalibrator2 API that got removed in TRT 11.

import modelopt.onnx.quantization as moq
import numpy as np

data = np.load('kanade_calib.npz')

moq.quantize(
    onnx_path='model.onnx',
    quantize_mode='int8',
    calibration_data={'input1': data['input1'], 'input2': data['input2']},
    calibration_method='entropy',     # KL-divergence-optimal scales
    calibration_eps=['cuda:0'],       # run calibration forward passes on GPU
    high_precision_dtype='fp16',      # non-quantized parts in FP16
    output_path='model_int8.onnx',
)

What this actually does:

1. Runs all 1,500 calibration samples through the FP32 ONNX, collecting per-tensor activation histograms.
2. For each Conv-class layer, picks an INT8 scale that minimizes the KL divergence between the original FP32 distribution and the quantized one.
3. Inserts QuantizeLinear / DequantizeLinear (Q/DQ) ops around each quantized layer in the ONNX graph.
4. Leaves precision-sensitive layers (Softmax, LayerNorm, reductions) in high_precision_dtype — FP16 in this configuration.
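To make step 3 concrete, the arithmetic that a Q/DQ pair performs can be sketched in a few lines. This is a minimal illustration of symmetric per-tensor INT8 quantization with a zero point of 0, not ModelOpt's implementation. `quantize_linear`, `dequantize_linear`, and the example scale are illustrative names and values.

```python
def quantize_linear(x, scale):
    """Symmetric INT8 quantize: divide by scale, round half to even, saturate."""
    q = round(x / scale)              # Python's round() rounds halves to even
    return max(-128, min(127, q))     # saturate to the INT8 range

def dequantize_linear(q, scale):
    """Map the INT8 code back to a real value."""
    return q * scale

scale = 0.5                           # illustrative; calibration picks this per tensor
for x in (1.0, 1.25, -3.0, 100.0):
    q = quantize_linear(x, scale)
    print(x, q, dequantize_linear(q, scale))
```

Any value beyond 127 × scale saturates, while a larger scale coarsens everything below it. That trade-off between clipping and resolution is what the calibration histograms are used to settle.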

Output: 94 MB ONNX, 83 of 368 nodes quantized, the rest kept in FP16. Total wall time: 56 seconds on this machine.

The resulting graph is a mixed-precision network: bulk Conv math runs in INT8 (cheap, Tensor-Core-native on Blackwell), and the Q/DQ boundaries hand activations back to FP16 for the precision-critical heads. This is conceptually what TRT 10's auto-FP16 was trying to approximate at build time, but now it's explicit, reproducible, and uses kernels that didn't exist on pre-Blackwell silicon.
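One way to sanity-check a quantized graph before handing it to TensorRT is to tally its node types and confirm that QuantizeLinear and DequantizeLinear nodes are actually present. The sketch below assumes the `onnx` Python package is installed; `summarize_ops` and `load_op_types` are hypothetical helper names, and `model_int8.onnx` is the output path from the quantization call in this section.

```python
from collections import Counter

def summarize_ops(op_types):
    """Tally ONNX op types and pull out the Q/DQ counts."""
    counts = Counter(op_types)
    return {
        "total_nodes": sum(counts.values()),
        "QuantizeLinear": counts["QuantizeLinear"],
        "DequantizeLinear": counts["DequantizeLinear"],
    }

def load_op_types(path):
    """Read the op type of every node in an ONNX file (requires onnx)."""
    import onnx                       # imported lazily so the tally works without it
    model = onnx.load(path)
    return [node.op_type for node in model.graph.node]

# Usage on the quantized model (uncomment where onnx and the file are available):
# print(summarize_ops(load_op_types("model_int8.onnx")))

# Toy op list standing in for a real graph
print(summarize_ops(["QuantizeLinear", "DequantizeLinear", "Conv", "Relu"]))
```

The Q/DQ node counts will not necessarily match the "quantized nodes" figure reported by the tool, since a single quantized layer can carry more than one Q/DQ pair. A count of zero, however, is a clear sign that something went wrong.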

---

5. The bench

UCT_Threads=6, DNN_Batch_Size=128, UCT_NodeLimit=30M (more on that below), 30-second think on the startpos:

info nps 71320 time 7063 nodes 503739 hashfull 50 score cp 250 depth 27 ...
info nps 70846 time 10978 nodes 777915 hashfull 77 score cp 251 depth 30 ...
bestmove 2g2f

Depth 30 in 11 seconds, +250 cp opening evaluation (a sensible read on the startpos), and 2g2f as the chosen move. The model behaves like itself, just much faster.

A note on UCT_NodeLimit. Yamaoka's well-known recommendation is to leave this at the 10M default because of float-precision saturation in the visit-count accumulator (his original post, 2020). That recommendation predates the WIN_TYPE_DOUBLE build flag, which is now enabled in our build:

#ifdef WIN_TYPE_DOUBLE
typedef double WinType;   // <-- this build
#else
typedef float WinType;
#endif

double pushes the precision wall from ~1.7×10^7 visits (2^24, the point where a float can no longer represent every integer) to ~9×10^15 (2^53). The 10M-node cap is no longer the binding constraint on a WIN_TYPE_DOUBLE build; memory is. We picked 30M (≈12 GB host RAM with this MCTS implementation) as the sweet spot: comfortably under WSL's 55 GB cap and easily within the precision budget of a double accumulator.
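The precision wall can be reproduced without the engine. The sketch below rounds values to IEEE 754 single precision with the standard `struct` module and shows that adding 1 stops changing the value at 2^24, while a double keeps counting exactly until 2^53. `to_f32` is a hypothetical helper, and modeling the accumulator as a counter that grows by 1 per visit is a simplifying assumption.

```python
import struct

def to_f32(x):
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

F32_LIMIT = 2 ** 24    # 16,777,216: float32 has a 24-bit significand
F64_LIMIT = 2 ** 53    # about 9.0e15: double has a 53-bit significand

# float32: the increment is lost once the accumulator reaches 2**24
print(to_f32(F32_LIMIT - 1 + 1.0) == F32_LIMIT)     # just below the limit, +1 is exact
print(to_f32(F32_LIMIT + 1.0) == F32_LIMIT)         # at the limit, +1 is rounded away

# double: the same effect only appears at 2**53
print(float(F32_LIMIT) + 1.0 == F32_LIMIT + 1)      # exact, far below the limit
print(float(F64_LIMIT) + 1.0 == float(F64_LIMIT))   # at the limit, +1 is rounded away
```

All four comparisons evaluate to True. The 10M default sits just under the float32 limit of about 16.7M, which is consistent with the original recommendation.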

---

6. Live results on Floodgate

We deployed the INT8 build on the public Floodgate server (wdoor.c.u-tokyo.ac.jp:4081) as miao2-dlshogi for a 24-hour run:

- Total: 33 games, 17W-16L (~51% winrate against the public field).
- First 5 games: 5-0, including beating Suisho11-V940dev-peta1115 (latest Suisho dev build) twice and tanuki_wcsc33_473stb_1000k.
- Notable wins: Suisho11 dev (×2), test768 (×4), AKATSUKI_Hyohga, AobaNNUE_N150, clomeprop, C2AC562E-YO-NNUE.

The mid-tier of Floodgate is brutal — every engine there is either a real competition entry or someone's well-tuned dev build. Holding ~51% from a cold start, with no opening book and no engine-specific tuning, is a clean signal that the quantization didn't degrade the model.

(There's also an operational point hiding in the 24-hour result: ReuseSubtree=true keeps the MCTS tree alive across games, and on a long run the asynchronous tree GC slowly accumulates residual memory. WSL gets killed around the 16–17 hour mark. Practical fix: restart the client process every 8 hours. That's a runbook item, not a correctness issue.)
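The periodic restart can be automated with a small supervisor. The sketch below is a minimal illustration using only the standard library. `run_with_restarts` is a hypothetical helper, and the client command in the usage comment is a placeholder, not the invocation used for this run.

```python
import subprocess

def run_with_restarts(cmd, max_runtime_s, cycles):
    """Run cmd, stopping and relaunching it every max_runtime_s seconds.

    Returns how many times the supervisor had to stop the process itself.
    """
    forced_stops = 0
    for _ in range(cycles):
        proc = subprocess.Popen(cmd)
        try:
            proc.wait(timeout=max_runtime_s)    # process exited on its own
        except subprocess.TimeoutExpired:
            proc.terminate()                    # ask it to exit
            try:
                proc.wait(timeout=10)
            except subprocess.TimeoutExpired:
                proc.kill()                     # force it if it does not comply
                proc.wait()
            forced_stops += 1
    return forced_stops

# Hypothetical usage: relaunch the client every 8 hours, three times per day.
# run_with_restarts(["./floodgate_client.sh"], max_runtime_s=8 * 3600, cycles=3)
```

A fixed timer can cut a game off mid-search, so a production version would more likely wait for the current game to finish before restarting. That logic depends on the client and is left out here.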

---

7. What this post is actually advocating

If you have an FP32 ONNX you cannot retrain (a paid model, a customer-supplied weights file, an older internal artifact) and you're deploying on Blackwell, the workflow is:

- Build a calibration set from your actual deployment distribution. Stratify it. Deduplicate it.
- modelopt.onnx.quantization.quantize(..., quantize_mode='int8', calibration_method='entropy').
- Hand the result to TRT 11 with the now-trivial code path.

That's it. The hardware does the rest.

The two anti-patterns to skip:

- Casting weights to FP16 / BF16 without calibration in the hope that "TRT will figure it out" the way TRT 10's auto-FP16 flag used to. On Blackwell, this leaves the INT8/FP8 Tensor Core paths unused, and for any FP32 model with non-trivial activation ranges (anything with softmax-class heads in particular) it also risks correctness on the FP16 side.
- Calibrating on toy data. The histograms are what set the per-layer scales. Quality of the calibration set is the single biggest knob in the entire pipeline.

Next step on this engine is the obvious one: change quantize_mode='int8' to 'fp8' and rerun. Blackwell's FP8 Tensor Core path is designed for exactly this workflow, and the calibration set is already on disk. We'll write up that result separately once it has Floodgate hours on it.

---

If you're doing similar INT8/FP8 work on a competition-grade shogi or game-AI model — or any FP32 ONNX on Blackwell — happy to compare calibration designs and results.