I build auditable AI infrastructure: control planes, audit gates, and Blackwell (sm_121) GPU kernel debugging.

I committed patches to FlashInfer and CUTLASS for NVIDIA's sm_121 — the Blackwell variant on GB10. I built an agentic research system that generates AI research papers; all 388 canonical outputs currently pass my strict claim/evidence audit. I work at the seam between datacenter GPU hardware and the software that tries to pretend hardware doesn't matter.

Oklahoma City, OK

Hardware & Kernel Work

Work on NVIDIA's Grace Blackwell consumer architecture (sm_121 / DGX Spark) — LPDDR5X memory, blockscaled MMA paths, and TMA async data movement, distinct from H100/A100/H200/B200/B300 in ways that break most vendor kernels. Committed patches, documented Xid faults, and ran private lab campaigns.

FlashInfer SM121

Kernel debugging and patching for Blackwell

Debugged and patched FlashInfer for sm_121. When primary kernel paths produced Xid 13 and Xid 43 GPU faults, worked through the CuTe-DSL fallback path as a secondary route. Documented illegal instruction errors, misaligned addresses, and warp exceptions found in NVIDIA's kernel code running on sm_121. Built a systematic debugging campaign with variant matrices and regression runbooks for a platform whose architecture (LPDDR5X, blockscaled MMA, TMA) differs from H100/A100/H200/B200/B300.
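The variant-matrix idea behind the campaign fits in a few lines: enumerate every combination of kernel-path knobs so each Xid fault can be pinned to a specific cell. A minimal sketch — the axis names below are illustrative placeholders, not actual FlashInfer or CUTLASS flags:

```python
import itertools

def variant_matrix(axes):
    """Enumerate all combinations of debugging axes as a list of configs.

    `axes` maps an axis name to its candidate values; each returned dict is
    one cell of the matrix to run and log against (pass/fail/Xid code).
    """
    keys = sorted(axes)
    return [dict(zip(keys, combo))
            for combo in itertools.product(*(axes[k] for k in keys))]
```

Each cell then gets a regression entry in the runbook, so a fault reproduced once stays pinned to its exact configuration.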

CUTLASS Blockscaled TMA

Patches to NVIDIA's CUTLASS for sm_121 TMA operations

Modified NVIDIA's CUTLASS library for sm_121 blockscaled TMA operations — fixing how the Tensor Memory Accelerator loads data for the Blackwell blockscaled MMA path. Patches cover sm_100, sm_120, and sm_121 layout and builder headers.

Blackwell Inference Patches

vLLM, llama.cpp, and Mamba kernel work for GB10

Extended vLLM with SM120/SM121 compute capability mapping and MXFP4 backend detection. Added MXFP4/MoE tuning and BLACKWELL-OPT compilation flags to llama.cpp. Optimized Mamba SSM kernels: d_state reduction from 128→64 (27–34% faster on LPDDR5X), custom Triton kernels with explicit backward passes, and Nsight profiling harnesses with A/B testing across BF16/TF32/torch.compile modes.
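Why d_state dominates the scan's memory traffic is easiest to see in a toy version of the diagonal SSM recurrence. This numpy sketch is illustrative only — shapes are simplified and it is not the project's Triton kernel:

```python
import numpy as np

def ssm_scan(u, A, B, C, d_state):
    """Sequential diagonal-SSM scan over a (T, d_model) input.

    The recurrent state h is (d_model, d_state), so halving d_state
    (128 -> 64) halves the per-step state read/write traffic — the
    quantity the LPDDR5X tuning above targets.
    """
    T, d_model = u.shape
    h = np.zeros((d_model, d_state))
    ys = np.empty((T, d_model))
    for t in range(T):
        h = A * h + B * u[t][:, None]   # diagonal transition + input write
        ys[t] = (h * C).sum(axis=1)     # project state back to d_model
    return ys
```

On bandwidth-bound memory like LPDDR5X, that state traffic, not FLOPs, sets the scan's throughput ceiling.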

Project Squeegee

FP4 hotpatches and diffusion LM research

Seven FlashInfer SM121 FP4 hotpatches — workarounds for broken vendor code causing Xid 13 errors on GB10. Diffusion LM research pipeline with curriculum generation, education quality scoring, and reasoning parsers. Tracked and documented GPU crashes from vendor kernel code.

Work

Enoch

Agentic research control plane

Enoch treats autonomous AI failure modes as infrastructure problems: stale queues, hidden worker state, orphaned processes, GPU contention, scattered evidence, and reports that overstate results. It manages queue state, gates dispatch, supervises local AI runs, preserves evidence, and packages AI-generated research artifacts with provenance metadata and claim ledgers.

The goal is not to make autonomous AI look smarter. The goal is to make its work inspectable: what ran, when it ran, what evidence was captured, what claims were made, and where uncertainty remains.
Control Plane API
Queue state, dispatch decisions, project state, pause and maintenance controls
Wake Gate
Confirms a run is complete via process-tree tracking and CPU/GPU quiet-window telemetry
Worker Preflight
Authenticated health checks before dispatch — fails early rather than silently
Single-Lane Safety
Prevents overlapping GPU work on constrained hardware; control plane holds the lock
Evidence Sync
Copies run notes, metrics, evidence bundles, and claim ledgers before artifact generation
Artifact Writer
Generates publication-style reports from evidence, preserving uncertainty and provenance
Packaging/Provenance Gates
Separates formatting/provenance checks from the stricter claim/evidence audit before corpus entry
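The single-lane and wake-gate ideas compose into a small state machine: one run holds the lane at a time, and a run only counts as finished after a quiet window with no GPU activity. A sketch with assumed names and a stubbed telemetry probe — not Enoch's actual API:

```python
import time

class SingleLaneGate:
    """One GPU run at a time; release is gated on a telemetry quiet window."""

    def __init__(self, quiet_window_s, gpu_busy):
        self.quiet_window_s = quiet_window_s
        self.gpu_busy = gpu_busy            # callable -> bool (telemetry probe)
        self.holder = None
        self.last_busy_ts = time.monotonic()

    def poll(self):
        # Record the most recent moment the GPU was observed busy.
        if self.gpu_busy():
            self.last_busy_ts = time.monotonic()

    def try_acquire(self, run_id):
        # Dispatch only if no one holds the lane AND the GPU has been
        # quiet for the full window — fail early rather than silently.
        self.poll()
        quiet = time.monotonic() - self.last_busy_ts >= self.quiet_window_s
        if self.holder is None and quiet:
            self.holder = run_id
            return True
        return False

    def release(self, run_id):
        if self.holder == run_id:
            self.holder = None
```

The control plane, not the worker, holds this lock, so an orphaned or crashed worker cannot leave the lane silently occupied.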

My role is the infrastructure: the control plane, dispatch gates, telemetry, artifact packaging, provenance model, and release process. The Enoch corpus indexes 388 canonical AI-generated research artifacts after duplicate-slug cleanup and later imports; my own strict claim/evidence audit now passes all 388. That audit status is headlined on the project's front page — the audit gate is the product, not the papers. I do not claim personal authorship of the generated papers.

CouncilRouter

Multi-model AI deliberation proxy

CouncilRouter explores whether multi-model critique can reduce blind spots in complex reasoning, code review, and architecture decisions. It routes requests to 300+ externally hosted models via OpenRouter with multi-round peer review, code-aware synthesis, and a Devil's Advocate module that challenges consensus with critical analysis.

It treats consensus as a signal, not proof.
Deliberation Engine
Multi-round peer review across models with configurable rounds and graceful degradation
Code-Aware Synthesis
Detects code, compares functional equivalence, validates syntax, security, and error handling
Devil's Advocate
Challenges consensus with critical analysis at configurable intensity
Production Layer
PostgreSQL + Redis, REST API, JWT/API key auth, rate limiting, idempotency, SSE streaming
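The deliberation loop reduces to a few lines once the transport is stubbed out. In this sketch, `models` maps a name to a plain callable standing in for an OpenRouter-backed model — an assumption for illustration, not CouncilRouter's real interface:

```python
def deliberate(prompt, models, rounds=2):
    """Multi-round peer review: answer independently, then revise with peers."""
    # Round 1: independent answers.
    answers = {name: fn(prompt, []) for name, fn in models.items()}
    # Later rounds: each model sees its peers' previous answers.
    for _ in range(rounds - 1):
        peers = list(answers.values())
        answers = {name: fn(prompt, peers) for name, fn in models.items()}
    # Consensus is a signal, not proof: report agreement, keep the spread.
    agreed = len(set(answers.values())) == 1
    return {"answers": answers, "agreed": agreed}
```

The Devil's Advocate pass would slot in as one more callable whose job is to attack whatever the majority converged on, at configurable intensity.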

Architecture Research

CoSpec

Spectral initialization from token co-occurrence

Research on whether variance-matched spectral initialization from token co-occurrence matrices improves early training dynamics for GPT-2-style models. Completed study with controlled experiments across five conditions (baseline, e_only, h_only, e_plus_h, spectrum_random). Result: the e_only condition — initializing only the embedding matrix from co-occurrence spectra — beat baseline in both screening and confirmation runs. A Rust backend accelerates co-occurrence accumulation.
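The e_only condition can be illustrated with a minimal numpy sketch: factor a symmetrized co-occurrence matrix and rescale the leading spectral directions to a standard embedding-init variance. The log damping and the 0.02 target are assumptions for illustration, not the study's exact recipe:

```python
import numpy as np

def spectral_embedding_init(cooc, d_model, target_std=0.02):
    """Build a (vocab, d_model) embedding init from co-occurrence spectra."""
    # Symmetrize and damp raw counts before factoring.
    M = np.log1p(0.5 * (cooc + cooc.T))
    vals, vecs = np.linalg.eigh(M)
    # Keep the top-d_model spectral directions, scaled by sqrt(|eigenvalue|).
    order = np.argsort(vals)[::-1][:d_model]
    E = vecs[:, order] * np.sqrt(np.abs(vals[order]))
    # Variance-match to a GPT-2-style init scale so optimizer dynamics
    # are comparable to the baseline.
    return E * (target_std / E.std())
```

Variance matching is the controlled part of the experiment: only the *directions* come from the corpus; the overall scale matches the baseline init.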

Opus

Hybrid Mamba-2 SSM + differential attention

Hybrid architecture combining Mamba-2 SSM with differential shared attention (Attn₁ - λ·Attn₂) and LoRA depth adapters. 5:1 SSM-to-attention ratio. Benchmarked on GB10 with throughput measurements. FP8 training recipe adapted from DeepSeek-V3. Configurable model sizes from 125M to 7B.
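The differential term Attn₁ - λ·Attn₂ is simplest to see single-head and with a fixed λ. This numpy sketch is a simplification of the architecture above (no heads, no learned λ, no SSM interleaving), not its training code:

```python
import numpy as np

def diff_attention(q1, k1, q2, k2, v, lam=0.5):
    """Differential attention: subtract a second softmax map from the first.

    Two query/key projections produce two attention maps; subtracting
    lam * the second cancels common-mode attention noise.
    """
    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)   # numerical stability
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))
    return (a1 - lam * a2) @ v
```

When both projections see the same pattern, the subtraction zeroes it out — which is exactly the noise-cancellation intuition.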

Project Lattice

PaRT architecture and training infrastructure

PaRT (Patch-and-Refine Transformer): patch-based downsampling with cross-attention refinement, designed for memory-bandwidth-constrained hardware. Built LatticeDash, a real-time training dashboard with WebSocket streaming, convergence tracking, and curriculum learning. NVFP4 MLP precision experiments.

TurboQuant

Quantization research reproduction

Local reproduction and evaluation of Google's TurboQuant paper. Implemented MSE codec (random rotation + Lloyd-Max codebook), product codec (MSE + QJL residual), split codec (mixed-precision channel split), and grouped passkey scoring with KV-cache proxy evaluation.
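The Lloyd-Max step at the heart of the MSE codec is 1-D k-means: alternate assigning samples to their nearest codeword and moving codewords to cell means, which minimizes quantization MSE. This sketch shows the scalar core only — the random rotation and bit packing are omitted, and the quantile init is an assumption:

```python
import numpy as np

def lloyd_max_1d(x, levels, iters=50):
    """Fit a `levels`-entry scalar codebook to samples x; return (codebook, idx)."""
    # Initialize codewords on quantiles so every cell starts populated.
    codebook = np.quantile(x, np.linspace(0.0, 1.0, levels))
    idx = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        # Assign each sample to its nearest codeword...
        idx = np.abs(x[:, None] - codebook[None, :]).argmin(axis=1)
        # ...then move each codeword to the mean of its cell.
        for j in range(levels):
            if np.any(idx == j):
                codebook[j] = x[idx == j].mean()
    return codebook, idx
```

The product and split codecs build on the same primitive: quantize a residual or a channel subset with a second, smaller codebook.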

ArborealMoE

Recursive MoE with per-token halting — and why it failed

Experimental PyTorch architecture with a recursive mixture-of-experts tree, per-token adaptive halting, and path-conditioned LoRA specialization. At tiny scale, routing worked mechanically but specialization did not meaningfully emerge — the model converged toward shallow, uniform routing. Documented routing collapse, expert capacity constraints, and scale recommendations for future runs. Published for the negative result, not the architecture.
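Per-token adaptive halting can be sketched ACT-style: each token accumulates a halting probability per recursion level and stops once the cumulative mass crosses a threshold. A toy numpy version, not the project's PyTorch code — the collapse documented above shows up here as every token halting at the same shallow depth:

```python
import numpy as np

def halting_depths(halt_probs, threshold=0.99):
    """Return per-token halting depths from (tokens, max_depth) halt probs."""
    cum = np.cumsum(halt_probs, axis=1)      # cumulative halting mass
    halted = cum >= threshold
    # First depth at which each token crosses the threshold,
    # or max_depth if it never does.
    return np.where(halted.any(axis=1),
                    halted.argmax(axis=1) + 1,
                    halt_probs.shape[1])
```

Uniform, shallow depths across a whole batch are the routing-collapse signature the negative result reports.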

Engineering Focus

AI Infrastructure Control Planes
Dispatch safety, GPU worker state, telemetry, evidence sync, artifact provenance
Enterprise AI Operations
Large-scale datacenter deployment, Day 2 operations, triage, failure analysis, process development
NVIDIA Accelerated Systems
Grace Blackwell (sm_121) kernel debugging, CUTLASS TMA patches, FlashInfer CuTe-DSL fixes, MXFP4/MoE inference, TensorRT-LLM benchmarking, constrained local GPU operations
Local-First Infrastructure
Proxmox, OPNsense, ZFS, monitoring, alerting, backup validation, recovery automation, security hardening

Operational Background

Contact

Personal site and independent projects. Views and work are my own.