I committed patches to FlashInfer and CUTLASS for NVIDIA's sm_121 — the Blackwell variant on GB10. I built an agentic research system that generates AI research papers; all 388 of its canonical outputs currently pass my claim/evidence audit. I work at the seam between datacenter GPU hardware and the software that tries to pretend hardware doesn't matter.
Oklahoma City, OK
Work on NVIDIA's Grace Blackwell consumer architecture (sm_121 / DGX Spark) — LPDDR5X memory, blockscaled MMA paths, and TMA async data movement, distinct from H100/A100/H200/B200/B300 in ways that break most vendor kernels. Committed patches, documented Xid faults, and ran private lab campaigns.
Debugged and patched FlashInfer for sm_121. When primary kernel paths produced Xid 13 and Xid 43 GPU faults, worked through the CuTe-DSL fallback path as a secondary route. Documented illegal instruction errors, misaligned addresses, and warp exceptions found in NVIDIA's kernel code running on sm_121. Built a systematic debugging campaign with variant matrices and regression runbooks for a platform whose LPDDR5X memory, blockscaled MMA, and TMA paths differ from the datacenter parts.
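A variant matrix for this kind of campaign is just the cross product of the axes you suspect matter, so every kernel configuration gets run once and its pass/Xid outcome recorded. A minimal sketch — the axes and values below are illustrative, not the campaign's actual matrix:

```python
from itertools import product

# Hypothetical axes for a kernel-debug variant matrix; the real campaign's
# axes and values are not reproduced here.
AXES = {
    "dtype":    ["fp4", "fp8", "bf16"],
    "path":     ["primary", "cute_dsl_fallback"],
    "head_dim": [64, 128],
    "use_tma":  [True, False],
}

def variant_matrix(axes):
    """Yield every combination of axis values as a dict, so each kernel
    variant can be launched once and its result logged in the runbook."""
    keys = list(axes)
    for values in product(*(axes[k] for k in keys)):
        yield dict(zip(keys, values))

variants = list(variant_matrix(AXES))
print(len(variants))  # 3 * 2 * 2 * 2 = 24 combinations
```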
Modified NVIDIA's CUTLASS library for sm_121 blockscaled TMA operations — fixing how the Tensor Memory Accelerator loads data for the Blackwell blockscaled MMA path. Patches cover sm_100, sm_120, and sm_121 layout and builder headers.
Extended vLLM with SM120/SM121 compute capability mapping and MXFP4 backend detection. Added MXFP4/MoE tuning and BLACKWELL-OPT compilation flags to llama.cpp. Optimized Mamba SSM kernels: d_state reduction from 128→64 (27–34% faster on LPDDR5X), custom Triton kernels with explicit backward passes, and Nsight profiling harnesses with A/B testing across BF16/TF32/torch.compile modes.
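The d_state reduction pays off because the SSM scan is bandwidth-bound on LPDDR5X: the recurrent state is read and written every step, so its size is a direct multiplier on memory traffic. A back-of-envelope sketch with hypothetical model dimensions (not the actual tuned configuration):

```python
def ssm_state_bytes(batch, d_inner, d_state, bytes_per_elem=2):
    """Rough per-step SSM recurrent-state traffic: the (d_inner, d_state)
    state is read and written each scan step. Assumes BF16 (2 bytes)."""
    return batch * d_inner * d_state * bytes_per_elem * 2  # read + write

# Hypothetical dims for illustration only.
before = ssm_state_bytes(batch=1, d_inner=4096, d_state=128)
after  = ssm_state_bytes(batch=1, d_inner=4096, d_state=64)
print(before // after)  # halving d_state halves state traffic → prints 2
```

On a bandwidth-bound part, that factor-of-two traffic cut is the first-order explanation for the measured speedup; the remaining gap to 2x comes from compute and other traffic that d_state does not touch.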
Seven FlashInfer sm_121 FP4 hotpatches — workarounds for broken vendor code causing Xid 13 errors on GB10. Diffusion LM research pipeline with curriculum generation, education quality scoring, and reasoning parsers. Tracked and documented GPU crashes from vendor kernel code.
Enoch treats autonomous AI failure modes as infrastructure problems: stale queues, hidden worker state, orphaned processes, GPU contention, scattered evidence, and reports that overstate results. It manages queue state, gates dispatch, supervises local AI runs, preserves evidence, and packages AI-generated research artifacts with provenance metadata and claim ledgers.
My role is the infrastructure: the control plane, dispatch gates, telemetry, artifact packaging, provenance model, and release process. The Enoch corpus indexes 388 canonical AI-generated research artifacts after duplicate-slug cleanup and later imports; my own strict claim/evidence audit now passes all 388. That audit status is headlined on the project's front page — the audit gate is the product, not the papers. I do not claim personal authorship of the generated papers.
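The shape of that audit gate can be sketched in a few lines: a claim passes only if it cites evidence the artifact actually ships. This is a toy in the spirit of the gate — the field names (`claims`, `evidence_ids`, `cites`) are illustrative, not Enoch's real schema:

```python
# Toy claim/evidence audit gate; field names are illustrative, not the
# real Enoch artifact schema.
def audit(artifact):
    """Pass only if every claim cites at least one evidence id that the
    artifact actually contains; uncited or overstated claims fail."""
    have = set(artifact["evidence_ids"])
    return all(c["cites"] and set(c["cites"]) <= have
               for c in artifact["claims"])

paper = {"evidence_ids": ["run-1", "log-7"],
         "claims": [{"text": "kernel variant passes", "cites": ["run-1"]},
                    {"text": "no regressions", "cites": ["log-7"]}]}
print(audit(paper))  # True

overstated = {"evidence_ids": ["run-1"],
              "claims": [{"text": "beats SOTA", "cites": []}]}
print(audit(overstated))  # False: claim ships no evidence
```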
CouncilRouter explores whether multi-model critique can reduce blind spots in complex reasoning, code review, and architecture decisions. It routes requests to 300+ externally-hosted models via OpenRouter with multi-round peer review, code-aware synthesis, and a Devil's Advocate module that challenges consensus with critical analysis.
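The control flow behind multi-round critique is simple even though the models are remote. A minimal sketch with stub critics standing in for externally-hosted models — the real system routes through OpenRouter, and the synthesis and round structure here are toy placeholders, not CouncilRouter's actual logic:

```python
# Stub critics stand in for externally hosted models; synthesis here is a
# toy (keep the fullest review), not the real code-aware synthesis.
def run_council(prompt, critics, devil, rounds=2):
    """Multi-round peer review: each round, every critic revises the
    current draft, a synthesis step merges the reviews, and a
    devil's-advocate pass challenges the consensus."""
    draft = prompt
    for _ in range(rounds):
        reviews = [c(draft) for c in critics]
        draft = max(reviews, key=len)  # toy synthesis step
        draft = devil(draft)           # challenge consensus each round
    return draft

critics = [lambda d: d + " [reviewed]", lambda d: d + " [checked twice]"]
devil = lambda d: d + " [challenged]"
out = run_council("design", critics, devil, rounds=1)
print(out)  # design [checked twice] [challenged]
```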
Research on whether variance-matched spectral initialization from token co-occurrence matrices improves early training dynamics for GPT-2-style models. Completed study with controlled experiments across five conditions (baseline, e_only, h_only, e_plus_h, spectrum_random). Result: the e_only condition — initializing only the embedding matrix from co-occurrence spectra — beat baseline in both screening and confirmation runs. A Rust backend accelerates co-occurrence accumulation.
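The e_only condition boils down to: take the top spectral components of a symmetric co-occurrence matrix, use them as embedding rows, and rescale so the element-wise variance matches a standard GPT-2-style init. A minimal numpy sketch under those assumptions — the exact spectral weighting and target scale in the study may differ:

```python
import numpy as np

def spectral_embedding_init(cooc, d_model, target_std=0.02, seed=0):
    """Initialize an embedding matrix from the top-d_model eigenvectors of
    a symmetric token co-occurrence matrix, then rescale so its element-wise
    std matches a GPT-2-style init (variance matching)."""
    sym = (cooc + cooc.T) / 2.0                 # symmetrize counts
    eigvals, eigvecs = np.linalg.eigh(sym)      # ascending eigenvalues
    # Weight the top eigenvectors by sqrt of their eigenvalue magnitude.
    top = eigvecs[:, -d_model:] * np.sqrt(np.abs(eigvals[-d_model:]))
    return top * (target_std / top.std())       # variance matching

rng = np.random.default_rng(0)
counts = rng.poisson(3.0, size=(64, 64)).astype(float)  # toy co-occurrence
E = spectral_embedding_init(counts, d_model=16)
print(E.shape)  # (64, 16), element-wise std == 0.02 by construction
```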
Hybrid architecture combining Mamba-2 SSM with differential shared attention (Attn₁ - λ·Attn₂) and LoRA depth adapters. 5:1 SSM-to-attention ratio. Benchmarked throughput on GB10. FP8 training recipe adapted from DeepSeek-V3. Configurable model sizes from 125M to 7B.
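The Attn₁ - λ·Attn₂ term computes two attention maps and subtracts the second, scaled by λ, before applying the result to V; the subtraction cancels attention noise common to both maps. A single-head numpy sketch of that mechanism (head layout, λ parameterization, and the "shared" aspect of the real blocks are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Differential attention: form two attention maps and apply their
    difference (Attn1 - lam * Attn2) to V, cancelling common-mode
    attention noise shared by both maps."""
    d = q1.shape[-1]
    a1 = softmax(q1 @ k1.T / np.sqrt(d))
    a2 = softmax(q2 @ k2.T / np.sqrt(d))
    return (a1 - lam * a2) @ v

rng = np.random.default_rng(0)
T, d = 8, 16
q1, k1, q2, k2 = (rng.standard_normal((T, d)) for _ in range(4))
v = rng.standard_normal((T, d))
out = differential_attention(q1, k1, q2, k2, v)
print(out.shape)  # (8, 16)
```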
PaRT (Patch-and-Refine Transformer): patch-based downsampling with cross-attention refinement, designed for memory-bandwidth-constrained hardware. Built LatticeDash, a real-time training dashboard with WebSocket streaming, convergence tracking, and curriculum learning. NVFP4 MLP precision experiments.
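The core PaRT idea: pool fixed-size token patches to shrink the sequence (and memory traffic), then let the coarse patches cross-attend back to the full-resolution tokens to recover detail. A toy numpy sketch of that two-step shape — pooling choice, projections, and normalization in the real model are omitted:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def patch_downsample_refine(tokens, patch=4):
    """Mean-pool fixed-size token patches to cut sequence length, then
    cross-attend the coarse patches over the original tokens to refine
    them. Q/K/V projections are omitted for brevity."""
    T, d = tokens.shape
    coarse = tokens.reshape(T // patch, patch, d).mean(axis=1)  # (T/p, d)
    attn = softmax(coarse @ tokens.T / np.sqrt(d))  # patches attend to tokens
    return coarse + attn @ tokens                   # residual refinement

x = np.random.default_rng(0).standard_normal((16, 8))
y = patch_downsample_refine(x, patch=4)
print(y.shape)  # (4, 8): 4x shorter sequence, same width
```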
Local reproduction and evaluation of Google's TurboQuant paper. Implemented MSE codec (random rotation + Lloyd-Max codebook), product codec (MSE + QJL residual), split codec (mixed-precision channel split), and grouped passkey scoring with KV-cache proxy evaluation.
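The MSE codec's two ingredients are standard: a random orthogonal rotation spreads energy evenly across coordinates, and a Lloyd-Max quantizer (1-D k-means) fits a scalar codebook to the rotated values. A minimal numpy sketch of both — codebook size, initialization, and the residual/split stages of the other codecs are not reproduced:

```python
import numpy as np

def lloyd_max_1d(x, k=4, iters=20):
    """Lloyd-Max scalar quantizer: alternate nearest-codeword assignment
    and centroid update (1-D k-means) until the codebook settles."""
    code = np.quantile(x, np.linspace(0.1, 0.9, k))  # spread initial codewords
    for _ in range(iters):
        idx = np.abs(x[:, None] - code[None, :]).argmin(axis=1)
        for j in range(k):
            if (idx == j).any():
                code[j] = x[idx == j].mean()
    return code, idx

def random_rotation(d, seed=0):
    """Random orthogonal matrix via QR; rotating first spreads energy
    evenly so one shared scalar codebook fits all coordinates."""
    g = np.random.default_rng(seed).standard_normal((d, d))
    q, r = np.linalg.qr(g)
    return q * np.sign(np.diag(r))  # sign fix makes the factorization unique

rng = np.random.default_rng(1)
vec = rng.standard_normal(256)
rot = random_rotation(256) @ vec
code, idx = lloyd_max_1d(rot, k=4)
mse = float(np.mean((rot - code[idx]) ** 2))
print(mse < np.var(rot))  # quantization error beats the zero-codebook baseline
```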
Experimental PyTorch architecture with a recursive mixture-of-experts tree, per-token adaptive halting, and path-conditioned LoRA specialization. At tiny scale, routing worked mechanically but specialization did not meaningfully emerge — the model converged toward shallow, uniform routing. Documented routing collapse, expert capacity constraints, and scale recommendations for future runs. Published for the negative result, not the architecture.
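Per-token adaptive halting in this style accumulates a halting probability over recursion depth and stops each token once the running sum crosses a threshold, with stragglers forced out at the final step. A small numpy sketch of that ACT-style control flow — the threshold and the learned halting head are placeholders:

```python
import numpy as np

def adaptive_halting(halt_probs, threshold=0.99):
    """ACT-style per-token halting: accumulate halting probability over
    recursion depth; each token stops at the first step where the running
    sum crosses the threshold, else at the final step."""
    T, depth = halt_probs.shape
    halted_at = np.full(T, depth - 1)   # default: run to the last step
    cum = np.zeros(T)
    active = np.ones(T, dtype=bool)
    for step in range(depth):
        cum[active] += halt_probs[active, step]
        just_halted = active & (cum >= threshold)
        halted_at[just_halted] = step
        active &= ~just_halted
    return halted_at

# Token 0 halts immediately (large step-0 prob); token 1 never crosses
# the threshold, so it runs to the final step.
probs = np.array([[0.995, 0.00, 0.00],
                  [0.30,  0.30, 0.30]])
print(adaptive_halting(probs))  # [0 2]
```

Routing collapse shows up in exactly this kind of trace: when specialization fails to emerge, the halting and routing distributions flatten and every token takes the same shallow path.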
Personal site and independent projects. Views and work are my own.