Benchmarks & Acceleration Research

Evals

Performance benchmarks for every CuteDSL-accelerated model and a breakdown of each acceleration technique we're researching and shipping.

CuteChronos2

SOTA time series forecasting — RTX 5090, B=1, context=768, base model (768 d_model, 12 layers), H=16

| Configuration | Mode | Latency | Speedup | Memory |
|---|---|---|---|---|
| Original Chronos2Pipeline | Baseline | 41.98 ms | 1.0x | 248 MB |
| CuteChronos2 (eager) | Eager | 19.15 ms | 2.19x | 247 MB |
| CuteChronos2 (torch.compile) | Compiled | 1.55 ms | 27.1x | 237 MB |
| CuteChronos2 + TQ4 (compiled) | Quantized | 16.40 ms | 2.56x | ~200 MB |

Custom Kernels (8 Triton + 1 CUDA)

Unscaled Tiled Attention

triton_kernels/attention.py

FlashAttention-style tiling without the 1/sqrt(d_k) scaling factor. Avoids materializing the S×S attention matrix; softmax runs in FP32 for numerical stability.
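A minimal NumPy sketch of the online-softmax tiling this kernel implements (the function name and block size are illustrative, not the kernel's actual API). Keys and values are streamed block by block while a running max and denominator keep the softmax exact, so the full S×S score matrix never exists in memory:

```python
import numpy as np

def unscaled_tiled_attention(q, k, v, block=64):
    """FlashAttention-style streaming over K/V blocks with an online
    softmax, without the usual 1/sqrt(d_k) scaling. All softmax math
    runs in FP32; the S x S score matrix is never materialized."""
    s, _ = q.shape
    qf = q.astype(np.float32)
    out = np.zeros(q.shape, dtype=np.float32)
    m = np.full(s, -np.inf, dtype=np.float32)   # running row-wise max
    l = np.zeros(s, dtype=np.float32)           # running softmax denominator
    for j in range(0, s, block):
        scores = qf @ k[j:j + block].astype(np.float32).T  # unscaled QK^T tile
        m_new = np.maximum(m, scores.max(axis=1))
        alpha = np.exp(m - m_new)               # rescale earlier partial sums
        p = np.exp(scores - m_new[:, None])
        l = l * alpha + p.sum(axis=1)
        out = out * alpha[:, None] + p @ v[j:j + block].astype(np.float32)
        m = m_new
    return out / l[:, None]
```

The real Triton kernel does the same accumulation per program instance, with Q tiles resident in registers/shared memory.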

RMS LayerNorm

triton_kernels/rms_layernorm.py

T5-style RMS normalization in a single Triton kernel. FP32 variance computation, no mean subtraction.
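The math the kernel fuses is small enough to show whole; a NumPy reference (epsilon value is an assumption):

```python
import numpy as np

def rms_layernorm(x, weight, eps=1e-6):
    """T5-style RMS norm: variance computed in FP32 over the last axis,
    no mean subtraction, followed by a learned per-channel scale."""
    var = np.mean(x.astype(np.float32) ** 2, axis=-1, keepdims=True)
    return (x.astype(np.float32) / np.sqrt(var + eps)) * weight
```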

Fused RoPE

triton_kernels/rope.py

Rotary Position Embeddings fused into one kernel: inv_freq computation + cos/sin + Q/K rotation.
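As a reference for what the fusion covers, here is the unfused computation in NumPy. RoPE layouts differ between models (interleaved pairs vs. split halves); this sketch assumes the interleaved convention:

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """inv_freq -> cos/sin tables -> pairwise rotation of adjacent
    channels. The Triton kernel computes all three stages in one pass
    over Q and K instead of materializing the tables."""
    d = x.shape[-1]
    inv_freq = 1.0 / base ** (np.arange(0, d, 2, dtype=np.float64) / d)
    angles = np.asarray(positions, dtype=np.float64)[:, None] * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty(x.shape, dtype=np.float64)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```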

Fused LayerNorm + Linear

triton_kernels/fused_layernorm_linear.py

Merges RMS LayerNorm and linear projection. Eliminates the normalized intermediate tensor entirely.

Fused MLP

triton_kernels/fused_mlp.py

Two-layer MLP with ReLU activation in a single kernel pass. Avoids 3072-wide intermediate buffer allocation.

Fused Output Transform

triton_kernels/fused_output.py

Output rearrange + sinh + unscale in a single kernel pass over the output tensor.

Fused Preprocessing

triton_kernels/fused_preprocess.py

NaN-aware preprocessing pipeline. Two-phase: shared memory reduction for statistics, then transform.
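The two-phase structure can be sketched in NumPy (the exact statistics used are an assumption; the kernel performs phase 1 as a shared-memory reduction):

```python
import numpy as np

def nan_aware_preprocess(series):
    """Phase 1: reduce per-series mean/std over valid (non-NaN) points.
    Phase 2: fill NaNs and standardize. Assumes each series has at
    least one valid point."""
    mask = ~np.isnan(series)
    n_valid = mask.sum(-1, keepdims=True)
    mean = np.where(mask, series, 0.0).sum(-1, keepdims=True) / n_valid
    centered = np.where(mask, series - mean, 0.0)
    std = np.sqrt((centered ** 2).sum(-1, keepdims=True) / n_valid)
    std = np.where(std > 0, std, 1.0)          # guard constant series
    filled = np.where(mask, series, mean)      # phase 2: NaN -> series mean
    return (filled - mean) / std
```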

CUDA Preprocessing

cpp/preprocessing.cu

C++/CUDA NaN-aware normalization and patching with shared memory reductions for per-series statistics.

CuteZImage

Z-Image Turbo transformer — 30 layers, dim=3840, 30 heads, SiLU-gated FFN (hidden=10240)

| Configuration | Mode | Latency | Speedup | VRAM |
|---|---|---|---|---|
| Original Z-Image Turbo | Baseline | 105.79 ms | 1.0x | 11,983 MB |
| CuteZImage (eager) | Eager | 93.30 ms | 1.13x | 11,977 MB |
| CuteZImage (torch.compile) | Compiled | 90.99 ms | 1.16x | 11,883 MB |

Output correctness: max_abs_error = 0.0 (bit-exact match with original)

Custom Kernels (5 Triton + 3 CUDA)

Fused SiLU-Gated FFN

triton_kernels/fused_silu_gate_ffn.py

Fuses silu(w1(x)) * w3(x) in a single kernel. Eliminates 10240-wide intermediate allocation that dominates memory bandwidth.
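A NumPy reference for the fused expression (shown unfused for clarity; in the kernel the gate/up intermediates live only in registers/shared memory):

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def silu_gate_ffn(x, w1, w3, w2):
    """silu(x @ w1) * (x @ w3), then down-projection. The Triton kernel
    computes this tile-by-tile so the 10240-wide intermediates never
    hit global memory."""
    return (silu(x @ w1) * (x @ w3)) @ w2
```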

Fused AdaLN + RMS Norm

triton_kernels/fused_adaln_norm.py

Timestep conditioning fused with normalization in one pass instead of separate modulation + norm.

Complex-valued RoPE

triton_kernels/rope_complex.py

Fuses reshape + complex multiply + flatten for Z-Image's complex multiplication convention. Avoids intermediate complex tensor allocations.

RMS Norm

triton_kernels/rms_norm.py

Triton-accelerated RMS normalization with FP32 variance and optional weight scaling.

CUDA RMS Norm

csrc/cute_rms_norm.cu

Vectorized 128-bit loads/stores (bfloat162, half2, float4) for maximum memory bandwidth.

CUDA SiLU Gate

csrc/cute_silu_gate.cu

Vectorized SiLU activation + gating for float32, float16, and bfloat16 with coalesced memory access.

Fused QKV + Norm + RoPE

triton_kernels/fused_qkv_norm_rope.py

Combined QKV projection, normalization, and RoPE in a single multi-operation fusion kernel.

Z-Image Inference Pipeline

End-to-end generation latency — Python server vs Go+C migration — RTX 5090, 9-step Z-Image Turbo

| Configuration | Mode | Latency | Speedup | Memory |
|---|---|---|---|---|
| Python (CPU offload) | Eager | ~32,000 ms | 1.0x | ~7 GB |
| Python (GPU resident) | Eager | ~31,500 ms | 1.02x | ~18 GB |
| Python + CuteDSL Kernels | Fused | ~30,000 ms | 1.07x | ~18 GB |
| Go+C (Python embed) | CGO Bridge | ~31,200 ms | 1.03x | ~18 GB |
| Go+C (LibTorch native) | Projected | ~100 ms | ~320x | ~14 GB |
| Go+C + NVFP4 | Projected | ~60 ms | ~533x | ~8 GB |

Note: Current ~32s latency is dominated by Z-Image Turbo's 9-step diffusion loop, not Python overhead. The projected LibTorch numbers assume direct CUDA execution without Python/torch runtime dispatch.

LoRA Search Benchmarks

Semantic LoRA selection — keyword matching vs embedding similarity (gobed)

| Engine | Latency | Accuracy |
|---|---|---|
| Keyword (Python) | <0.1 ms | Good |
| Embedding (Python, sentence-transformers) | 4 ms | Excellent |
| Embedding (Go, gobed) | <1 ms | Excellent |
| Embedding + Negative (Python) | 5 ms | Best |
| Embedding + Negative (Go, gobed) | <1 ms | Best |

TurboQuant Ablations

MAE vs latency across quantization configs — 7 symbols, context=768

| Config | Compression | H=16 Latency | H=16 MAE | H=64 Latency | H=64 MAE | H=128 Latency | H=128 MAE |
|---|---|---|---|---|---|---|---|
| Original PyTorch | 1.0x | 41.98 ms | 1645.1 | 39.78 ms | 4768.5 | 38.96 ms | 6497.4 |
| CuteChronos2 Eager | 1.0x | 19.15 ms | 970.8 | 18.47 ms | 1496.3 | 20.77 ms | 1441.1 |
| CuteChronos2 Compiled | 1.0x | 1.55 ms | 943.4 | 1.65 ms | 1499.7 | 1.63 ms | 1461.0 |
| TQ4 (product+MSE) | 3.66x | 150.79 ms | 970.7 | 124.96 ms | 1471.9 | 123.35 ms | 1474.1 |
| TQ4 + Compiled | 3.66x | 16.40 ms | 966.1 | 22.96 ms | 1498.0 | 20.61 ms | 1467.0 |
| TQ3 (product+MSE) | 4.74x | 346.47 ms | 981.8 | 343.58 ms | 1665.1 | 344.55 ms | 2045.3 |

Rolling Forecast Ablation (448 windows)

| Config | Sequential Total | Per Window | Batched Total | Batched/Window | MAE |
|---|---|---|---|---|---|
| Original PyTorch | 18,041 ms | 40.27 ms | N/A | N/A | 4263.2 |
| CuteChronos2 Eager | 11,439 ms | 25.53 ms | 400.6 ms | 0.89 ms | 317.8 |
| CuteChronos2 Compiled | 753.2 ms | 1.68 ms | 52.3 ms | 0.117 ms | 317.2 |
| TQ4 + Compiled | 9,312 ms | 20.78 ms | 417.3 ms | 0.93 ms | 320.2 |

Rolling forecast: 448 sequential 1-step predictions over 7 crypto/equity symbols (BTC, ETH, SOL, TSLA, AAPL, NVDA, LINK)

Acceleration Techniques

Everything we're building, researching, and shipping to make models faster

Latent Teleportation

latentteleport/
Research

Pre-compute and cache intermediate diffusion latents for common prompt patterns. Interpolate between cached latents using SLERP or neural combiners, then refine from the interpolated latent with fewer denoising steps. Reduces effective diffusion steps from ~20 to ~5.

- SLERP interpolation (custom Triton kernel)
- Neural network combiners
- Tree-based hierarchical combination
- NLP / curated / CLIP tokenizer strategies
- Confidence estimation for cache hits
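SLERP on flattened latents can be sketched in a few lines of NumPy (the Triton version parallelizes the same formula; the lerp fallback for near-parallel vectors is an assumption):

```python
import numpy as np

def slerp(a, b, t, eps=1e-7):
    """Spherical linear interpolation between two flattened latents.
    Falls back to plain lerp when the vectors are nearly parallel."""
    a_n = a / np.linalg.norm(a)
    b_n = b / np.linalg.norm(b)
    theta = np.arccos(np.clip(a_n @ b_n, -1.0, 1.0))
    if np.sin(theta) < eps:
        return (1 - t) * a + t * b
    return (np.sin((1 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)
```

The interpolated latent then seeds the remaining denoising steps instead of starting from pure noise.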

TurboQuant

tubroquant/
Research

Vector quantization for KV cache and attention optimization. MSE and product quantization modes with Hadamard rotation for improved compression. 2-16x compression with configurable bit-widths.

- MSE vector quantization with codebook lookup
- Product (residual) quantization
- Hadamard rotation pre-processing
- Norm-aware reconstruction
- CUDA kernel for quantized QK attention scores
- Bit-packing with warp-level reductions
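The MSE codebook-lookup path reduces to a nearest-neighbor argmin; a NumPy sketch (function names illustrative):

```python
import numpy as np

def vq_quantize(x, codebook):
    """Replace each row of x with the index of its nearest codebook
    entry under squared error, via ||x||^2 - 2 x.c + ||c||^2."""
    d2 = ((x ** 2).sum(-1, keepdims=True)
          - 2.0 * x @ codebook.T
          + (codebook ** 2).sum(-1))
    return d2.argmin(axis=-1)

def vq_dequantize(codes, codebook):
    """Reconstruction is a plain gather from the codebook."""
    return codebook[codes]
```

Product quantization applies the same lookup independently per sub-vector, which is where the 2-16x compression range comes from.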

torch.compile + CUDA Graphs

cutechronos/model.py
Production

Captures the full forward pass as a CUDA graph via torch.compile with reduce-overhead mode. Eliminates kernel launch overhead entirely. Requires fixed input shapes. Achieves the 27.1x speedup on CuteChronos2 shown in the eval table above.

- reduce-overhead compile mode
- CUDA graph capture
- Fixed-shape inference
- Zero kernel launch overhead

NVFP4 Quantization

inference/server.py
Production

4-bit floating point quantization for RTX 5090 Blackwell architecture via torchao. Per-block scaling gives ~2x memory reduction and faster matmuls on SM100+ GPUs.

- NVFP4 weight-only quantization
- Per-block scaling (block_size=32)
- torchao integration
- RTX 5090 Blackwell SM100+
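To illustrate where per-block scaling's memory savings come from: a NumPy sketch that simulates per-block 4-bit quantization with a symmetric integer grid. This is an illustration only; real NVFP4 stores e2m1 FP4 values with hardware-decoded per-block scale factors via torchao:

```python
import numpy as np

def quantize_blockwise(w, block_size=32):
    """Split weights into blocks, store one FP scale per block plus a
    4-bit code per element (simulated here in int8)."""
    blocks = w.reshape(-1, block_size)
    scale = np.abs(blocks).max(axis=1, keepdims=True) / 7.0  # int4 grid [-8, 7]
    scale = np.where(scale == 0.0, 1.0, scale)               # guard zero blocks
    q = np.clip(np.round(blocks / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_blockwise(q, scale):
    return (q.astype(np.float32) * scale).ravel()
```

At 4 bits per weight plus one scale per 32 elements, storage is roughly 4.5 bits/weight versus 16, hence the ~2x reduction against an already-halved FP8 baseline and larger against FP16.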

Kernel Fusion

cutechronos/, cutezimage/
Production

Fuse multiple operations into single Triton/CUDA kernels to eliminate intermediate tensor allocations and reduce memory bandwidth pressure. Core technique across all CuteDSL modules.

- LayerNorm + Linear fusion
- SiLU + Gate multiply fusion
- AdaLN + Normalization fusion
- RoPE + QKV fusion
- Preprocessing pipeline fusion

Vectorized CUDA

cutezimage/csrc/
Production

128-bit vectorized loads and stores using bfloat162, half2, and float4 types. Maximizes memory bandwidth utilization on modern GPUs.

- 128-bit vectorized loads/stores
- Warp-level reductions
- Shared memory optimizations
- Coalesced access patterns

stable-diffusion.cpp

diffusionz/
In Progress

Pure C++ stable diffusion inference using GGML-style quantization. Goal: run diffusion models on CPU or with minimal GPU, similar to how llama.cpp democratized LLM inference. DiffusionZ C API (diffusionz_c_api.h) is implemented with LibTorch backend; evaluating stable-diffusion.cpp as an alternative lighter backend.

- GGML-style quantization (Q4, Q8)
- CPU inference path
- Metal/CUDA backends
- DiffusionZ C API (implemented)
- LibTorch C++ backend
- CGO bridge to Go server

Model Architectures

Internal structure of the models we accelerate

CuteChronos2 Pipeline

Input → InstanceNorm → Patching
→ InputPatchEmbedding
→ [EncoderBlock × 12]
TimeSelfAttention (RoPE) + GroupSelfAttention + FeedForward
→ FinalLayerNorm → OutputPatchEmbedding
→ Inverse InstanceNorm → Quantile Predictions

CuteZImage Transformer

Text Encoding → Timestep Embed → Patch Embed
→ [TransformerBlock × 30]
AdaLN + Attention (Complex RoPE) + SiLU-Gated FFN
→ [RefinerBlock × 2]
→ Final Norm → Unpatchify → VAE Decode