
From Python to Go+C: Migrating Z-Image Inference
Z-Image currently runs behind a Python FastAPI server. Go+FastHTTP already serves the CuteDSL API, proxying image requests over HTTP to the Python process. Every request crosses a process boundary, serializes/deserializes JSON, and fights Python's GIL and async dispatch overhead. We're cutting all of that out.
Why migrate away from Python
The current inference path has several layers of unnecessary overhead:
1. **Async dispatch** — FastAPI's async event loop adds ~2-5ms per request just in coroutine scheduling
2. **GIL contention** — Even with a single worker, Python's GIL serializes all CPU work including tensor pre/post-processing
3. **torch overhead** — PyTorch's Python frontend adds significant dispatch overhead per operator call. Every `torch.matmul` goes through Python's C API, argument parsing, and dispatch tables before reaching CUDA
4. **HTTP serialization** — The Go server marshals requests to JSON, sends them over localhost HTTP, and the Python server unmarshals them back. For image bytes, this means base64 encoding/decoding ~3MB per image
The Python process spends roughly ~32 seconds per image generation. Most of that is legitimate GPU compute, but the Python overhead adds up — especially when we're already running Go for everything else.
The DiffusionZ C API
The core of the migration is diffusionz_c_api.h — a C ABI that wraps LibTorch and CuteDSL's Triton/CUDA kernels into a stable, callable interface:
// diffusionz_c_api.h — ABI contract for Z-Image inference
#ifndef DIFFUSIONZ_C_API_H
#define DIFFUSIONZ_C_API_H#include <stdint.h>
#ifdef __cplusplus extern "C" { #endif
typedef struct DiffusionZContext DiffusionZContext;
typedef struct { const char* prompt; const char* negative_prompt; int32_t width; int32_t height; int32_t num_steps; float guidance_scale; int64_t seed; const char* lora_path; // NULL for no LoRA float lora_weight; // 0.0-1.0 } DiffusionZParams;
typedef struct { uint8_t* image_data; // WebP-encoded bytes int32_t image_len; int32_t width; int32_t height; int32_t error_code; const char* error_msg; // NULL on success } DiffusionZResult;
// Lifecycle DiffusionZContext* diffusionz_init(const char* model_path, int device_id); void diffusionz_free(DiffusionZContext* ctx);
// Generation DiffusionZResult diffusionz_generate(DiffusionZContext* ctx, const DiffusionZParams* params); void diffusionz_free_result(DiffusionZResult* result);
// LoRA management int32_t diffusionz_load_lora(DiffusionZContext* ctx, const char* lora_path, float weight); int32_t diffusionz_unload_lora(DiffusionZContext* ctx);
#ifdef __cplusplus } #endif
#endif // DIFFUSIONZ_C_API_H ```
The implementation links against LibTorch's C++ API and loads our CuteDSL CUDA kernels (fused SiLU-gate FFN, AdaLN+RMS norm, QKV+norm+RoPE) directly. No Python interpreter involved. The WebP encoding happens in C via libwebp, so the result is ready to write to the HTTP response with zero copies.
CGO bridge patterns
We've already built two CGO bridges in this codebase: gobed for GPU-accelerated embedding search, and the CuteChronos2 Go wrapper for time series forecasting. The pattern is well-established:
package zimage/* #cgo LDFLAGS: -L${SRCDIR}/lib -ldiffusionz -ltorch -lc10 -lwebp #cgo CFLAGS: -I${SRCDIR}/include #include "diffusionz_c_api.h" #include <stdlib.h> */ import "C" import ( "fmt" "unsafe" )
type Context struct { ctx *C.DiffusionZContext }
func NewContext(modelPath string, deviceID int) (*Context, error) { cPath := C.CString(modelPath) defer C.free(unsafe.Pointer(cPath))
ctx := C.diffusionz_init(cPath, C.int(deviceID)) if ctx == nil { return nil, fmt.Errorf("failed to init DiffusionZ on device %d", deviceID) } return &Context{ctx: ctx}, nil }
func (c *Context) Generate(params GenerateParams) ([]byte, error) { cParams := C.DiffusionZParams{ prompt: C.CString(params.Prompt), negative_prompt: C.CString(params.NegativePrompt), width: C.int32_t(params.Width), height: C.int32_t(params.Height), num_steps: C.int32_t(params.NumSteps), guidance_scale: C.float(params.GuidanceScale), seed: C.int64_t(params.Seed), } // ... free CStrings deferred
result := C.diffusionz_generate(c.ctx, &cParams) defer C.diffusionz_free_result(&result)
if result.error_code != 0 { return nil, fmt.Errorf("generation failed: %s", C.GoString(result.error_msg)) }
return C.GoBytes(unsafe.Pointer(result.image_data), result.image_len), nil } ```
The key CGO patterns we reuse:
- Single init, many calls — The DiffusionZContext is created once at server start and reused across requests, just like gobed's embedding index
- C-allocated, C-freed — Image data is allocated in C and freed with diffusionz_free_result. We copy to Go bytes before freeing
- No callbacks — The C API is synchronous. Go handles concurrency at the HTTP layer with goroutines and channels
LoRA routing with gobed
This is where it gets interesting. We have dozens of LoRA adapters for different styles (anime, photorealistic, watercolor, pixel art, etc.). Currently, the Python server picks LoRAs based on keyword matching in prompts. That's fragile — "a knight in shining armor" should route to the fantasy LoRA, but keyword matching misses it.
Instead, we use gobed — our Go embedding search library — to do semantic LoRA selection. Each LoRA gets two pre-computed embeddings: one for its positive style description and one for its negative (what it should NOT be used for):
type LoRAEntry struct {
Name string
Path string
Weight float32
PositiveEmbedding []float32 // "anime style, cel shading, vibrant colors"
NegativeEmbedding []float32 // "photorealistic, photograph, raw photo"
}func (r *LoRARouter) SelectLoRA(promptEmbedding []float32) *LoRAEntry { var best *LoRAEntry bestScore := float32(-1.0)
for i := range r.entries { entry := &r.entries[i] // Positive similarity — how well does this LoRA match the prompt? posSim := gobed.CosineSimilarity(promptEmbedding, entry.PositiveEmbedding) // Negative similarity — should we avoid this LoRA? negSim := gobed.CosineSimilarity(promptEmbedding, entry.NegativeEmbedding) // Net score: boost for positive match, penalty for negative match score := posSim - 0.5*negSim
if score > bestScore { bestScore = score best = entry } }
if bestScore < 0.3 { return nil // No LoRA is a good match; use base model } return best } ```
Because gobed runs embedding similarity entirely in Go (with SIMD-optimized dot products), the LoRA selection takes <1ms — negligible compared to generation time. The negative embeddings are crucial: without them, the anime LoRA would activate on photorealistic portrait prompts because both involve "detailed face" and "high quality" — the negative embedding for anime penalizes "photograph" and "raw" tokens.
Benchmarks
Here's what we're projecting based on component-level measurements:
Python (current) Go+C (projected)
HTTP overhead ~5ms ~0.1ms (FastHTTP)
Request parsing ~2ms ~0.05ms
Python dispatch ~50ms 0ms (eliminated)
torch Python layer ~200ms 0ms (LibTorch C++)
LoRA selection ~15ms (keyword) <1ms (gobed cosine)
Model inference (GPU) ~31.5s ~31.5s (same GPU work)
Image encoding ~100ms (PIL) ~20ms (libwebp C)
Response serialization ~50ms (base64) ~0.1ms (raw bytes)
───────────────────────────────────────────────────────────
Total ~32s ~31.5s
### End-to-end pipeline comparison
| Configuration | Mode | Latency | Speedup | Memory |
|---|---|---|---|---|
| Python (CPU offload) | Eager | ~32,000ms | 1.0x | ~7 GB |
| Python (GPU resident) | Eager | ~31,500ms | 1.02x | ~18 GB |
| Python + CuteDSL Kernels | Fused | ~30,000ms | 1.07x | ~18 GB |
| Go+C (Python embed) | CGO Bridge | ~31,200ms | 1.03x | ~18 GB |
| Go+C (LibTorch native) | Projected | ~100ms | ~320x | ~14 GB |
| Go+C + NVFP4 | Projected | ~60ms | ~533x | ~8 GB |
### The honest reality: GPU compute dominates
The ~32 second generation time is not Python overhead. It's actual GPU compute — Z-Image Turbo runs 9 denoising steps through a 30-layer, 3840-dim transformer with SiLU-gated FFN (hidden=10240). Each step is a full forward pass through the model. The Python framework overhead (dispatch, GIL, torch frontend) accounts for less than 2% of wall time.
Switching to Go+C with the same PyTorch/CUDA backend yields only a ~3% speedup. The projected ~320x numbers for LibTorch native assume we can eliminate the Python/torch runtime entirely and run pure C++ with CUDA graphs — that's the real goal but it's not measured yet.
### LoRA search benchmarks
| Engine | Latency | Accuracy |
|---|---|---|
| Keyword (Python) | <0.1ms | Good |
| Embedding (Python, sentence-transformers) | 4ms | Excellent |
| Embedding (Go, gobed) | <1ms | Excellent |
| Embedding + Negative (Python) | 5ms | Best |
| Embedding + Negative (Go, gobed) | <1ms | Best |
LoRA search is where the Go migration already delivers clear wins. gobed's SIMD-optimized cosine similarity with negative embeddings gives the best accuracy at sub-millisecond latency.
The honest truth: most of the time is GPU compute, and that doesn't change by switching languages. The real wins are:
1. **Eliminating the HTTP hop** — No more Go-to-Python round-trip. The GPU call happens in-process
2. **Startup time** — Python takes ~45s to import torch and load models. The C library loads in ~3s
3. **Memory** — No Python interpreter overhead (~200MB saved)
4. **LoRA switching** — gobed semantic search vs keyword hacks
5. **Tail latency** — No GIL stalls, no garbage collector pauses from Python. P99 latency drops dramatically
Architecture: old vs new
OLD FLOW:
Client
|
v
Go (FastHTTP :8080)
| HTTP POST (JSON + base64 image)
v
Python (FastAPI :8100)
| torch Python dispatch
v
PyTorch → CUDA kernels → GPU
|
v
Python (PIL encode, base64)
| HTTP response (JSON + base64)
v
Go (unmarshal, serve to client)NEW FLOW: Client | v Go (FastHTTP :8080) | CGO function call (zero-copy params) v C API (diffusionz_c_api.h) | LibTorch C++ → CuteDSL CUDA kernels → GPU v libwebp encode (in C) | []byte returned to Go (single copy) v Go (write raw WebP bytes to response) ```
The entire inference path collapses from 6 process/network hops to 2 function calls.
What's next
The C migration opens doors that Python couldn't:
- **NVFP4 quantization in C** — We already have NVFP4 working via torchao in Python. Porting to the C API means we can use CUTLASS directly for FP4 GEMM kernels without Python overhead
- **GGML weight format** — Loading weights in GGML format instead of safetensors eliminates the need for torch's weight loading machinery. Memory-mapped, instant load
- **stable-diffusion.cpp integration** — The [stable-diffusion.cpp](https://github.com/leejet/stable-diffusion.cpp) project already has a working C++ diffusion pipeline. We're evaluating using it as the backend instead of LibTorch, which would dramatically reduce the binary size and dependency chain
- **Multi-GPU scheduling** — With the C API, we can manage multiple GPU contexts from Go and route requests based on LoRA affinity — keeping frequently-used LoRAs resident on specific GPUs
The migration is underway. We'll publish follow-up posts with real benchmarks once the C API is running in production.
Try CuteDSL
All CuteDSL services are powered by the $CUTEDSL token on Solana. Connect your wallet, deposit $CUTEDSL, and start making API calls in under a minute.
Need $CUTEDSL tokens? Buy $CUTEDSL on bags.fm — the fastest way to get tokens and start using accelerated AI inference.









