Blog
Engineering notes
Deep dives from the team building OpenAlchemy — inference internals, model optimization, and the open-source work that powers the platform.
- Engineering · June 9, 2026 · 7 min read
Whisper on RTX 50-series: our whisper.cpp fork and a production STT layer
Upstream whisper.cpp crashes on Blackwell GPUs. We fixed the crash, then built a production layer around Whisper: streaming, word-level timestamps, cancellation, and an OpenAI-compatible transcription API.
- Engineering · May 30, 2026 · 9 min read
TurboQuant comes to llama.cpp: 2-bit and 3-bit KV cache compression
We forked llama.cpp to bring Google Research's TurboQuant to the KV cache — shrinking it ~5× at 3-bit while staying quality-neutral, with measured wins on Qwen.