Engineering · June 9, 2026 · 7 min read

Whisper on RTX 50-series: our whisper.cpp fork and a production STT layer

Upstream whisper.cpp crashes on Blackwell GPUs. We fixed it — then wrapped Whisper in streaming, word-level timestamps, cancellation, and an OpenAI-compatible transcription API.

We run a lot of speech-to-text on the newest NVIDIA hardware, and we hit a wall that anyone with an RTX 50-series card will recognize: whisper.cpp doesn't start. Not “runs slowly” — it aborts the process during CUDA initialization, before a single second of audio is decoded. So we forked it. Our fork lives at github.com/openalchemy/whisper.cpp, and this post covers the crash we fixed and the production transcription layer we built around it.

The Blackwell crash

On RTX 50-series GPUs (Blackwell, compute capability sm_120), the CUDA backend can report a device slot that is still only partially enumerated. When that happens, ggml_backend_reg_dev_get() returns NULL. Upstream pushes that pointer into its device list unconditionally, and the next caller that walks the list trips an assertion and kills the process:

// Upstream: a partially-enumerated Blackwell (sm_120) slot returns NULL,
// the NULL is stored, and a later GGML_ASSERT(device) aborts the process.
ggml_backend_dev_t dev = ggml_backend_reg_dev_get(reg, i);
devices.push_back(dev);            // NULL goes in → crash during backend init

// OpenAlchemy fork: guard the empty slot and keep enumerating.
ggml_backend_dev_t dev = ggml_backend_reg_dev_get(reg, i);
if (!dev) continue;                // skip the partial slot
devices.push_back(dev);

The fix is small and surgical — a handful of NULL guards in ggml-backend-reg.cpp and the backend-init paths of whisper.cpp. With them, the CUDA backend enumerates cleanly, skips the empty slot, and decoding runs to completion. We validated it on an RTX 5080 (16 GB) with driver 591.86 and CUDA 13.2: without the patch every call asserts during init; with it, transcription just works. The fork tracks upstream whisper.cpp v1.8.6, so you get the latest engine plus the one patch that makes it boot on current GPUs.

From “it runs” to a real API

Getting Whisper to start is the floor, not the ceiling. Around the forked engine we built a runtime layer that turns it into a transcription service you can actually ship against — streaming, word-level timing, an OpenAI-compatible response shape, and clean cancellation. Here is what the OpenAlchemy STT runtime adds on top of stock whisper.cpp:

Capability                                    Stock whisper.cpp   OpenAlchemy
Runs on RTX 50-series (Blackwell / sm_120)    No                  Yes
Per-segment streaming results                 No                  Yes
Word-level timestamps over the API            No                  Yes
OpenAI-compatible verbose_json output         No                  Yes
Per-request cancellation                      No                  Yes
Reliable language=auto detection              No                  Yes
Usage accounting (audio seconds)              No                  Yes

A couple of these are bug fixes as much as features. The biggest: language=auto used to return zero segments — the language-detection path and the decode path were fighting each other, so auto-detect requests silently produced empty transcripts. That is fixed, and language detection now runs implicitly on the decode path.

What you get back

Transcriptions come back in the same verbose_json shape the OpenAI Audio API uses, so existing clients work unchanged — segments, detected language, duration, and (when you ask for them) per-word timestamps, plus a usage field for billing:

{
  "task": "transcribe",
  "language": "ja",
  "duration": 12.84,
  "text": "...",
  "segments": [
    { "id": 0, "start": 0.00, "end": 3.20, "text": "..." }
  ],
  "words": [
    { "word": "こんにちは", "start": 0.00, "end": 0.48 }
  ],
  "usage": { "audio_seconds": 12.84 }
}
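
As a rough sketch of consuming that payload on the client side, the snippet below parses the sample response above, groups the top-level words under the segment whose time range contains each word's start, and reads the billed seconds from the usage field. The helper name words_by_segment is ours for illustration, and matching words to segments by start time is an assumption about how the two lists relate, not a documented guarantee of the API.

```python
import json

# Sample payload copied from the response shape shown above; the "..." text
# fields are placeholders, not real transcript content.
SAMPLE = """
{
  "task": "transcribe",
  "language": "ja",
  "duration": 12.84,
  "text": "...",
  "segments": [{"id": 0, "start": 0.00, "end": 3.20, "text": "..."}],
  "words": [{"word": "こんにちは", "start": 0.00, "end": 0.48}],
  "usage": {"audio_seconds": 12.84}
}
"""


def words_by_segment(response):
    """Map each segment id to the words whose start time falls inside it.

    Hypothetical helper: assumes word start times line up with segment ranges.
    """
    # "words" is only present when word-level timestamps were requested.
    words = response.get("words", [])
    grouped = {}
    for seg in response["segments"]:
        grouped[seg["id"]] = [
            w["word"] for w in words if seg["start"] <= w["start"] < seg["end"]
        ]
    return grouped


response = json.loads(SAMPLE)
grouped = words_by_segment(response)
billed_seconds = response["usage"]["audio_seconds"]
print(response["language"], grouped, billed_seconds)
```

Because words is read with a default, the same helper also works for responses that were requested without word-level timestamps; each segment simply maps to an empty list.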

Calling it is a standard multipart request against /v1/audio/transcriptions (translation lives at /v1/audio/translations):

curl https://api.openalchemy.io/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENALCHEMY_API_KEY" \
  -F model=whisper-large-v3 \
  -F response_format=verbose_json \
  -F "timestamp_granularities[]=word" \
  -F file=@meeting.m4a
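
For readers calling the endpoint from Python rather than curl, here is a minimal sketch of the same request. It assumes the third-party requests package is installed; the function names, the dictionary layout, and the 300-second timeout are illustrative choices of ours, not part of the API. The URL, model name, and form fields are the ones from the curl command above.

```python
from pathlib import Path

API_URL = "https://api.openalchemy.io/v1/audio/transcriptions"


def build_transcription_request(audio_path, api_key, model="whisper-large-v3"):
    """Return the pieces of the multipart request without sending anything."""
    return {
        "url": API_URL,
        "headers": {"Authorization": f"Bearer {api_key}"},
        # A list of pairs rather than a dict, so the "[]" field name is sent
        # exactly as in the curl command.
        "data": [
            ("model", model),
            ("response_format", "verbose_json"),
            ("timestamp_granularities[]", "word"),
        ],
        "audio_path": Path(audio_path),
    }


def send(request):
    """POST the request and return the decoded JSON body."""
    import requests  # imported lazily so building a request needs no dependency

    with request["audio_path"].open("rb") as audio:
        reply = requests.post(
            request["url"],
            headers=request["headers"],
            data=request["data"],
            files={"file": (request["audio_path"].name, audio)},
            timeout=300,  # arbitrary illustrative value; tune for long recordings
        )
    reply.raise_for_status()
    return reply.json()


# Building the request touches neither the network nor the audio file.
request = build_transcription_request("meeting.m4a", "YOUR_API_KEY")
```

Calling send(request) performs the upload. Keeping the build step separate makes it easy to log or inspect exactly what will be sent before any audio leaves the machine.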

Running it locally

Whisper ships as a Speech-to-Text runtime in the OpenAlchemy Engine desktop app (v0.5.1). Pick the Whisper engine under Speech-to-Text, and the same fork that survives Blackwell init powers transcription on your own GPU — the build is the reason a 50-series card can serve STT jobs at all.

The patch is open source: read the guards and the backend-init changes at github.com/openalchemy/whisper.cpp. If you are on a 50-series card and whisper.cpp has been asserting on you during CUDA initialization, an unguarded NULL device pointer is the likely reason, and these guards are the fix.

A note on honesty: this release is a stability and integration story, not a speed claim. We don't have published real-time factor (RTF) or accuracy benchmarks for the fork yet, so we're not quoting any. When we have measured numbers, they'll get their own post.
