Model Selection

Date checked: 2026-05-13.

Recommendation

Start with faster-whisper using CPU int8 presets. Keep small as the default for low-resource smoke tests, then benchmark large-v3-turbo as the first serious Whisper-family step.

Current OpenAI Whisper version check:

  • latest official OpenAI open-source Whisper checkpoint: openai/whisper-large-v3-turbo;
  • Hugging Face created date: 2024-10-01;
  • Hugging Face last modified date: 2024-10-04;
  • base model: openai/whisper-large-v3;
  • OpenAI's own whisper repository currently aliases turbo to large-v3-turbo.

There does not appear to be a newer official OpenAI open-source Whisper checkpoint from 2025 or 2026. Newer ASR models exist, but they are not OpenAI Whisper checkpoints. The small preset remains in this project only as the legacy CPU baseline.

This is not the absolute newest ASR architecture, but it is the best first backend for this library because:

  • it runs on CPU with quantization;
  • it supports English, Spanish, and Catalan through Whisper's multilingual coverage;
  • it has mature Python packaging;
  • it exposes segment timestamps;
  • Catalan can be forced with language="ca";
  • stronger Catalan-specific Whisper checkpoints can be converted into CTranslate2 later.
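Forcing Catalan is one keyword argument; a sketch of a detect-then-force comparison worth logging during benchmarks (`model` is a `faster_whisper.WhisperModel`; the threshold value is a made-up starting point to tune, not a library default):

```python
LOW_CONFIDENCE = 0.6  # hypothetical threshold; tune on the benchmark set

def transcribe_or_force(model, path, expected="ca"):
    """Auto-detect first; re-run with the language pinned if detection looks wrong.

    Short Catalan clips are where detection fails most often, so this is the
    comparison worth recording. Forcing skips detection entirely via language="ca".
    """
    segments, info = model.transcribe(path)
    if info.language != expected or info.language_probability < LOW_CONFIDENCE:
        segments, info = model.transcribe(path, language=expected)
    return list(segments), info.language
```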

Newer models checked

Whisper large-v3-turbo

Best current official OpenAI Whisper-family candidate for this library because it keeps Whisper's broad multilingual coverage, including English, Spanish, and Catalan, while being much faster than full large-v3. The faster-whisper package maps large-v3-turbo to mobiuslabsgmbh/faster-whisper-large-v3-turbo, so it works with the existing backend.
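The preset-to-repo mapping can be pinned explicitly so the benchmark stays reproducible even if the package's alias table changes; the dict below is a project convention, not faster-whisper API (the repo id is the one named above, and faster-whisper accepts either a preset name or a Hugging Face repo id):

```python
# Benchmark candidates in promotion order.
CANDIDATE_MODELS = {
    "small": "small",  # legacy CPU smoke-test baseline
    "large-v3-turbo": "mobiuslabsgmbh/faster-whisper-large-v3-turbo",
}
```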

Sources:

OpenVINO Whisper large-v3-turbo INT4

OpenVINO INT4 is the smallest current CPU experiment for the latest official OpenAI Whisper turbo checkpoint. It uses a separate OpenVINO backend, not faster-whisper. After local testing, it is removed from the V1 production scope because it adds a second backend path and behaved awkwardly on longer clips. Keep this section as historical research, not as an implementation target.

Sources:

Cohere Transcribe 03-2026

Very strong 2026 open-source ASR model and current English leaderboard leader in Cohere's published results. Not the default here because it supports Spanish and English, but not Catalan. It is also a 2B model and expects a language tag, with limitations around automatic language detection, code-switching, timestamps, and diarization.

Source: https://huggingface.co/CohereLabs/cohere-transcribe-03-2026

Qwen3-ASR

Strong 2026 open-source ASR family with 1.7B and 0.6B variants, streaming/offline modes, language identification, and broad language coverage. Not the default here because the published supported language list includes Spanish and English, but not Catalan.

Source: https://huggingface.co/Qwen/Qwen3-ASR-1.7B

IBM Granite Speech 4.x

Modern Apache-2.0 speech-language models with strong leaderboard results. Not the default here because the supported languages include English, French, German, Spanish, Portuguese, and Japanese, but not Catalan.

Sources:

NVIDIA Canary / Parakeet

Canary 1B v2 covers 25 European languages and has useful punctuation, capitalization, timestamps, and translation features. Its published supported language list includes Spanish and English, but not Catalan. NVIDIA's own docs also position these models around NVIDIA GPU stacks, so they are not the first CPU-only library backend.

Sources:

Catalan-specific candidates

BSC-LT/whisper-large-v3-LoS

Best-looking Spanish/Catalan joint candidate. It is fine-tuned from openai/whisper-large-v3 for Spanish, Catalan, Galician, and Basque on 8,110 hours and is Apache-2.0. It should be benchmarked after conversion to CTranslate2 before becoming a production preset.

Source: https://huggingface.co/BSC-LT/whisper-large-v3-LoS
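The conversion step can be scripted with CTranslate2's real converter CLI (`ct2-transformers-converter` with `--model`, `--output_dir`, `--quantization`); this sketch only assembles the command, and the output directory is a placeholder:

```python
def ct2_convert_command(model_id, output_dir, quantization="int8"):
    """Assemble the ct2-transformers-converter CLI call for a Whisper fine-tune."""
    return [
        "ct2-transformers-converter",
        "--model", model_id,
        "--output_dir", output_dir,
        "--quantization", quantization,
    ]

# Usage (not executed here): pass the list to subprocess.run(..., check=True),
# e.g. ct2_convert_command("BSC-LT/whisper-large-v3-LoS", "models/los-ct2").
```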

BSC-LT/whisper-large-v3-ca-punctuated-3370h

Catalan-only Whisper large-v3 fine-tune trained on about 3,370 hours with punctuation/capitalization. Good quality candidate, but large and not the first CPU default.

Source: https://huggingface.co/BSC-LT/whisper-large-v3-ca-punctuated-3370h

BSC-LT/whisper-bsc-large-v3-cat

Catalan-only Whisper large-v3 fine-tune trained on 4,700 hours. It outputs plain text without punctuation. Strong candidate when raw Catalan WER matters more than punctuation.

Source: https://huggingface.co/BSC-LT/whisper-bsc-large-v3-cat

Benchmark plan

Use 50-100 real clips before promoting any model:

  • 15-60 seconds each;
  • Spanish, Catalan, English;
  • phone/laptop microphones;
  • silence and background noise cases;
  • names, product vocabulary, and mixed accents;
  • short Catalan clips where language detection is likely to fail.
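One way to keep that clip set honest is a tiny manifest entry per clip; the field names below are project conventions invented for this sketch, not an existing schema:

```python
from dataclasses import dataclass

@dataclass
class BenchClip:
    path: str
    language: str          # "es", "ca", or "en"
    duration_s: float      # target 15-60 seconds
    mic: str               # e.g. "phone", "laptop"
    noisy: bool = False    # background-noise case
    silence: bool = False  # silence/hallucination case

def in_scope(clip):
    """Keep only clips matching the plan above (language set and duration window)."""
    return clip.language in {"es", "ca", "en"} and 15 <= clip.duration_s <= 60
```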

Measure:

  • wall-clock latency;
  • peak RSS;
  • manual error rate or WER on a labelled subset;
  • punctuation/capitalization quality;
  • hallucination on silence;
  • whether forcing ca or es improves output.
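The first two measurements can be captured per clip with the standard library alone; a sketch that takes any transcribe callable, e.g. a faster-whisper wrapper (note that `ru_maxrss` reports KiB on Linux and bytes on macOS, and is cumulative per process, so run one model per process for clean numbers):

```python
import resource
import time

def measure(transcribe, path):
    """Run one transcription; report wall-clock latency and peak RSS so far."""
    t0 = time.perf_counter()
    result = transcribe(path)
    latency_s = time.perf_counter() - t0
    peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return result, latency_s, peak_rss
```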