Model Selection¶
Date checked: 2026-05-13.
Recommendation¶
Start with faster-whisper and CPU `int8` presets. Keep the default on `small` for low-resource smoke tests, then benchmark `large-v3-turbo` as the first serious Whisper-family step.
Current OpenAI Whisper version check:
- latest official OpenAI open-source Whisper checkpoint: `openai/whisper-large-v3-turbo`;
- Hugging Face created date: 2024-10-01;
- Hugging Face last modified date: 2024-10-04;
- base model: `openai/whisper-large-v3`;
- OpenAI's own `whisper` repository currently aliases `turbo` to `large-v3-turbo`.
There does not appear to be a newer official OpenAI open-source Whisper checkpoint from 2025 or 2026. Newer ASR models exist, but they are not OpenAI Whisper checkpoints. The `small` preset remains in this project only as the legacy CPU baseline.
This is not the absolute newest ASR architecture, but it is the best first library backend for this project because:
- it runs on CPU with quantization;
- it supports English, Spanish, and Catalan through Whisper's multilingual coverage;
- it has mature Python packaging;
- it exposes segment timestamps;
- Catalan can be forced with `language="ca"`;
- stronger Catalan-specific Whisper checkpoints can be converted to CTranslate2 later.
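The points above can be sketched in a few lines. This is a minimal sketch of the intended faster-whisper usage, not this project's actual code; `clip.wav` is a placeholder path, and the import is kept inside the entry-point guard so the formatting helper has no hard dependency:

```python
def format_segments(segments) -> list[str]:
    """Render (start, end, text) triples as timestamped lines."""
    return [f"[{s:6.2f} -> {e:6.2f}] {t.strip()}" for s, e, t in segments]

if __name__ == "__main__":
    from faster_whisper import WhisperModel

    # CPU + int8 matches the recommended low-resource preset.
    model = WhisperModel("small", device="cpu", compute_type="int8")
    # Forcing Catalan skips language detection, which is unreliable
    # on short clips.
    segments, info = model.transcribe("clip.wav", language="ca")
    print("\n".join(format_segments((s.start, s.end, s.text) for s in segments)))
```

Swapping `"small"` for `"large-v3-turbo"` is the only change needed for the first serious benchmark step.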
Newer models checked¶
Whisper large-v3-turbo¶
Best current official OpenAI Whisper-family candidate for this library because it keeps Whisper's broad multilingual coverage, including English, Spanish, and Catalan, while being much faster than full large-v3. The faster-whisper package maps `large-v3-turbo` to `mobiuslabsgmbh/faster-whisper-large-v3-turbo`, so it works with the existing backend.
Sources:
- https://huggingface.co/openai/whisper-large-v3-turbo
- https://huggingface.co/mobiuslabsgmbh/faster-whisper-large-v3-turbo
OpenVINO Whisper large-v3-turbo INT4¶
OpenVINO INT4 is the smallest current CPU experiment for the latest official OpenAI Whisper turbo checkpoint. It uses a separate OpenVINO backend, not Faster-Whisper. After local testing, it is removed from the V1 production scope because it adds a second backend path and behaved awkwardly on longer clips. Keep this section as historical research, not as an implementation target.
Cohere Transcribe 03-2026¶
Very strong 2026 open-source ASR model and current English leaderboard leader in Cohere's published results. Not the default here because it supports Spanish and English, but not Catalan. It is also a 2B model and expects a language tag, with limitations around automatic language detection, code-switching, timestamps, and diarization.
Source: https://huggingface.co/CohereLabs/cohere-transcribe-03-2026
Qwen3-ASR¶
Strong 2026 open-source ASR family with 1.7B and 0.6B variants, streaming/offline modes, language identification, and broad language coverage. Not the default here because the published supported language list includes Spanish and English, but not Catalan.
Source: https://huggingface.co/Qwen/Qwen3-ASR-1.7B
IBM Granite Speech 4.x¶
Modern Apache-2.0 speech-language models with strong leaderboard results. Not the default here because the supported languages include English, French, German, Spanish, Portuguese, and Japanese, but not Catalan.
Sources:
- https://huggingface.co/ibm-granite/granite-4.0-1b-speech
- https://huggingface.co/ibm-granite/granite-speech-4.1-2b
NVIDIA Canary / Parakeet¶
Canary 1B v2 covers 25 European languages and has useful punctuation, capitalization, timestamps, and translation features. Its published supported language list includes Spanish and English, but not Catalan. NVIDIA's own docs also position these models around NVIDIA GPU stacks, so they are not the first CPU-only library backend.
Sources:
- https://huggingface.co/nvidia/canary-1b-v2
- https://docs.nvidia.com/nemo/speech/nightly/starthere/choosing_a_model.html
Catalan-specific candidates¶
BSC-LT/whisper-large-v3-LoS¶
Best-looking Spanish/Catalan joint candidate. It is fine-tuned from `openai/whisper-large-v3` for Spanish, Catalan, Galician, and Basque on 8,110 hours and is Apache-2.0. It should be benchmarked after conversion to CTranslate2 before becoming a production preset.
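The conversion step can be sketched with CTranslate2's converter CLI; the output directory name and `int8` quantization are placeholder choices, not decisions this project has made:

```shell
# Convert the BSC fine-tune to CTranslate2 format for faster-whisper.
ct2-transformers-converter \
  --model BSC-LT/whisper-large-v3-LoS \
  --output_dir whisper-large-v3-LoS-ct2 \
  --quantization int8 \
  --copy_files tokenizer.json preprocessor_config.json
```

The resulting directory can then be passed to `WhisperModel` in place of a preset name for benchmarking.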
Source: https://huggingface.co/BSC-LT/whisper-large-v3-LoS
BSC-LT/whisper-large-v3-ca-punctuated-3370h¶
Catalan-only Whisper large-v3 fine-tune trained on about 3,370 hours with punctuation/capitalization. Good quality candidate, but large and not the first CPU default.
Source: https://huggingface.co/BSC-LT/whisper-large-v3-ca-punctuated-3370h
BSC-LT/whisper-bsc-large-v3-cat¶
Catalan-only Whisper large-v3 fine-tune trained on 4,700 hours. It outputs plain text without punctuation. Strong candidate when raw Catalan WER matters more than punctuation.
Source: https://huggingface.co/BSC-LT/whisper-bsc-large-v3-cat
Benchmark plan¶
Use 50-100 real clips before promoting any model:
- 15-60 seconds each;
- Spanish, Catalan, English;
- phone/laptop microphones;
- silence and background noise cases;
- names, product vocabulary, and mixed accents;
- short Catalan clips where language detection is likely to fail.
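The criteria above can be enforced mechanically before a benchmark run. This is a sketch of a coverage check; the `Clip` fields and gap messages are assumptions for illustration, not an existing schema in this project:

```python
from dataclasses import dataclass

@dataclass
class Clip:
    path: str
    seconds: float
    lang: str        # "es", "ca", or "en"
    noisy: bool = False

def coverage_gaps(clips: list[Clip]) -> list[str]:
    """Return human-readable reasons the clip set is not ready."""
    gaps = []
    if not 50 <= len(clips) <= 100:
        gaps.append(f"need 50-100 clips, have {len(clips)}")
    for lang in ("es", "ca", "en"):
        if not any(c.lang == lang for c in clips):
            gaps.append(f"no {lang} clips")
    if any(not 15 <= c.seconds <= 60 for c in clips):
        gaps.append("clip outside 15-60 s range")
    if not any(c.noisy for c in clips):
        gaps.append("no background-noise clips")
    return gaps
```

An empty return value means the set meets the size, language, duration, and noise criteria; microphone and vocabulary coverage still need a manual pass.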
Measure:
- wall-clock latency;
- peak RSS;
- manual error rate or WER on a labelled subset;
- punctuation/capitalization quality;
- hallucination on silence;
- whether forcing `ca` or `es` improves output.
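For the labelled subset, word error rate reduces to an edit distance over tokens. This is a minimal sketch, not a replacement for a full scoring toolkit (which would also handle normalization of casing and punctuation):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word-level WER: edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the first i reference
    # words and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)
```

Because punctuation counts as part of a token here, this sketch would penalize the punctuation-free `whisper-bsc-large-v3-cat` output unless references are normalized first.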