๐Ÿ“ฐ Story

simon_willison ยท Apr 27, 2026 ยท news

โ† Live feed ๐Ÿ“ฐ Daily recap ๐Ÿ—“๏ธ Weekly recap ๐Ÿ”” RSS

microsoft/VibeVoice

microsoft/VibeVoice VibeVoice is Microsoft's Whisper-style audio model for speech-to-text, MIT licensed and with speaker diarization built into the model. Microsoft released it on January 21st, 2026 but I hadn't tried it until today. Here's a one-liner to run it on a Mac with uv , mlx-audio (by Prince Canuma) and the 5.71GB mlx-community/VibeVoice-ASR-4bit MLX conversion of the 17.3GB VibeVoice-ASR model, in this case against a downloaded copy of my recent podcast appearance with Lenny Rachitsky : uv run --with mlx-audio mlx_audio.stt.generate \ --model mlx-community/VibeVoice-ASR-4bit \ --audio lenny.mp3 --output-path lenny \ --format json --verbose --max-tokens 32768 The tool reported back: Processing time: 524.79 seconds Prompt: 26615 tokens, 50.718 tokens-per-sec Generation: 20248 tokens, 38.585 tokens-per-sec Peak memory: 30.44 GB So that's 8 minutes 45 seconds for an hour of audio (running on a 128GB M5 Max MacBook Pro). I've tested it against .wav and .mp3 files and they both worked fine. If you omit --max-tokens it defaults to 8192, which is enough for about 25 minutes of audio. I discovered that through trial-and-error and quadrupled it to guarantee I'd get the full hour. That command reported using 30.44GB of RAM at peak, but in Activity Monitor I observed 61.5GB of usage during the prefill stage and around 18GB during the generating phase. Here's the resulting JSON . The key structure looks like this: { "text": "And an open question for me is how many other knowled

Read the original at simonwillison.net โ†’Open in live feed

Related stories 4 items