I have some prior experience with Gunicorn and other WSGI implementations, so there's some playing around and benchmarking to be done on the configured number of workers, as the GPU utilization of Whisper is a little spiky and whisper-asr-webservice does file format conversion on CPU via ffmpeg. The default was two workers and is now one, but I've found that as many as four with base can really improve overall utilization, response time, and scale (which certainly won't be possible with large). OP's node+express implementation shells out to Whisper, which gives more control (like runtime specification of the model) but almost certainly ends up slower and less efficient in the long run, as the model is obviously loaded from scratch on each invocation. I'm front-ending whisper-asr-webservice with traefik, so I could certainly do something like running two separate instances (one for base, another for large) at different URL paths, but like I said, I need to do some playing around with it. The other issue is that if this is being made available to the public, I doubt I'd be comfortable without front-ending the entire thing with Cloudflare (or similar), and Cloudflare (and others) have things like 100-second timeouts for the final HTTP response (WebSockets could get around this). Thanks for providing the Slim Shady examples; as a life-long hip hop enthusiast, I'm not offended by the content in the slightest.

We did a comparison of recent Vosk and Whisper models here: In general, Whisper is more accurate but much more resource-heavy. The accuracy difference for clean speech between Vosk-small and Whisper tiny is 2-3% absolute, 20% relative. Vosk runs on a single core while Whisper needs all CPU cores; not sure how important that is, I would claim it is not that critical. Whisper.cpp, recommended here, is actually 10% worse than stock Whisper because of speed considerations. Vosk is a streaming design: you get results with a minimum latency of 200 ms, whereas Whisper requires you to wait a significant amount of time, and if you refactor Whisper for lower latency you will lose a lot of its accuracy advantage. Latency is very important for interactive applications like assistants. Whisper is multilingual and has punctuation, which is clearly a good advantage, and it can also use context properly, improving results for long recordings. For server-based transcription Whisper is certainly better, but on mobile Vosk is still a viable option, as are many other mobile-focused engines.

But not much better than Nvidia NeMo, for example.

Not that much publicity for the former, though.

It's not as if people aren't trying to do that:

I tried out this notebook about a month ago, and it was rough. After spending an evening improving it, I got everything "working", but pyannote was not reliable. I tried it against an hour-ish audio sample and found no way to tune pyannote to keep track of ~10 speakers over the course of that audio. It would identify some of the earlier speakers, but then it felt like it lost attention and would just start labeling every new speaker as the same speaker. There is an option to force the minimum number of speakers higher, but that just caused it to split some of the earlier speakers into multiple labels; it did nothing to address the latter half of the audio. So, sure, someone should continue working on putting the pieces together, and I'm sure the notebook in the discussion I linked has probably improved since then, but I think pyannote itself needs some improvement first.
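For concreteness, the speaker-count option described in that last comment corresponds to the `min_speakers`/`max_speakers` arguments of the pyannote.audio diarization pipeline. A minimal sketch, assuming pyannote.audio 2.x and a Hugging Face access token; the file name and speaker bounds are illustrative:

```python
# Minimal pyannote.audio diarization sketch (assumes pyannote.audio 2.x).
# "meeting.wav" and the speaker bounds are illustrative values.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization",
    use_auth_token="hf_...",  # the pretrained pipeline is gated behind a HF token
)

# min_speakers/max_speakers constrain the clustering step: this is the
# "force the minimum number of speakers higher" option mentioned above.
diarization = pipeline("meeting.wav", min_speakers=10, max_speakers=12)

for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:7.1f}s  {turn.end:7.1f}s  {speaker}")
```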
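The streaming claim in the Vosk/Whisper comparison above is easy to see in Vosk's API: you push small buffers and read partial hypotheses back immediately. A minimal sketch, where the model path and audio file are illustrative; 3200 frames of 16-bit mono at 16 kHz is roughly 200 ms per chunk:

```python
# Minimal Vosk streaming sketch; model path and audio file are illustrative.
import wave
from vosk import Model, KaldiRecognizer

wf = wave.open("audio.wav", "rb")  # expects 16-bit mono PCM
model = Model("vosk-model-small-en-us-0.15")
rec = KaldiRecognizer(model, wf.getframerate())

while True:
    data = wf.readframes(3200)  # ~200 ms at 16 kHz
    if len(data) == 0:
        break
    if rec.AcceptWaveform(data):
        print(rec.Result())          # finalized segment (JSON string)
    else:
        print(rec.PartialResult())   # low-latency partial hypothesis

print(rec.FinalResult())
```

Whisper, by contrast, is fundamentally a batch model over 30-second windows, which is where the latency gap comes from.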
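If the two-instance traefik idea from the first comment were wired up, clients could pick a model by URL path prefix. A hypothetical sketch: the host name and the `/base` and `/large` prefixes are made up, and it assumes whisper-asr-webservice's `POST /asr` endpoint with a multipart `audio_file` field:

```python
# Hypothetical client for the two-instance routing idea. The host name and
# the /base and /large path prefixes are assumptions, not a real deployment.
import requests

def transcribe(path: str, size: str = "base") -> str:
    url = f"https://asr.example.com/{size}/asr"  # traefik routes by prefix
    with open(path, "rb") as f:
        resp = requests.post(
            url,
            params={"task": "transcribe", "output": "json"},
            files={"audio_file": f},
            timeout=90,  # stay under Cloudflare's ~100 s response timeout
        )
    resp.raise_for_status()
    return resp.json()["text"]

print(transcribe("clip.wav", size="large"))
```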
Good to know - I've tried large and it works, but in my case I'm using whisper-asr-webservice, which loads the configured model for each of the workers on startup.

Hah, I love that - "benchmark by fan speed".
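That load-once-per-worker behavior is exactly the argument against shelling out per request: the model stays resident and each request only pays for inference. A minimal sketch of the pattern, assuming openai-whisper and FastAPI; the endpoint name and model size are illustrative, and this is not whisper-asr-webservice's actual code:

```python
# Minimal "load the model once per worker" sketch. Endpoint name and model
# size are illustrative; this is not whisper-asr-webservice's actual code.
import tempfile

import whisper
from fastapi import FastAPI, UploadFile

app = FastAPI()
model = whisper.load_model("base")  # paid once at startup, not per request

@app.post("/transcribe")
async def transcribe(audio_file: UploadFile):
    # Whisper decodes the file from disk via ffmpeg, which probes the
    # container format, so persist the upload to a temporary file first.
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(await audio_file.read())
        tmp.flush()
        result = model.transcribe(tmp.name)
    return {"text": result["text"]}
```

Run this under more than one Uvicorn/Gunicorn worker and each worker loads its own copy of the model, which is the memory-versus-throughput trade-off being benchmarked in the comment above.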