

From the whisper.cpp benchmarks, it's showing transcription of 3:24 of audio with whisper medium.en in 30 seconds (on an M1 Pro!!!) - which is (again) incredible considering. That's 6.8x real-time. Depending on expectations vs. the GPU-powered OpenAI Whisper endpoint, it could be disappointing.

As an example, we've spent quite a bit of time optimizing our self-hosted Whisper API endpoint, and it can do 3 min of audio (the max we currently care about) in 2.5 seconds with large-v2 and beam size five on an RTX 3090. That's 72x real-time with a much larger, more capable, and more accurate model - and we have further work to do. Our focus is primarily "real time" dictation tasks with ~10 sec sentences.
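For anyone checking the numbers: the "x real-time" figures are just audio duration divided by wall-clock transcription time. A quick sketch (the function name is mine, not from either project):

```python
def real_time_factor(audio_seconds, wall_seconds):
    """Seconds of audio transcribed per second of wall-clock time."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    return audio_seconds / wall_seconds


# whisper.cpp, medium.en, M1 Pro: 3:24 of audio in 30 seconds
cpu = real_time_factor(3 * 60 + 24, 30)

# self-hosted endpoint, large-v2, beam size 5, RTX 3090: 3 min in 2.5 seconds
gpu = real_time_factor(3 * 60, 2.5)

print(f"CPU: {cpu:.1f}x, GPU: {gpu:.1f}x, ratio: {gpu / cpu:.1f}x")
# prints "CPU: 6.8x, GPU: 72.0x, ratio: 10.6x"
```

So on these two data points the GPU setup is roughly 10x faster, and that's while running a larger model with beam search.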

Why would you want to pay nearly $700 a year just to avoid running a program in the background on whatever computer you already have open? Yes, it's a nicer interface, but the current state of the "geeky" version is: type a command on the command line, with the path to the file. Unless you're really afraid of the command line, it's not that much more convenient.

The text line being highlighted while you listen is nice, but a) we wrote something that did it at the word level (as opposed to sentence-ish level) nearly 20 years ago, and b) in this context it's not actually that useful. You can click the text and go to the right place in the video. With spoken text (what this is best at) you click and go to the point where they're saying what you just read. Unless you really want to hear what you just read, there's not a lot of added value.

Would it be good for podcasts to use an interface like this for playback? Absolutely. It'd be a massive upgrade, but that's not what this is offering. Maybe someone will extract that code and let us combine the MP3 and timestamped text file in a web site (if that doesn't already exist). That'd be cool.

But the cost you propose is way too much for most people, especially in countries that aren't rich. In many places $400 a month is a really good salary. So yeah, if you're rich $700 a year is not a big deal, but.

Yes, but it doesn't provide an HTTP/whatever API - it's CLI. OP said "self host" so I assumed they're looking for an implementation that provides an API endpoint. It would be straightforward enough to create an API utilizing whisper.cpp, but I'm not aware that such a thing exists.

Additionally, depending on requirements, whisper.cpp is remarkably performant considering it's running on CPU, but it's still nowhere near competitive with GPU implementations (for obvious reasons).
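On the "whisper.cpp is CLI-only" point, here is roughly what a thin HTTP wrapper could look like. This is a sketch, not an existing project: the binary path `./main`, the model path, port 8080, and the names `transcribe` and `make_server` are all my assumptions (the binary name in particular depends on your whisper.cpp version and build), and it returns the CLI's raw stdout without any cleanup. It assumes a POSIX system and that the upload is already a WAV file the CLI accepts.

```python
import json
import subprocess
import tempfile
from http.server import BaseHTTPRequestHandler, HTTPServer

# Assumption: a local whisper.cpp build and model; adjust both paths.
WHISPER_CMD = ["./main", "-m", "models/ggml-medium.en.bin", "-f"]


def transcribe(wav_bytes, cmd=WHISPER_CMD):
    """Write the upload to a temp file, run the CLI on it, return its stdout."""
    with tempfile.NamedTemporaryFile(suffix=".wav") as tmp:
        tmp.write(wav_bytes)
        tmp.flush()
        result = subprocess.run(
            cmd + [tmp.name], capture_output=True, text=True, check=True
        )
    return result.stdout.strip()


class Handler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The request body is treated as the raw audio file.
        length = int(self.headers.get("Content-Length", 0))
        audio = self.rfile.read(length)
        try:
            status, body = 200, {"text": transcribe(audio)}
        except (subprocess.CalledProcessError, OSError) as exc:
            status, body = 500, {"error": str(exc)}
        payload = json.dumps(body).encode()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def make_server(port=8080):
    """Build the server; call .serve_forever() on the result to run it."""
    return HTTPServer(("127.0.0.1", port), Handler)
```

Run it with `make_server().serve_forever()` and POST a WAV file as the request body. It handles one request at a time and spawns a fresh process (reloading the model) per request, so it says nothing about the throughput gap discussed in this thread; it only shows that the API part is the easy bit.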
