Huge release from ByteDance Doubao.
This is indeed a big release.
Deepseek is an advanced AI lab devoted to frontier research, real scientific research, i.e. research with corresponding publications and disclosures. It is what OpenAI was in its first years, when the name still made sense... before Altman completely turned the company toward a highly commercial, profit-oriented business.
But Deepseek will never be the Google or Meta of China, nor is it their goal to be.
Instead, ByteDance will be.
Reading their announcement, I'd like to highlight the improved speech capability, which may seem secondary compared to "thinking" but actually is not:
"In terms of speech multi-modality, we have proposed a new Speech2Speech end-to-end framework, which not only deeply integrates the speech and text modalities through native methods, but also achieves true end-to-end speech understanding and generation in speech conversations. Compared with the traditional ASR+LLM+TTS cascade method, there is a qualitative leap in the dialogue quality."
The traditional speech integration in an LLM is done with two independent extra modules: a "speech to text" (ASR) module that listens and converts the speech into text for the LLM to read, and a "text to speech" (TTS) module that takes the text output of the LLM and converts it into speech. This is the simplest technical approach, but for the LLM it is like interfacing with people only by reading and writing text on a phone: all the non-textual information conveyed by tone, speed and mood is lost in both directions.
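To make the information loss concrete, here is a minimal sketch of such a cascade. The class names and canned outputs are hypothetical stand-ins, not any vendor's real API; the point is only to show where the non-textual information disappears:

```python
# Toy cascade: ASR -> LLM -> TTS, with dummy placeholder components.

class DummyAsr:
    def transcribe(self, audio: bytes) -> str:
        # Real ASR keeps only the words: tone, pace and emotion are discarded here.
        return "hello, how are you?"

class DummyLlm:
    def generate(self, text: str) -> str:
        # The LLM only ever sees flat text, as if the user had typed it.
        return f"You said: {text} I'm doing fine, thanks!"

class DummyTts:
    def synthesize(self, text: str) -> bytes:
        # The voice, intonation and pacing are decided by the TTS, not by the LLM.
        return text.encode("utf-8")  # stand-in for an audio waveform

def cascade_reply(audio_in: bytes) -> bytes:
    asr, llm, tts = DummyAsr(), DummyLlm(), DummyTts()
    transcript = asr.transcribe(audio_in)   # speech -> text (lossy)
    answer = llm.generate(transcript)       # text -> text
    return tts.synthesize(answer)           # text -> speech (lossy again)

print(cascade_reply(b"\x00\x01"))  # three sequential stages, each adding latency
```

Each stage runs sequentially and each conversion is lossy, which is exactly what the end-to-end approach described next removes.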
Instead, they got rid of the two external modules and made the LLM directly listen to and produce "sound" tokens. We know that words are converted into tokens for the LLM to process. Here the sound tokens are simply representations of very short (a few tens of milliseconds) chunks of sound that, all together, form the input voice stream. These sound tokens are much closer to the actual sounds than to the words the voice represents, so the LLM can understand not only the words but also how the words are spoken.

The same goes for the output: the LLM is free to choose how the words should be modulated and pronounced. Moreover, the latency (i.e. the pauses) is greatly reduced, because the processing needed to turn the voice into sound tokens is very thin and simple compared to a traditional speech-to-text module. All the heavy lifting of inferring words from sounds is done by the LLM itself, which is why this approach is harder.
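Below is a toy sketch of the sound-token idea, purely to illustrate the framing and quantization step. The frame length, the random codebook and the nearest-neighbour lookup are my own assumptions for illustration; real systems use a learned neural audio codec, and ByteDance has not disclosed theirs:

```python
import numpy as np

SAMPLE_RATE = 16_000                          # samples per second
FRAME_MS = 20                                 # a few tens of milliseconds per token, as described above
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000    # 320 samples per frame
CODEBOOK_SIZE = 1024                          # vocabulary of "sound tokens"

rng = np.random.default_rng(0)
# Toy stand-in for a learned audio codec: one random reference frame per token id.
codebook = rng.standard_normal((CODEBOOK_SIZE, FRAME_LEN))

def speech_to_tokens(waveform: np.ndarray) -> list[int]:
    """Chop raw audio into short frames and map each frame to the nearest codebook id."""
    n_frames = len(waveform) // FRAME_LEN
    frames = waveform[: n_frames * FRAME_LEN].reshape(n_frames, FRAME_LEN)
    # Squared L2 distance from every frame to every codebook entry.
    dists = (frames ** 2).sum(1, keepdims=True) - 2 * frames @ codebook.T + (codebook ** 2).sum(1)
    return dists.argmin(axis=1).tolist()

def tokens_to_speech(token_ids: list[int]) -> np.ndarray:
    """Invert the mapping: token ids back to (approximate) audio frames."""
    return codebook[token_ids].reshape(-1)

one_second_of_audio = rng.standard_normal(SAMPLE_RATE)
tokens = speech_to_tokens(one_second_of_audio)
print(len(tokens), "sound tokens for 1 second of audio")  # 50 tokens at 20 ms per frame
reconstructed = tokens_to_speech(tokens)                   # what output ids decode back into
```

In the end-to-end design, ids like these sit in the same context window as ordinary text tokens, and the model's output ids from the same vocabulary are decoded straight back into audio, which is where the preserved tone and the low latency come from.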
Although there is quite a lot of literature and many prototypes on this approach, as of today only OpenAI and Google have production-quality implementations: OpenAI since June last year, Google since a few months ago. Now ByteDance as well.
This feature is a key enabler for AI to succeed in the mass consumer market.
Here is some video material on this new feature from ByteDance: