MiniCPM-o 2.6
A GPT-4o Level MLLM for Vision, Speech, and Multimodal Live Streaming on Your Phone
MiniCPM-o 2.6 is the latest and most capable model in the MiniCPM-o series. The model is built in an end-to-end fashion based on SigLip-400M, Whisper-medium-300M, ChatTTS-200M, and Qwen2.5-7B with a total of 8B parameters. It exhibits a significant performance improvement over MiniCPM-V 2.6, and introduces new features for real-time speech conversation and multimodal live streaming. Notable features of MiniCPM-o 2.6 include:
- Leading Visual Capability. MiniCPM-o 2.6 achieves an average score of 70.2 on OpenCompass, a comprehensive evaluation over 8 popular benchmarks.
- State-of-the-art Speech Capability. MiniCPM-o 2.6 supports bilingual real-time speech conversation with configurable voices in English and Chinese.
- Strong Multimodal Live Streaming Capability. As a new feature, MiniCPM-o 2.6 can accept continuous video and audio streams independent of user queries, and supports real-time speech interaction.
- Strong OCR Capability and Others. Advancing popular visual capabilities from the MiniCPM-V series, MiniCPM-o 2.6 can process images with any aspect ratio and up to 1.8 million pixels (e.g., 1344x1344).
- Superior Efficiency. In addition to its friendly size, MiniCPM-o 2.6 also shows state-of-the-art token density (i.e., the number of pixels encoded into each visual token).
- Easy Usage. MiniCPM-o 2.6 can be easily used in various ways: (1) llama.cpp support for efficient CPU inference on local devices, (2) int4 and GGUF format quantized models in 16 sizes, (3) vLLM support for high-throughput and memory-efficient inference, (4) fine-tuning on new domains and tasks with LLaMA-Factory, (5) quick local WebUI demo setup with Gradio, and (6) an online web demo.
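To make the token-density claim above concrete, here is a minimal sketch of how the metric is computed. The 1344x1344 input size comes from the feature list; the 640-visual-token budget is an assumed figure for illustration, not a number stated in this document.

```python
# Token density: pixels encoded into each visual token (higher = fewer
# tokens needed to represent the same image, i.e., cheaper inference).
def token_density(width: int, height: int, num_visual_tokens: int) -> float:
    """Return the number of pixels represented by each visual token."""
    return (width * height) / num_visual_tokens

# A maximum-size 1344x1344 input (~1.8 million pixels), encoded into an
# assumed budget of 640 visual tokens (hypothetical value for illustration).
density = token_density(1344, 1344, 640)
print(round(density))  # ~2822 pixels per token under this assumption
```

Under these assumptions, each visual token carries roughly 2.8k pixels of information; a model with a lower token density would spend proportionally more tokens (and thus compute) on the same image.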