AI Sees, Reasons, and Acts: New Vision Model Unveiled
9 Dec
Summary
- New open-source vision-language model enhances multimodal reasoning.
- Introduces native function calling for direct tool integration.
- Achieves state-of-the-art results on more than 20 benchmarks.

Chinese AI startup Zhipu AI has unveiled its GLM-4.6V series, a new generation of open-source vision-language models. The models are optimized for multimodal reasoning and front-end automation, featuring native function calling that lets them use tools directly on visual inputs. The series offers a 128,000-token context length and claims state-of-the-art results on more than 20 benchmarks, positioning it as a strong competitor in the AI landscape.
The GLM-4.6V models use an encoder-decoder architecture that pairs a Vision Transformer with an LLM decoder, supporting arbitrary image resolutions as well as video inputs. A key innovation is native multimodal function calling: visual assets can be passed to tools directly, without first being converted to text. This enables tasks such as generating structured reports from mixed documents and performing visual web searches.
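To make the idea concrete, the snippet below is a minimal sketch of what multimodal function calling could look like through an OpenAI-compatible chat API: an image and a prompt are sent together with a tool definition, and the model may respond with a structured tool call. The endpoint URL, the model identifier "glm-4.6v", and the "web_search" tool schema are illustrative assumptions, not documented details of Zhipu AI's API.

```python
# Sketch: sending an image plus a tool definition to an OpenAI-compatible
# chat endpoint. Endpoint, API key variable, model name, and tool schema
# are assumptions for illustration only.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # assumed endpoint
    api_key=os.environ["PROVIDER_API_KEY"],          # assumed env var
)

# A tool the model may call based on what it sees in the image.
tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web for a text query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url",
         "image_url": {"url": "https://example.com/product-page.png"}},
        {"type": "text",
         "text": "Find current reviews for the product shown in this screenshot."},
    ],
}]

response = client.chat.completions.create(
    model="glm-4.6v",  # assumed model identifier
    messages=messages,
    tools=tools,
)

# If the model chose to call a tool, its arguments come back as JSON.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

The point of "native" function calling is visible in the message structure: the screenshot goes into the same turn as the tool definitions, so no intermediate captioning or OCR step is needed before the model can decide to invoke a tool.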
Distributed under the permissive MIT license, GLM-4.6V is suitable for enterprise adoption, offering flexibility for proprietary systems and local deployments. The models have demonstrated high performance, with the 106B version outperforming larger models on long-context tasks and video summarization, while the 9B Flash variant excels among lightweight models.
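For the local-deployment route the MIT license enables, a Hugging Face transformers workflow along the following lines would apply, assuming the weights are published with a transformers-compatible processor. The repository id "zai-org/GLM-4.6V" and the specific Auto classes used here are assumptions and may differ from the actual release.

```python
# Sketch: loading a vision-language checkpoint locally with transformers.
# The model id is a hypothetical placeholder; the exact model class may
# differ once the weights are released.
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "zai-org/GLM-4.6V"  # hypothetical repository id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# One user turn mixing an image with a text instruction.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/invoice.png"},
        {"type": "text", "text": "Summarize the line items in this invoice."},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(
    output[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
))
```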
