AI Sees, Reasons, and Acts: New Vision Model Unveiled
9 Dec
Summary
- New open-source vision-language model enhances multimodal reasoning.
- Introduces native function calling for direct tool integration.
- Achieves state-of-the-art results on more than 20 benchmarks.

Chinese AI startup Zhipu AI has unveiled its GLM-4.6V series, a new generation of open-source vision-language models. The models are optimized for multimodal reasoning and front-end automation, featuring native function calling that lets them use tools directly on visual inputs. The series offers a 128,000-token context length and claims state-of-the-art results on more than 20 benchmarks, positioning it as a strong competitor in the AI landscape.
The GLM-4.6V models use an encoder-decoder architecture that pairs a Vision Transformer with an LLM decoder, supporting arbitrary image resolutions as well as video inputs. A key innovation is native multimodal function calling: visual assets can be passed to tools directly, without first being converted to text. This enables tasks such as generating structured reports from mixed documents and performing visual web searches.
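To make the idea concrete, the snippet below is a minimal sketch of what multimodal function calling could look like through an OpenAI-compatible chat API: an image and a prompt are sent together with a tool definition, and the model may respond with a structured tool call. The endpoint URL, the model identifier "glm-4.6v", and the "web_search" tool schema are illustrative assumptions, not documented details of Zhipu AI's API.

```python
# Sketch: sending an image plus a tool definition to an OpenAI-compatible
# chat endpoint. Endpoint, API key variable, model name, and tool schema
# are assumptions for illustration only.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # assumed endpoint
    api_key=os.environ["PROVIDER_API_KEY"],          # assumed env var
)

# A tool the model may call based on what it sees in the image.
tools = [{
    "type": "function",
    "function": {
        "name": "web_search",
        "description": "Search the web for a text query.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}]

messages = [{
    "role": "user",
    "content": [
        {"type": "image_url",
         "image_url": {"url": "https://example.com/product-page.png"}},
        {"type": "text",
         "text": "Find current reviews for the product shown in this screenshot."},
    ],
}]

response = client.chat.completions.create(
    model="glm-4.6v",  # assumed model identifier
    messages=messages,
    tools=tools,
)

# If the model chose to call a tool, its arguments come back as JSON.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```

The point of "native" function calling is visible in the message structure: the screenshot goes into the same turn as the tool definitions, so no intermediate captioning or OCR step is needed before the model can decide to invoke a tool.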
Distributed under the permissive MIT license, GLM-4.6V is suitable for enterprise adoption, offering flexibility for proprietary systems and local deployments. The models have demonstrated high performance, with the 106B version outperforming larger models on long-context tasks and video summarization, while the 9B Flash variant excels among lightweight models.
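For the local-deployment route the MIT license enables, a Hugging Face transformers workflow along the following lines would apply, assuming the weights are published with a transformers-compatible processor. The repository id "zai-org/GLM-4.6V" and the specific Auto classes used here are assumptions and may differ from the actual release.

```python
# Sketch: loading a vision-language checkpoint locally with transformers.
# The model id is a hypothetical placeholder; the exact model class may
# differ once the weights are released.
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "zai-org/GLM-4.6V"  # hypothetical repository id

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True,
)

# One user turn mixing an image with a text instruction.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/invoice.png"},
        {"type": "text", "text": "Summarize the line items in this invoice."},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(
    output[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True,
))
```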
