Multi-modal refers to systems, technologies, or models that can process and integrate information from multiple types of data sources or input modalities, such as text, images, audio, video, and sensor data. In computing and artificial intelligence (AI), multi-modal architectures are designed to understand and respond to complex, real-world inputs by combining insights from different data types.
Multi-modal systems use specialized encoders for each data type and then fuse the outputs into a unified representation. This fusion can occur at various stages—early (input-level), intermediate (feature-level), or late (decision-level)—depending on the application. The integrated representation allows the system to make more informed decisions, generate richer outputs, or perform tasks like cross-modal retrieval, multi-modal classification, and generative modeling.
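The fusion stages above can be sketched in a toy example. This is a minimal illustration, not a real model: the "encoders" below are stand-in functions that map each modality to a fixed-length vector, and the function names (`encode_text`, `encode_image`, etc.) are hypothetical. It contrasts feature-level fusion (concatenating per-modality feature vectors into one representation) with decision-level fusion (combining each modality's independent score).

```python
import numpy as np

# Toy per-modality "encoders": each maps raw input to a fixed-length
# feature vector. A real system would use a transformer, CNN, or
# audio front-end here; these stand-ins just normalize raw values.
def encode_text(text: str, dim: int = 4) -> np.ndarray:
    codes = np.frombuffer(text.encode("utf-8"), dtype=np.uint8)
    return np.resize(codes.astype(float), dim) / 255.0

def encode_image(pixels: np.ndarray, dim: int = 4) -> np.ndarray:
    return np.resize(pixels.astype(float).ravel(), dim) / 255.0

# Feature-level (intermediate) fusion: concatenate the per-modality
# feature vectors into one unified representation, which a downstream
# model would then process jointly.
def feature_fusion(text: str, pixels: np.ndarray) -> np.ndarray:
    return np.concatenate([encode_text(text), encode_image(pixels)])

# Decision-level (late) fusion: each modality produces its own score
# independently; the scores are combined (here, a simple average).
def decision_fusion(text: str, pixels: np.ndarray) -> float:
    text_score = encode_text(text).mean()
    image_score = encode_image(pixels).mean()
    return 0.5 * text_score + 0.5 * image_score
```

For example, `feature_fusion("cat", np.zeros((2, 2)))` yields a single 8-element vector combining both modalities, while `decision_fusion` returns one scalar built from two separate per-modality scores. Early (input-level) fusion would concatenate the raw inputs themselves before any encoding, which is only practical when the modalities share a compatible format.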
For example, a multi-modal AI model might analyze a video by combining visual frames, spoken dialogue, and textual metadata to understand context and sentiment.
Multi-modal systems are powered by:
