Multimodal AI: Turning a One-Trick Pony into Jack of All Trades

Here’s what you need to know about multimodal AI, specific use cases, as well as the challenges that must be overcome to ensure its effective use.

Carlos Meléndez, Co-Founder and VP of Operations, Wovenware

June 21, 2024


Just when you think artificial intelligence could not do more to reduce mundane workloads, create content from scratch, sort through massive amounts of data to derive insights, or identify anomalies on an X-ray, along comes multimodal AI.  

Until very recently, AI was mostly focused on understanding and processing a single type of information, such as text or images -- a one-trick pony, so to speak. Today, however, there's a new entrant into the world of AI, a true jack of all trades: multimodal AI. This new class of AI integrates multiple modalities -- such as images, video, audio, and text -- and can process several kinds of data input at once.

What multimodal AI really delivers is context. Since it can recognize patterns and connections between different types of data inputs, the output is richer and more intuitive, getting closer to multi-faceted human intelligence than ever before.  

Just as generative AI (GenAI) has done over the past year, multimodal AI promises to revolutionize almost all industries and bring a whole new level of insights and automation to human-machine interactions. 

Already, many Big Tech players are vying to dominate multimodal AI. One of the most recent entrants is X (formerly Twitter), which launched Grok 1.5 and claims it outperforms its competitors in real-world spatial understanding. Other players include Apple MM1, Anthropic Claude 3, Google Gemini, Meta ImageBind, and OpenAI GPT-4.


While AI comes in many forms -- from machine learning and deep learning to predictive analytics and computer vision -- the real showstopper for multimodal AI is computer vision. With multimodal AI, computer vision's capabilities go far beyond simple object identification. By combining many types of data, the AI solution can understand the context of an image and make more accurate decisions. For example, pairing an image of a cat with audio of a cat meowing gives the model greater accuracy when identifying images of cats. Similarly, combining an image of a face with video helps AI not only identify specific people in photos, but also gain greater contextual awareness.
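The idea of combining an image and its accompanying audio boils down to what practitioners call late fusion: each modality is encoded separately, and the embeddings are joined into one vector for a downstream classifier. A minimal sketch, using toy stand-in encoders rather than real vision or audio models:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_image(image: np.ndarray) -> np.ndarray:
    """Stand-in image encoder: flatten pixels and project to a 4-dim embedding."""
    w = rng.standard_normal((image.size, 4))
    return image.flatten() @ w

def encode_audio(audio: np.ndarray) -> np.ndarray:
    """Stand-in audio encoder: simple waveform statistics as a 4-dim embedding."""
    return np.array([audio.mean(), audio.std(), audio.min(), audio.max()])

def fuse(image: np.ndarray, audio: np.ndarray) -> np.ndarray:
    """Late fusion: concatenate per-modality embeddings into one joint vector."""
    return np.concatenate([encode_image(image), encode_audio(audio)])

image = rng.random((8, 8))   # toy stand-in for a cat photo
audio = rng.random(16)       # toy stand-in for a meow waveform
joint = fuse(image, audio)
print(joint.shape)           # (8,): 4 image dims + 4 audio dims
```

In a production system the stand-in encoders would be replaced by pretrained vision and audio networks, but the fusion step -- joining per-modality embeddings so one classifier sees both -- works the same way.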

Multimodal AI Out in the Field 

Use cases for multimodal AI are just beginning to surface, and as it evolves it will be used in ways not even imaginable today.  Consider some of the ways it is or could be applied: 

  • Ecommerce. Multimodal AI could analyze text, images, and video in social media data to tailor offerings to specific people or segments of people. 

  • Automotive. Multimodal AI can improve the capabilities and safety of self-driving cars by combining data from multiple sensors, such as cameras, radar or GPS systems, for heightened accuracy.  

  • Healthcare. It can use data from images and scans, electronic health records, and genetic test results to assist clinicians in making more accurate diagnoses and building more personalized treatment plans. 

  • Finance. It can enable heightened risk assessment by analyzing data in various formats to gain deeper insight into specific individuals and their risk level for mortgages and other loans. 

  • Conservation. Multimodal AI could identify whales from satellite imagery combined with audio recordings of whale sounds to track migration patterns and changing feeding areas. 
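The automotive case above hinges on sensor fusion: readings from cameras, radar, and other sensors are merged, with noisier sensors trusted less. One classic approach is inverse-variance weighting. A minimal sketch, with hypothetical distance-to-obstacle readings:

```python
import numpy as np

def fuse_estimates(estimates, variances):
    """Inverse-variance weighted fusion: each sensor's weight is 1/variance,
    so noisier sensors contribute less to the fused estimate."""
    w = 1.0 / np.asarray(variances, dtype=float)
    fused = np.sum(w * np.asarray(estimates, dtype=float)) / np.sum(w)
    fused_var = 1.0 / np.sum(w)   # fused estimate is less noisy than either input
    return fused, fused_var

# Hypothetical readings (meters): camera says 25.0 m (noisy), radar says 24.2 m (precise)
fused, var = fuse_estimates([25.0, 24.2], [4.0, 1.0])
print(round(fused, 2))  # 24.36 -- pulled toward the lower-variance radar reading
```

The fused variance (0.8 here) is smaller than either sensor's alone, which is the whole point: combining modalities yields an estimate more reliable than any single input.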


The Challenges of Bringing Multimodal AI into Operations 

Multimodal AI is an exciting development, but it still has a long way to go. A fundamental challenge lies in integrating information from disparate sources cohesively. This involves developing algorithms and models capable of extracting meaningful insights from each modality and integrating them to generate comprehensive interpretations. 

Another challenge is the scarcity of clean, labeled multimodal datasets for training AI models. Unlike single-modality datasets, which are more plentiful, multimodal datasets require annotations that capture correlations between different modalities, making their creation more labor-intensive and resource-intensive. Yet achieving the right balance between modalities is crucial for ensuring the accuracy and reliability of multimodal AI systems. 
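What makes multimodal annotation labor-intensive is that each training record must tie several modalities to shared labels and to each other. A minimal sketch of what one such record might look like; the field names here are illustrative, not a real dataset's schema:

```python
from dataclasses import dataclass

@dataclass
class MultimodalSample:
    """One training example pairing several modalities with a shared label.

    The alignment field is what single-modality datasets don't need:
    an annotation linking the audio span to the image frame.
    """
    image_path: str
    audio_path: str
    transcript: str
    label: str                 # one label must hold across all modalities
    alignment_s: tuple         # (start, end) in seconds tying audio to the frame

sample = MultimodalSample(
    image_path="frames/clip01_0042.jpg",
    audio_path="audio/clip01.wav",
    transcript="a cat meowing near the window",
    label="cat",
    alignment_s=(4.2, 5.1),
)
print(sample.label)  # cat
```

Every field beyond the label is extra annotation work per example, which is why clean multimodal datasets remain scarce relative to single-modality ones.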


As with other forms of AI, ensuring unbiased multimodal AI is a key consideration, made more difficult by the varied types of data involved. Diverse images, text, video, and audio need to be factored into the development of solutions, as do the biases that can arise from the developers themselves. 

Data privacy and protection also need to be considered, given the vast amount of personal data that multimodal AI systems may process. Questions could arise about data ownership, consent, and protection against misuse, especially when humans are not fully in control of the AI's output. 

Addressing these ethical challenges requires a collaborative effort involving developers, government, industry leaders, and individuals. Transparency, accountability, and fairness must be prioritized throughout the development lifecycle of multimodal AI systems to mitigate their risks and foster trust among users. 

Multimodal AI is bringing the capabilities of AI to new heights, enabling richer and deeper insights than previously possible. Yet no matter how smart AI becomes, it can never replace the human mind and its many facets of knowledge, intuition, experience, and reasoning. AI still has a long way to go toward that level of understanding -- but multimodal AI is a start. 


About the Author(s)

Carlos Meléndez

Co-Founder and VP of Operations, Wovenware

Carlos Meléndez is co-founder and VP of Operations of Wovenware, a San Juan-based provider of custom software engineering and AI services, and an AI center of excellence for its parent company, Maxar Technologies. Prior to cofounding Wovenware, Carlos was a senior software engineer with several start-up software firms and held strategic consulting positions with the global consulting firm Accenture. 
