Multimodal AI: When Will ChatGPT Understand Images and Video?

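Preface

Remember how striking it felt the first time you used ChatGPT? Through text conversation alone, the AI could grasp our intent, answer complex questions, and even help us write code. But have you ever wondered what it would be like if ChatGPT could not only "read" text but also "see" images, "understand" video, and "hear" audio?

That is the future multimodal AI is building. As Java developers, we handle many data formats every day: JSON, XML, images, video files. Multimodal AI is like a general-purpose data processor that can interpret and combine all of these kinds of data at once.

Today, let's explore this field together and see how it changes our picture of AI and how it will affect our development work.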

What Is Multimodal AI?

From Single-Modal to Multimodal

Picture traditional AI as a specialist who can only read text:

// How a traditional single-modal AI processes input
public class TraditionalAI {
    public String processText(String input) {
        // Accepts text input only
        return analyzeText(input);
    }
}

Multimodal AI, by contrast, is more like an all-round expert:

// How a multimodal AI processes input
public class MultimodalAI {
    public Response process(MultimodalInput input) {
        String textResult = processText(input.getText());
        String imageResult = processImage(input.getImage());
        String audioResult = processAudio(input.getAudio());
        String videoResult = processVideo(input.getVideo());
        
        // Fuse the information from all modalities
        return fuseResults(textResult, imageResult, audioResult, videoResult);
    }
}

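Core Characteristics of Multimodal AI

1. Understanding multiple input types

Text: natural language processing
Image: computer vision
Audio: speech recognition and audio analysis
Video: temporal visual understanding

2. Cross-modal association

Understanding text that appears inside an image
Linking speech to visual content
Analyzing the actions and dialogue in a video

3. Unified output

Describing an image in text
Generating an image from a description
Creating multimedia content

The Technical Architecture of Multimodal AI

Let's look at the architecture of multimodal AI in terms familiar to Java developers: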

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#1e90ff', 'primaryTextColor': '#fff' }}}%%
flowchart TD
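    A[Multimodal input] --> B[Modality encoders]
    B --> C[Feature extraction]
    C --> D[Cross-modal fusion]
    D --> E[Unified representation]
    E --> F[Task decoder]
    F --> G[Multimodal output]
    
    subgraph encoders [Modality encoders]
        B1[Text encoder<br/>BERT/GPT]
        B2[Image encoder<br/>ViT/ResNet]
        B3[Audio encoder<br/>Wav2Vec]
        B4[Video encoder<br/>3D CNN]
    end
    
    subgraph fusion [Fusion strategies]
        D1[Early fusion<br/>feature level]
        D2[Mid-level fusion<br/>attention mechanism]
        D3[Late fusion<br/>decision level]
    end
    
    B --> encoders
    D --> fusion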
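
To make two of the fusion strategies in the diagram concrete, here is a minimal sketch contrasting early (feature-level) and late (decision-level) fusion. The class and method names are assumptions for illustration, plain double arrays stand in for real feature vectors and class scores, and production systems would normally learn the fusion weights rather than pass them in.

```java
import java.util.Arrays;

/** Toy comparison of early (feature-level) and late (decision-level) fusion. */
final class FusionStrategies {

    /** Early fusion: concatenate per-modality feature vectors into one vector. */
    static double[] earlyFusion(double[] textFeatures, double[] imageFeatures) {
        double[] fused = Arrays.copyOf(textFeatures, textFeatures.length + imageFeatures.length);
        System.arraycopy(imageFeatures, 0, fused, textFeatures.length, imageFeatures.length);
        return fused;
    }

    /** Late fusion: each modality scores the classes on its own; blend the scores. */
    static double[] lateFusion(double[] textScores, double[] imageScores, double textWeight) {
        if (textScores.length != imageScores.length) {
            throw new IllegalArgumentException("score vectors must cover the same classes");
        }
        double[] fused = new double[textScores.length];
        for (int i = 0; i < fused.length; i++) {
            fused[i] = textWeight * textScores[i] + (1.0 - textWeight) * imageScores[i];
        }
        return fused;
    }
}
```

Early fusion lets a downstream model see raw features from every modality at once, while late fusion keeps the per-modality models independent and only combines their decisions.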

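Core Components

1. Modal Encoders

This is much like defining a separate parser for each data type in Java:

import java.awt.image.BufferedImage;

// Vector here stands for an embedding type (for example a float[] wrapper),
// not java.util.Vector; transformerModel and cnnModel are assumed fields.
public interface ModalEncoder<T> {
    Vector encode(T input);
}

public class TextEncoder implements ModalEncoder<String> {
    @Override
    public Vector encode(String text) {
        // Encode the text with a Transformer model
        return transformerModel.encode(text);
    }
}

public class ImageEncoder implements ModalEncoder<BufferedImage> {
    @Override
    public Vector encode(BufferedImage image) {
        // Encode the image with a CNN model
        return cnnModel.encode(image);
    }
}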

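2. Cross-Modal Fusion

This is the heart of multimodal AI, loosely comparable to a JOIN in a database: features from different modalities are matched up and combined.

public class CrossModalFusion {
    public Vector fuse(Vector textFeature, Vector imageFeature) {
        // Fuse the features with an attention mechanism
        AttentionWeights weights = calculateAttention(textFeature, imageFeature);
        return weights.apply(textFeature, imageFeature);
    }
    
    private AttentionWeights calculateAttention(Vector text, Vector image) {
        // Measure how strongly the text and image features are related.
        // Simplified: one similarity score stands in for a full attention matrix.
        double similarity = cosineSimilarity(text, image);
        return new AttentionWeights(similarity);
    }
}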
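
The snippet above reduces attention to a single similarity score. As a slightly fuller sketch of the idea, the code below lets one text query attend over several image region vectors using scaled dot-product attention with a softmax. All names are assumptions for illustration, and the region vectors serve as both keys and values, which real models would replace with learned projections.

```java
/** Minimal scaled dot-product attention: one text query over image regions. */
final class CrossAttentionSketch {

    /** Numerically stable softmax: subtract the max before exponentiating. */
    static double[] softmax(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double[] out = new double[scores.length];
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            out[i] = Math.exp(scores[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    static double dot(double[] a, double[] b) {
        double result = 0.0;
        for (int i = 0; i < a.length; i++) {
            result += a[i] * b[i];
        }
        return result;
    }

    /** Returns the attention-weighted average of the image region vectors. */
    static double[] attend(double[] textQuery, double[][] imageRegions) {
        int dim = textQuery.length;
        double[] scores = new double[imageRegions.length];
        for (int i = 0; i < imageRegions.length; i++) {
            scores[i] = dot(textQuery, imageRegions[i]) / Math.sqrt(dim);
        }
        double[] weights = softmax(scores);
        double[] fused = new double[dim];
        for (int i = 0; i < imageRegions.length; i++) {
            for (int j = 0; j < dim; j++) {
                fused[j] += weights[i] * imageRegions[i][j];
            }
        }
        return fused;
    }
}
```

Regions that align with the query receive weights close to 1, so the fused vector is dominated by the parts of the image the text is "asking about".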

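The History of Multimodal AI

Let's review how multimodal AI has developed, much as we would trace the release history of an open-source project: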

%%{init: {'theme': 'base', 'themeVariables': { 'primaryColor': '#1e90ff', 'primaryTextColor': '#fff' }}}%%
timeline
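    title History of Multimodal AI
    
    2017 : Transformer architecture introduced
         : "Attention Is All You Need"
    
    2021 : CLIP released
         : OpenAI links images and text
    
    2022 : DALL-E 2 released
         : Breakthrough in text-to-image generation
    
    2023 : GPT-4V released
         : ChatGPT gains vision
         : Google Gemini released
    
    2024 : GPT-4o released
         : Real-time multimodal interaction
         : Claude 3.5 Sonnet
         : Improved video understanding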

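Key Milestones

2017: The birth of the Transformer

Google publishes the paper "Attention Is All You Need"
It lays the technical foundation for multimodal AI
The attention mechanism becomes a core technique

2021: The CLIP breakthrough

OpenAI releases CLIP (Contrastive Language-Image Pre-training)
Joint image-text understanding learned from image-text pairs at large scale
Zero-shot image classification

2022: Generative AI takes off

DALL-E 2: high-quality text-to-image generation
Stable Diffusion: open-source image generation
Multimodal generation improves sharply

2023: Commercial adoption

GPT-4V: ChatGPT gains vision
Google Gemini: natively multimodal design
Multimodal AI enters mainstream applications

2024: The era of real-time interaction

GPT-4o: real-time voice and vision interaction
Claude 3.5 Sonnet: stronger visual understanding
Video understanding improves markedly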

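Comparing Mainstream Multimodal AI Models

As developers, let's look at the main multimodal models on the market today, the same way we would evaluate a technology stack:

Model	Vendor	Modalities	Key strengths	Typical use cases
GPT-4o	OpenAI	Text + image + audio	Real-time interaction, fast responses	Conversational assistants, content creation
Claude 3.5 Sonnet	Anthropic	Text + image	Strong visual understanding	Document analysis, chart interpretation
Gemini Pro	Google	Text + image + audio + video	Long context, natively multimodal	Complex tasks, video analysis
LLaVA	Open-source community	Text + image	Open source, customizable	Research, private deployment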
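Technical Comparison

Performance (figures attributed to public benchmarks; the benchmark versions are not cited, so treat the numbers as indicative only):

import java.util.Map;

public class ModelPerformance {
    // Visual question answering (VQA) accuracy, in percent
    private static final Map<String, Double> VQA_ACCURACY = Map.of(
        "GPT-4o", 85.7,
        "Claude-3.5-Sonnet", 88.3,
        "Gemini-Pro", 82.1,
        "LLaVA-1.5", 78.5
    );
    
    // Image captioning quality (BLEU score)
    private static final Map<String, Double> CAPTION_QUALITY = Map.of(
        "GPT-4o", 42.1,
        "Claude-3.5-Sonnet", 45.2,
        "Gemini-Pro", 40.8,
        "LLaVA-1.5", 38.9
    );
}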

Application Scenarios for Multimodal AI

1. Intelligent customer service

Imagine a customer service system that understands the screenshots, voice messages, and text descriptions users submit:

@Service
public class MultimodalCustomerService {
    
    @Autowired
    private MultimodalAI aiModel;
    
    public ServiceResponse handleCustomerQuery(CustomerInput input) {
        // Analyze every modality the user supplied
        MultimodalAnalysis analysis = aiModel.analyze(
            input.getText(),
            input.getScreenshot(),
            input.getVoiceMessage()
        );
        
        // Produce a combined solution
        return generateSolution(analysis);
    }
    
    private ServiceResponse generateSolution(MultimodalAnalysis analysis) {
        // Build the solution from the multimodal analysis
        String textResponse = analysis.getTextSolution();
        List<String> visualSteps = analysis.getVisualInstructions();
        String videoTutorial = analysis.getVideoRecommendation();
        
        return new ServiceResponse(textResponse, visualSteps, videoTutorial);
    }
}

2. Code review assistant

An AI assistant that understands code, architecture diagrams, and documentation:

@Component
public class CodeReviewAssistant {
    
    public ReviewResult reviewCode(CodeSubmission submission) {
        // Analyze the code files
        CodeAnalysis codeAnalysis = analyzeCode(submission.getCodeFiles());
        
        // Interpret the architecture diagram
        ArchitectureAnalysis archAnalysis = analyzeArchitecture(
            submission.getArchitectureDiagram()
        );
        
        // Read the design documents
        DocumentAnalysis docAnalysis = analyzeDocuments(
            submission.getDesignDocuments()
        );
        
        // Combine the three analyses
        return generateReview(codeAnalysis, archAnalysis, docAnalysis);
    }
    
    private ReviewResult generateReview(
            CodeAnalysis code, 
            ArchitectureAnalysis arch, 
            DocumentAnalysis doc) {
        
        List<Issue> issues = new ArrayList<>();
        
        // Check that the code is consistent with the architecture
        if (!code.matchesArchitecture(arch)) {
            issues.add(new Issue("Implementation does not match the architecture design"));
        }
        
        // Check that the documentation is consistent with the implementation
        if (!doc.matchesImplementation(code)) {
            issues.add(new Issue("Documentation does not match the implementation"));
        }
        
        return new ReviewResult(issues, generateSuggestions());
    }
}

3. Intelligent monitoring

Intelligent operations that combine logs, monitoring charts, and alerts:

@Service
public class IntelligentMonitoring {
    
    public IncidentAnalysis analyzeIncident(IncidentData incident) {
        // Analyze the error logs
        LogAnalysis logAnalysis = analyzeErrorLogs(incident.getLogs());
        
        // Interpret the monitoring charts
        MetricAnalysis metricAnalysis = analyzeMetricCharts(
            incident.getPerformanceCharts()
        );
        
        // Analyze the alerts
        AlertAnalysis alertAnalysis = analyzeAlerts(incident.getAlerts());
        
        // Combine everything into one diagnosis
        return diagnoseIssue(logAnalysis, metricAnalysis, alertAnalysis);
    }
    
    private IncidentAnalysis diagnoseIssue(
            LogAnalysis logs, 
            MetricAnalysis metrics, 
            AlertAnalysis alerts) {
        
        // Let the AI identify the root cause
        String rootCause = identifyRootCause(logs, metrics, alerts);
        
        // Generate remediation suggestions
        List<String> solutions = generateSolutions(rootCause);
        
        // Estimate the scope of the impact
        ImpactAssessment impact = assessImpact(metrics, alerts);
        
        return new IncidentAnalysis(rootCause, solutions, impact);
    }
}

Technical Challenges and Solutions

1. Data alignment

Aligning data from different modalities in time and space is a major challenge:

public class ModalityAlignment {
    
    // Temporal alignment: keep audio and video in sync
    public AlignedData alignTemporalData(AudioData audio, VideoData video) {
        // Align on timestamps by keeping only the window both streams cover
        long startTime = Math.max(audio.getStartTime(), video.getStartTime());
        long endTime = Math.min(audio.getEndTime(), video.getEndTime());
        if (startTime >= endTime) {
            throw new IllegalArgumentException("Audio and video do not overlap in time");
        }
        
        AudioSegment alignedAudio = audio.getSegment(startTime, endTime);
        VideoSegment alignedVideo = video.getSegment(startTime, endTime);
        
        return new AlignedData(alignedAudio, alignedVideo);
    }
    
    // Semantic alignment: match text content to image content
    public SemanticAlignment alignSemanticContent(String text, BufferedImage image) {
        // Extract the visual concepts mentioned in the text
        List<VisualConcept> textConcepts = extractVisualConcepts(text);
        
        // Detect the objects in the image
        List<DetectedObject> imageObjects = detectObjects(image);
        
        // Map each concept to the object it refers to
        Map<VisualConcept, DetectedObject> alignment = 
            matchConceptsToObjects(textConcepts, imageObjects);
        
        return new SemanticAlignment(alignment);
    }
}
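
The temporal alignment step can be tried out without any media libraries. The sketch below is a minimal, self-contained version: it assumes both streams carry millisecond timestamps on a shared clock, and the names TimeRange, overlap, and firstSampleOfFrame are hypothetical helpers rather than part of any real API.

```java
import java.util.Optional;

/** Timestamp-based alignment of two streams that share a clock. */
final class TemporalAlignment {

    /** A half-open time window in milliseconds: [startMs, endMs). */
    record TimeRange(long startMs, long endMs) {
        TimeRange {
            if (endMs < startMs) {
                throw new IllegalArgumentException("endMs must not precede startMs");
            }
        }
    }

    /** Returns the window covered by both streams, or empty if they never overlap. */
    static Optional<TimeRange> overlap(TimeRange audio, TimeRange video) {
        long start = Math.max(audio.startMs(), video.startMs());
        long end = Math.min(audio.endMs(), video.endMs());
        return start < end ? Optional.of(new TimeRange(start, end)) : Optional.empty();
    }

    /** Index of the first audio sample that belongs to a given video frame. */
    static long firstSampleOfFrame(long frameIndex, double framesPerSecond, int sampleRateHz) {
        return Math.round(frameIndex / framesPerSecond * sampleRateHz);
    }
}
```

Returning an Optional forces the caller to decide what to do when the streams do not overlap, instead of silently producing an empty or negative-length segment.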

2. Optimizing compute resources

Multimodal models usually need a lot of compute, so we need optimization strategies:

@Configuration
public class MultimodalOptimization {
    
    // Quantize the model to reduce memory use
    @Bean
    public QuantizedModel createQuantizedModel() {
        return ModelQuantizer.quantize(
            originalModel, 
            QuantizationLevel.INT8  // 8-bit quantization
        );
    }
    
    // Layered processing strategy
    @Bean
    public LayeredProcessor createLayeredProcessor() {
        return LayeredProcessor.builder()
            .addLayer(new FastPreprocessor())     // Fast preprocessing
            .addLayer(new EfficientEncoder())     // Efficient encoding
            .addLayer(new OptimizedFusion())      // Optimized fusion
            .build();
    }
    
    // Caching strategy
    @Bean
    @Scope("singleton")
    public FeatureCache createFeatureCache() {
        return new FeatureCache(
            CacheConfig.builder()
                .maxSize(1000)
                .expireAfterWrite(Duration.ofHours(1))
                .build()
        );
    }
}
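
ModelQuantizer above is a placeholder, so here is a small sketch of what 8-bit quantization does to a weight tensor. It uses symmetric per-tensor quantization, one of the simplest schemes: a single scale maps the largest absolute weight to 127. The class and method names are assumptions, and real toolchains typically add per-channel scales and calibration data.

```java
/** Symmetric per-tensor INT8 quantization of a float weight array. */
final class Int8Quantization {

    /** Scale that maps the largest absolute weight to 127. */
    static float scaleFor(float[] weights) {
        float maxAbs = 0f;
        for (float w : weights) {
            maxAbs = Math.max(maxAbs, Math.abs(w));
        }
        return maxAbs == 0f ? 1f : maxAbs / 127f;
    }

    /** Rounds each weight to the nearest step and clamps to [-127, 127]. */
    static byte[] quantize(float[] weights, float scale) {
        byte[] quantized = new byte[weights.length];
        for (int i = 0; i < weights.length; i++) {
            int step = Math.round(weights[i] / scale);
            quantized[i] = (byte) Math.max(-127, Math.min(127, step));
        }
        return quantized;
    }

    /** Recovers approximate float weights; the error is at most scale / 2 per weight. */
    static float[] dequantize(byte[] quantized, float scale) {
        float[] restored = new float[quantized.length];
        for (int i = 0; i < quantized.length; i++) {
            restored[i] = quantized[i] * scale;
        }
        return restored;
    }
}
```

Each weight shrinks from four bytes to one, at the cost of a bounded rounding error; whether that error is acceptable has to be checked against the model's accuracy on real inputs.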

3. Model explainability

Making the decisions of a multimodal AI easier to explain:

public class ExplainableMultimodalAI {
    
    public ExplanationResult explainDecision(
            MultimodalInput input, 
            AIDecision decision) {
        
        // Measure how much each modality contributed
        Map<Modality, Double> contributions = calculateContributions(input);
        
        // Generate an attention heatmap
        AttentionHeatmap attentionMap = generateAttentionMap(input);
        
        // Extract the key features
        List<KeyFeature> keyFeatures = extractKeyFeatures(input, decision);
        
        // Generate a natural-language explanation
        String explanation = generateExplanation(
            contributions, 
            keyFeatures, 
            decision
        );
        
        return new ExplanationResult(
            explanation, 
            attentionMap, 
            contributions
        );
    }
    
    private String generateExplanation(
            Map<Modality, Double> contributions,
            List<KeyFeature> keyFeatures,
            AIDecision decision) {
        
        StringBuilder explanation = new StringBuilder();
        explanation.append("The AI decision is based on the following analysis:\n");
        
        // Explain each modality's contribution (values are fractions in [0, 1])
        contributions.entrySet().stream()
            .sorted(Map.Entry.<Modality, Double>comparingByValue().reversed())
            .forEach(entry -> {
                explanation.append(String.format(
                    "- %s contributed %.1f%% of the decision weight\n", 
                    entry.getKey().getName(), 
                    entry.getValue() * 100
                ));
            });
        
        // Explain the key features
        explanation.append("\nKey features include:\n");
        keyFeatures.forEach(feature -> {
            explanation.append(String.format("- %s\n", feature.getDescription()));
        });
        
        return explanation.toString();
    }
}
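
The explanation text above prints each contribution as a percentage, which presumes the values are fractions that sum to 1. One simple way to obtain such values is leave-one-out ablation: score the input with all modalities, then again with each modality removed, and normalize the drops. The sketch below assumes a hypothetical scoring function supplied by the caller; it is only one of several attribution methods and ignores interactions between modalities.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.ToDoubleFunction;

/** Leave-one-out estimate of how much each modality contributes to a score. */
final class ModalityAblation {

    /**
     * @param modalities names of the modalities present in the input
     * @param score      hypothetical model confidence given a subset of modalities
     * @return fractions that sum to 1, or all zeros if no modality changes the score
     */
    static Map<String, Double> contributions(Set<String> modalities,
                                             ToDoubleFunction<Set<String>> score) {
        double full = score.applyAsDouble(modalities);
        Map<String, Double> drops = new LinkedHashMap<>();
        double total = 0.0;
        for (String modality : modalities) {
            Set<String> without = new LinkedHashSet<>(modalities);
            without.remove(modality);
            // A modality that hurts the score is treated as contributing nothing
            double drop = Math.max(0.0, full - score.applyAsDouble(without));
            drops.put(modality, drop);
            total += drop;
        }
        if (total > 0.0) {
            final double sum = total;
            drops.replaceAll((modality, drop) -> drop / sum);
        }
        return drops;
    }
}
```

A map produced this way can be passed straight to a formatter that multiplies by 100, since the normalization guarantees the percentages add up.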

Real-World Case Studies

Case 1: Intelligent document processing

A company built an AI system that can process documents in many formats:

@RestController
@RequestMapping("/api/document")
public class DocumentProcessingController {
    
    @Autowired
    private MultimodalDocumentProcessor processor;
    
    @PostMapping("/analyze")
    public DocumentAnalysisResult analyzeDocument(
            @RequestParam("file") MultipartFile file,
            @RequestParam(value = "query", required = false) String query) {
        
        try {
            // Detect the document type
            DocumentType type = detectDocumentType(file);
            
            // Dispatch to the matching multimodal handler
            DocumentAnalysisResult result = switch (type) {
                case PDF_WITH_IMAGES -> processor.processPdfWithImages(file, query);
                case EXCEL_WITH_CHARTS -> processor.processExcelWithCharts(file, query);
                case POWERPOINT -> processor.processPowerPoint(file, query);
                case SCANNED_DOCUMENT -> processor.processScannedDocument(file, query);
                default -> processor.processGenericDocument(file, query);
            };
            
            return result;
            
        } catch (Exception e) {
            log.error("Document processing failed", e);
            throw new DocumentProcessingException("Document processing failed: " + e.getMessage());
        }
    }
}

@Service
public class MultimodalDocumentProcessor {
    
    public DocumentAnalysisResult processPdfWithImages(
            MultipartFile file, String query) {
        
        // Extract the text content
        String textContent = pdfTextExtractor.extract(file);
        
        // Extract the images
        List<BufferedImage> images = pdfImageExtractor.extract(file);
        
        // Run the multimodal analysis
        MultimodalAnalysis analysis = aiModel.analyze(
            textContent, 
            images, 
            query
        );
        
        return DocumentAnalysisResult.builder()
            .summary(analysis.getSummary())
            .keyPoints(analysis.getKeyPoints())
            .imageDescriptions(analysis.getImageDescriptions())
            .answerToQuery(analysis.getQueryAnswer())
            .build();
    }
}

Case 2: Intelligent monitoring and alerting

The intelligent operations system of an internet company:

@Component
public class IntelligentAlertSystem {
    
    @EventListener
    public void handleSystemAlert(SystemAlertEvent event) {
        // Collect the multimodal data
        MultimodalAlertData alertData = collectAlertData(event);
        
        // Analyze it with AI
        AlertAnalysisResult analysis = analyzeAlert(alertData);
        
        // Respond automatically to critical alerts
        if (analysis.getSeverity() == Severity.CRITICAL) {
            executeAutoResponse(analysis);
        }
        
        // Notify the people responsible
        notifyStakeholders(analysis);
    }
    
    private MultimodalAlertData collectAlertData(SystemAlertEvent event) {
        return MultimodalAlertData.builder()
            // Text data: logs and error messages
            .errorLogs(logService.getRecentLogs(event.getServiceName()))
            .errorMessages(event.getErrorMessages())
            
            // Image data: monitoring charts and system topology
            .performanceCharts(monitoringService.getPerformanceCharts(
                event.getServiceName(), Duration.ofHours(1)))
            .systemTopology(topologyService.getCurrentTopology())
            
            // Time-series data: metric trends
            .metricTimeSeries(metricsService.getTimeSeries(
                event.getServiceName(), Duration.ofHours(2)))
            
            .build();
    }
    
    private AlertAnalysisResult analyzeAlert(MultimodalAlertData data) {
        // Analyze with the multimodal AI
        return aiAnalyzer.analyze(
            data.getErrorLogs(),           // Text analysis
            data.getPerformanceCharts(),   // Image understanding
            data.getMetricTimeSeries()     // Time-series analysis
        );
    }
}

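Development Practice Guide

1. Choosing a suitable multimodal AI service

As Java developers, we need to pick an AI service that matches the project's requirements:

public enum MultimodalAIProvider {
    OPENAI_GPT4O("OpenAI GPT-4o", "Conversation and content creation", true, false),
    ANTHROPIC_CLAUDE("Anthropic Claude", "Document analysis", true, false),
    GOOGLE_GEMINI("Google Gemini", "Complex reasoning", true, true),
    AZURE_COGNITIVE("Azure Cognitive Services", "Enterprise integration", true, true);
    
    // Named displayName to avoid confusion with the built-in Enum.name()
    private final String displayName;
    private final String useCase;
    private final boolean supportsImage;
    private final boolean supportsVideo;
    
    // Enum constants that take arguments need a matching constructor
    MultimodalAIProvider(String displayName, String useCase,
                         boolean supportsImage, boolean supportsVideo) {
        this.displayName = displayName;
        this.useCase = useCase;
        this.supportsImage = supportsImage;
        this.supportsVideo = supportsVideo;
    }
    
    // Pick a provider from the project requirements
    public static MultimodalAIProvider selectProvider(ProjectRequirements requirements) {
        if (requirements.needsVideoProcessing()) {
            return GOOGLE_GEMINI;
        }
        if (requirements.needsDocumentAnalysis()) {
            return ANTHROPIC_CLAUDE;
        }
        if (requirements.needsRealTimeInteraction()) {
            return OPENAI_GPT4O;
        }
        return AZURE_COGNITIVE; // Default choice for enterprise projects
    }
}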
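
Whichever provider is selected, image input has to be serialized into the request. Several hosted multimodal APIs accept images inline as base64 text, so the sketch below builds a data URL for that purpose. The class name ImagePayloads, the method toDataUrl, and the size limit parameter are assumptions for illustration; the exact request shape and size limits differ between providers, so check the chosen provider's documentation.

```java
import java.util.Base64;

/** Helper for embedding image bytes in a text-based API request. */
final class ImagePayloads {

    /**
     * Encodes raw image bytes as a data URL, e.g. data:image/png;base64,....
     *
     * @param maxBytes caller-supplied upper bound, since providers cap payload size
     */
    static String toDataUrl(byte[] imageBytes, String mimeType, int maxBytes) {
        if (imageBytes.length == 0) {
            throw new IllegalArgumentException("image is empty");
        }
        if (imageBytes.length > maxBytes) {
            throw new IllegalArgumentException(
                "image is " + imageBytes.length + " bytes, limit is " + maxBytes);
        }
        String base64 = Base64.getEncoder().encodeToString(imageBytes);
        return "data:" + mimeType + ";base64," + base64;
    }
}
```

Validating the size before encoding keeps an oversized upload from turning into a rejected (and billed) API call later in the pipeline.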

2. Building a multimodal data processing pipeline

@Configuration
public class MultimodalPipelineConfig {
    
    @Bean
    public MultimodalProcessor createProcessor() {
        return MultimodalProcessor.builder()
            .addPreprocessor(new ImagePreprocessor())
            .addPreprocessor(new TextPreprocessor())
            .addPreprocessor(new AudioPreprocessor())
            .setFusionStrategy(new AttentionBasedFusion())
            .setPostprocessor(new ResultPostprocessor())
            .build();
    }
    
    @Bean
    public DataPipeline createDataPipeline() {
        return DataPipeline.builder()
            .source(new MultimodalDataSource())
            .transform(new ModalityAligner())
            .transform(new FeatureExtractor())
            .transform(new QualityFilter())
            .sink(new ProcessedDataSink())
            .build();
    }
}
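
DataPipeline and its stages above are placeholders. As a rough sketch of how such a builder-style pipeline could work underneath, the class below chains transformation stages and applies them in order. SimplePipeline and its methods are hypothetical names, and a single type parameter is used for brevity, whereas a real pipeline would likely change types between stages.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

/** A minimal builder-style pipeline that applies its stages in order. */
final class SimplePipeline<T> {

    private final List<UnaryOperator<T>> stages;

    private SimplePipeline(List<UnaryOperator<T>> stages) {
        this.stages = List.copyOf(stages); // immutable once built
    }

    static <T> Builder<T> builder() {
        return new Builder<>();
    }

    /** Feeds the input through every stage, passing each result to the next. */
    T run(T input) {
        T current = input;
        for (UnaryOperator<T> stage : stages) {
            current = stage.apply(current);
        }
        return current;
    }

    static final class Builder<T> {
        private final List<UnaryOperator<T>> stages = new ArrayList<>();

        Builder<T> transform(UnaryOperator<T> stage) {
            stages.add(stage);
            return this;
        }

        SimplePipeline<T> build() {
            return new SimplePipeline<>(stages);
        }
    }
}
```

For example, SimplePipeline.<String>builder().transform(String::trim).transform(String::toLowerCase).build() yields a pipeline whose run method trims and lower-cases its input.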

3. Error handling and graceful degradation

@Service
public class RobustMultimodalService {
    
    @Retryable(value = {AIServiceException.class}, maxAttempts = 3)
    public MultimodalResponse processWithFallback(MultimodalInput input) {
        try {
            // Try full multimodal processing first
            return fullMultimodalProcess(input);
            
        } catch (VideoProcessingException e) {
            log.warn("Video processing failed; falling back to image + text", e);
            return processImageAndText(input);
            
        } catch (ImageProcessingException e) {
            log.warn("Image processing failed; falling back to text only", e);
            return processTextOnly(input);
            
        } catch (AIServiceException e) {
            // Rethrow so @Retryable can see the exception and retry
            // (assumes AIServiceException is unchecked)
            throw e;
            
        } catch (Exception e) {
            log.error("Multimodal processing failed completely", e);
            return createErrorResponse(e);
        }
    }
    
    private MultimodalResponse processImageAndText(MultimodalInput input) {
        // Drop the video modality and process only image and text
        MultimodalInput reducedInput = input.toBuilder()
            .video(null)
            .build();
        return aiService.process(reducedInput);
    }
    
    private MultimodalResponse processTextOnly(MultimodalInput input) {
        // Process the text modality only
        return textOnlyAI.process(input.getText());
    }
}
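
The degradation logic above hard-codes the order of fallbacks in catch blocks. A more general variant, sketched below under the assumption that every processing level can be expressed as a function from input to output, is an ordered chain that tries each handler until one succeeds. FallbackChain is a hypothetical name, and only unchecked exceptions are handled to keep the sketch short.

```java
import java.util.List;
import java.util.function.Function;

/** Tries handlers from most to least capable and returns the first success. */
final class FallbackChain<I, O> {

    private final List<Function<I, O>> handlers;

    FallbackChain(List<Function<I, O>> handlers) {
        if (handlers.isEmpty()) {
            throw new IllegalArgumentException("at least one handler is required");
        }
        this.handlers = List.copyOf(handlers);
    }

    O process(I input) {
        RuntimeException lastFailure = null;
        for (Function<I, O> handler : handlers) {
            try {
                return handler.apply(input);
            } catch (RuntimeException e) {
                lastFailure = e; // remember the cause, then degrade to the next handler
            }
        }
        throw new IllegalStateException("all handlers failed", lastFailure);
    }
}
```

With handlers listed as full multimodal, image plus text, and text only, adding or reordering a degradation level becomes a one-line change to the list.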

Performance Optimization Strategies

1. Selective modality processing

@Component
public class AdaptiveModalityProcessor {
    
    public MultimodalResponse processAdaptively(
            MultimodalInput input, 
            ProcessingBudget budget) {
        
        // Choose which modalities to process within the budget
        Set<Modality> selectedModalities = selectModalities(input, budget);
        
        // Process the selected modalities in parallel
        Map<Modality, Future<ModalityResult>> futures = selectedModalities.stream()
            .collect(Collectors.toMap(
                modality -> modality,
                modality -> processModalityAsync(input, modality)
            ));
        
        // Collect the results
        Map<Modality, ModalityResult> results = new HashMap<>();
        for (Map.Entry<Modality, Future<ModalityResult>> entry : futures.entrySet()) {
            try {
                results.put(entry.getKey(), entry.getValue().get(budget.getTimeoutMs(), TimeUnit.MILLISECONDS));
            } catch (TimeoutException e) {
                log.warn("Modality {} timed out", entry.getKey());
                entry.getValue().cancel(true); // stop work whose result will not be used
            } catch (ExecutionException e) {
                log.warn("Modality {} failed", entry.getKey(), e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                break;
            }
        }
        
        return fuseResults(results);
    }
    
    private Set<Modality> selectModalities(MultimodalInput input, ProcessingBudget budget) {
        Set<Modality> available = input.getAvailableModalities();
        
        if (budget.isLowBudget()) {
            // Low budget: keep only the two most important modalities
            return available.stream()
                .sorted(Comparator.comparing(this::getModalityImportance).reversed())
                .limit(2)
                .collect(Collectors.toSet());
        } else {
            // High budget: process every available modality
            return available;
        }
    }
}
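
The result-collection loop above applies the timeout to each future in turn, so the worst case is the timeout multiplied by the number of modalities. The sketch below shows one way to enforce a single shared deadline instead, using only java.util.concurrent. TimedModalityRunner is a hypothetical name, modality work is modeled as a Supplier of String, and a thread pool is created per call purely to keep the example self-contained; a real service would reuse a shared executor.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Runs per-modality tasks in parallel under one shared deadline. */
final class TimedModalityRunner {

    /** Returns results only for the tasks that finished before the deadline. */
    static Map<String, String> runAll(Map<String, Supplier<String>> tasks, long timeoutMs) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, tasks.size()));
        try {
            Map<String, Future<String>> futures = new LinkedHashMap<>();
            for (Map.Entry<String, Supplier<String>> task : tasks.entrySet()) {
                Callable<String> call = task.getValue()::get;
                futures.put(task.getKey(), pool.submit(call));
            }
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
            Map<String, String> results = new LinkedHashMap<>();
            for (Map.Entry<String, Future<String>> entry : futures.entrySet()) {
                long remaining = Math.max(0L, deadline - System.nanoTime());
                try {
                    results.put(entry.getKey(), entry.getValue().get(remaining, TimeUnit.NANOSECONDS));
                } catch (TimeoutException | ExecutionException e) {
                    entry.getValue().cancel(true); // skip this modality
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
            return results;
        } finally {
            pool.shutdownNow(); // interrupt anything still running
        }
    }
}
```

Slow or failing modalities are simply absent from the returned map, which lets the fusion step decide how to proceed with whatever arrived in time.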

2. Caching strategy

@Service
public class MultimodalCacheService {
    
    @Cacheable(value = "imageFeatures", key = "#imageHash")
    public ImageFeatures extractImageFeatures(String imageHash, BufferedImage image) {
        return imageEncoder.encode(image);
    }
    
    // Key on the text itself: String.hashCode() can collide, and a collision
    // would return the embedding of a different text
    @Cacheable(value = "textEmbeddings", key = "#text")
    public TextEmbedding extractTextEmbedding(String text) {
        return textEncoder.encode(text);
    }
    
    @CacheEvict(value = {"imageFeatures", "textEmbeddings"}, allEntries = true)
    @Scheduled(fixedRate = 3600000) // Evict once an hour
    public void clearCache() {
        log.info("Clearing the multimodal feature cache");
    }
}
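
The image cache above expects the caller to supply an imageHash. A content hash works well for this because identical bytes always map to the same key regardless of file name. The sketch below derives such a key with SHA-256 from the standard library; the class name ContentHash is an assumption, and hashing the encoded file bytes (rather than decoded pixels) is a design choice that treats re-encoded copies of the same picture as different images.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Derives a stable cache key from the raw bytes of an image or other payload. */
final class ContentHash {

    /** Returns the SHA-256 digest of the content as 64 lowercase hex characters. */
    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory for every Java platform implementation
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

The resulting string can be passed as the imageHash argument, so repeated uploads of the same file hit the cache instead of re-running the encoder.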

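Future Trends

1. Technical directions

Longer context windows

Today: video clips a few minutes long
Future: hour-long video content

Real-time interaction

Today: mostly batch or turn-by-turn processing, with GPT-4o-style real-time interaction only just emerging
Future: millisecond-level real-time responses

More modalities

Today: text, image, audio, video
Future: sensor data such as touch, smell, and temperature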

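2. Expanding application scenarios

// Examples of future multimodal AI applications (signatures only)
public class FutureMultimodalApplications {
    
    // Holographic meeting assistant
    public MeetingInsights analyzeHolographicMeeting(
            HolographicData hologram,
            AudioStream audio,
            GestureData gestures,
            EmotionData emotions) {
        // Analyze interaction in 3D space
        // Understand speech, gestures, and facial expressions
        // Generate meeting insights
        throw new UnsupportedOperationException("Illustrative only");
    }
    
    // Quality inspection for smart manufacturing
    public QualityAssessment inspectProduct(
            VisualData cameras,
            ThermalData thermal,
            VibrationData vibration,
            SoundData audio) {
        // Fuse multiple sensors for quality inspection
        // Predict potential failures
        // Optimize the production process
        throw new UnsupportedOperationException("Illustrative only");
    }
    
    // Personalized education assistant
    public LearningPlan createPersonalizedPlan(
            StudentProfile profile,
            LearningHistory history,
            EmotionalState emotion,
            AttentionData attention) {
        // Analyze the student's learning state
        // Adjust the teaching strategy
        // Improve the learning experience
        throw new UnsupportedOperationException("Illustrative only");
    }
}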

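3. Impact on developers

New skill requirements

Multimodal data processing
AI model integration
Cross-domain knowledge

Evolving development tools

Multimodal IDE plugins
Intelligent code generation
Automated testing tools

Changes in architecture design

Event-driven architecture
Decoupled microservices
Edge computing integration

Practical Advice

1. Learning path

Fundamentals

Deep learning basics
Introduction to computer vision
Natural language processing basics
Audio signal processing

Hands-on projects

Image classification application
Text sentiment analysis
Multimodal chatbot
Video content understanding system

2. Technology selection

Small projects

Use off-the-shelf API services
Keep costs under control
Validate quickly with prototypes

Medium projects

Consider a hybrid approach
Balance performance and cost
Set up monitoring

Large projects

Train your own models
Optimize inference performance
Build a complete ecosystem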

3. Avoiding common pitfalls

// Common mistakes to avoid
// (rawText, rawImage, rawAudio, aiModel, and the preprocessors are assumed fields)
public class CommonPitfalls {
    
    // Wrong: ignoring data quality
    public void badExample() {
        // Uses the raw data directly, with no preprocessing
        MultimodalInput rawInput = new MultimodalInput(rawText, rawImage, rawAudio);
        aiModel.process(rawInput); // May produce incorrect results
    }
    
    // Right: take preprocessing seriously
    public void goodExample() {
        // Clean and preprocess the data
        String cleanedText = textPreprocessor.clean(rawText);
        BufferedImage normalizedImage = imagePreprocessor.normalize(rawImage);
        AudioData filteredAudio = audioPreprocessor.filter(rawAudio);
        
        MultimodalInput processedInput = new MultimodalInput(
            cleanedText, normalizedImage, filteredAudio);
        aiModel.process(processedInput);
    }
    
    // Wrong: ignoring error handling
    public MultimodalResponse badErrorHandling(MultimodalInput input) {
        try {
            return aiModel.process(input);
        } catch (Exception e) {
            return null; // Returning null hides the failure from the caller
        }
    }
    
    // Right: thorough error handling
    public MultimodalResponse goodErrorHandling(MultimodalInput input) {
        try {
            return aiModel.process(input);
        } catch (ModelOverloadException e) {
            // The model is overloaded: use the fallback strategy
            return fallbackProcessor.process(input);
        } catch (InvalidInputException e) {
            // The input is invalid: return an error message
            return MultimodalResponse.error("Invalid input data format: " + e.getMessage());
        } catch (Exception e) {
            // Any other exception: log it and return a generic error
            log.error("Multimodal processing failed", e);
            return MultimodalResponse.error("Processing failed, please try again later");
        }
    }
}