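Multimodal machine learning (MultiModal Machine Learning, MMML) is a machine learning approach aimed at complex tasks such as multimodal sentiment analysis and cross-lingual image search. These tasks require considering data from several modalities at once and extracting useful information from them.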






1.Multiplicative Interactions and Where to Find Them
2.Tensor fusion network for multimodal sentiment analysis
3.On the Benefits of Early Fusion in Multimodal Representation Learning
4.Extending long short-term memory for multi-view structured learning
5.Devise: A deep visual-semantic embedding model
6.Learning transferable visual models from natural language supervision
7.Order-embeddings of images and language
8.Learning Concept Taxonomies from Multi-modal Data
9.Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!
10.Learning factorized multimodal representations
11.Multimodal clustering networks for self-supervised learning from unlabeled videos
12.Deep multimodal subspace clustering networks


1.Visual Referring Expression Recognition: What Do Systems Actually Learn?
2.Unsupervised multimodal representation learning across medical images and reports
3.Clip-event: Connecting text and images with event structures
4.Learning by aligning videos in time
5.Multimodal adversarial network for cross-modal retrieval
6.Videobert: A joint model for video and language representation learning
7.Visualbert: A simple and performant baseline for vision and language
8.Decoupling the role of data, attention, and losses in multimodal transformers
9.Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks
10.MTAG: Modal-Temporal Attention Graph for Unaligned Human Multimodal Language Sequences


1.Neural module networks
2.Dynamic memory networks for visual and textual question answering
3.A Survey of Reinforcement Learning Informed by Natural Language
4.Mfas: Multimodal fusion architecture search
5.Multi-view intact space learning
6.Neuro-Symbolic Visual Reasoning: Disentangling Visual from Reasoning
7.Probabilistic neural symbolic models for interpretable visual question answering
8.Learning by abstraction: The neural state machine
9.Socratic models: Composing zero-shot multimodal reasoning with language
10.Vqa-lol: Visual question answering under the lens of logic
11.Multimodal logical inference system for visual-textual entailment
12.Towards causal vqa: Revealing and reducing spurious correlations by invariant and covariant semantic editing
13.Counterfactual vqa: A cause-effect look at language bias
14.Exploring visual relationship for image captioning
15.KAT: A Knowledge Augmented Transformer for Vision-and-Language
16.Building a large-scale multimodal knowledge base system for answering visual queries
17.Visualcomet: Reasoning about the dynamic context of a still image
18.From Recognition to Cognition: Visual Commonsense Reasoning


1.Multimodal summarization of complex sentences
2.Extractive Text-Image Summarization Using Multi-Modal RNN
3.Multi-modal Summarization for Asynchronous Collection of Text, Image, Audio and Video
4.Multimodal abstractive summarization for how2 videos
5.Deep fragment embeddings for bidirectional image sentence mapping
6.Phrase-based image captioning
7.Style transfer for co-speech gesture animation: A multi-speaker conditional-mixture approach
8.You said that?: Synthesising talking faces from audio
9.Zero-shot text-to-image generation
10.Stochastic video generation with a learned prior
11.Parallel wavenet: Fast high-fidelity speech synthesis
12.Arbitrary talking face generation via attentional audio-visual coherence learning


1.Integrating Multimodal Information in Large Pretrained Transformers


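Summary: This paper proposes an attachment called the Multimodal Adaptation Gate (MAG) that can be added to BERT and XLNet so that they accept multimodal nonverbal data during fine-tuning. MAG works by generating a shift to the internal representations of BERT and XLNet, and this shift is conditioned on the visual and acoustic modalities. Experiments show that fine-tuning MAG-BERT and MAG-XLNet significantly improves sentiment analysis performance over both previous baselines and language-only fine-tuning of BERT and XLNet. On the CMU-MOSI dataset, MAG-XLNet achieves human-level multimodal sentiment analysis performance for the first time.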
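To make the idea of a "shift conditioned on the visual and acoustic modalities" more concrete, here is a minimal pure-Python sketch for a single token. It is an illustration under assumptions, not the paper's implementation: the ReLU gates, the norm-based scaling factor `beta`, the parameter names (`W_gv`, `W_ga`, `W_v`, `W_a`) and the helper functions are all hypothetical choices made for this example, and real code would operate on batched tensors inside a Transformer layer.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(xi * xi for xi in x))

def random_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix standing in for learned weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def gated_shift(z, visual, acoustic, params, beta=1.0, eps=1e-6):
    """Shift one language representation z using nonverbal features.

    Assumed form: each nonverbal modality gets a ReLU gate computed from the
    concatenation [z; modality]; the gated projections are summed into a
    shift h; h is rescaled so that its norm stays below beta * ||z||.
    """
    relu = lambda v: [max(0.0, vi) for vi in v]
    g_v = relu(matvec(params["W_gv"], z + visual))    # gate for the visual modality
    g_a = relu(matvec(params["W_ga"], z + acoustic))  # gate for the acoustic modality
    p_v = matvec(params["W_v"], visual)               # visual features projected to dim(z)
    p_a = matvec(params["W_a"], acoustic)             # acoustic features projected to dim(z)
    h = [gv * pv + ga * pa for gv, pv, ga, pa in zip(g_v, p_v, g_a, p_a)]
    alpha = min(beta * norm(z) / (norm(h) + eps), 1.0)  # keep the shift small relative to z
    return [zi + alpha * hi for zi, hi in zip(z, h)]

# Toy usage with made-up dimensions: 4-d language vector, 3-d visual, 2-d acoustic.
rng = random.Random(0)
d, dv, da = 4, 3, 2
params = {
    "W_gv": random_matrix(d, d + dv, rng),
    "W_ga": random_matrix(d, d + da, rng),
    "W_v": random_matrix(d, dv, rng),
    "W_a": random_matrix(d, da, rng),
}
z = [rng.uniform(-1, 1) for _ in range(d)]
visual = [rng.uniform(-1, 1) for _ in range(dv)]
acoustic = [rng.uniform(-1, 1) for _ in range(da)]
shifted = gated_shift(z, visual, acoustic, params)
print(len(shifted) == len(z))  # the shifted vector keeps the dimensionality of z
```

Because the output has the same shape as the input, a module of this kind can be placed between existing layers of a pretrained language model without changing the rest of the network. In this sketch, all-zero nonverbal features leave `z` unchanged.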

2.Multimodal few-shot learning with frozen language models
3.HighMMT: Towards Modality and Task Generalization for High-Modality Representation Learning
4.FLAVA: A Foundational Language And Vision Alignment Model
5.Pretrained transformers as universal computation engines
6.Scaling up visual and visual language representation learning with noisy text supervision
7.Foundations of multimodal co-learning
8.Found in translation: Learning robust joint representations by cyclic translations between modalities
9.Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision
10.Combining labeled and unlabeled data with co-training
11.Cross-modal data programming enables rapid medical machine learning
12.An information theoretic framework for multi-view learning
13.Comprehensive Semi-Supervised Multi-Modal Learning


1.Perceptual Score: What Data Modalities Does Your Model Perceive?
2.Multimodal explanations: Justifying decisions and pointing to the evidence
3.Women also snowboard: Overcoming bias in captioning models
4.FairCVtest Demo: Understanding Bias in Multimodal Learning with a Testbed in Fair Automatic Recruitment
5.Smil: Multimodal learning with severely missing modality
6.VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers
7.Behind the scene: Revealing the secrets of pre-trained vision-and-language models
8.Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality
9.Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!
10.MultiViz: Towards Visualizing and Understanding Multimodal Models
11.M2Lens: Visualizing and explaining multimodal models for sentiment analysis
12.HighMMT: Towards Modality and Task Generalization for High-Modality Representation Learning
13.One model to learn them all
14.What Makes Training Multi-Modal Classification Networks Hard?
15.Characterizing and overcoming the greedy nature of learning in multi-modal deep neural networks
16.MultiBench: Multiscale Benchmarks for Multimodal Representation Learning




