CVPR2024 | AIGC-Related Paper Collection (upvotes and bookmarks are welcome if you find this helpful)
Related curated lists for reference
Awesome-CVPR2024-AIGC
A Collection of Papers and Codes for CVPR2024 AIGC
Below is a curated summary of this year's CVPR AIGC-related papers and code.
Stars, forks, and PRs are welcome! Updates land first on GitHub: Awesome-CVPR2024-AIGC. Zhihu: https://zhuanlan.zhihu.com/p/684325134
Please credit the source when referencing or reposting.
CVPR2024 official site: https://cvpr.thecvf.com/Conferences/2024
Full CVPR paper list:
Conference dates: June 17-21, 2024
Acceptance announcement date:
【Contents】
1. Image Generation/Image Synthesis
2. Image Editing
3. Video Generation/Video Synthesis
4. Video Editing
5. 3D Generation/3D Synthesis
6. 3D Editing
7. Multi-Modal Large Language Models
8. Others
1. Image Generation/Image Synthesis
Accelerating Diffusion Sampling with Optimized Time Steps
Paper: https://arxiv.org/abs/2402.17376
Code: https://github.com/scxue/DM-NonUniform
Adversarial Text to Continuous Image Generation
Paper: https://openreview.net/forum?id=9X3UZJSGIg9
Amodal Completion via Progressive Mixed Context Diffusion
Paper: https://arxiv.org/abs/2312.15540
Code: https://github.com/k8xu/amodal
Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder
Paper: https://arxiv.org/abs/2403.10255
Code:
Atlantis: Enabling Underwater Depth Estimation with Stable Diffusion
Paper: https://arxiv.org/abs/2312.12471
Code: https://github.com/zkawfanx/Atlantis
CapHuman: Capture Your Moments in Parallel Universes
Paper: https://arxiv.org/abs/2402.18078
Code: https://github.com/VamosC/CapHuman
CHAIN: Enhancing Generalization in Data-Efficient GANs via lipsCHitz continuity constrAIned Normalization
Paper: https://arxiv.org/abs/2404.00521
Code:
Check, Locate, Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation
Paper: https://arxiv.org/abs/2311.15773
Code:
Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis
Paper: https://arxiv.org/abs/2402.00627
Code: https://github.com/YanzuoLu/CFLD
CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation
Paper: https://arxiv.org/abs/2310.01407
Code: https://github.com/fast-codi/CoDi
Condition-Aware Neural Network for Controlled Image Generation
CosmicMan: A Text-to-Image Foundation Model for Humans
Paper: https://arxiv.org/abs/2404.01294
Code: https://github.com/cosmicman-cvpr2024/CosmicMan
Countering Personalized Text-to-Image Generation with Influence Watermarks
Cross Initialization for Personalized Text-to-Image Generation
Paper: https://arxiv.org/abs/2312.15905
Code:
Customization Assistant for Text-to-image Generation
Paper: https://arxiv.org/abs/2312.03045
DeepCache: Accelerating Diffusion Models for Free
Paper: https://arxiv.org/abs/2312.00858
Code: https://github.com/horseee/DeepCache
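The core observation behind feature-caching accelerators like DeepCache is that deep U-Net features change slowly across adjacent denoising steps, so the expensive deep layers can run once every few steps while cheap shallow layers run every step. A toy sketch of that idea, assuming stand-in `deep_block`/`shallow_block` functions rather than the real model:

```python
import numpy as np

def deep_block(x):
    # Placeholder for the expensive deep U-Net layers (hypothetical).
    return np.tanh(x)

def shallow_block(x, feat):
    # Placeholder for the cheap shallow layers that consume deep features.
    return 0.5 * x + 0.5 * feat

def sample(x, steps=10, cache_interval=5):
    """Run `steps` denoising iterations, refreshing deep features
    only every `cache_interval` steps and reusing them in between."""
    deep_calls = 0
    feat = None
    for t in range(steps):
        if feat is None or t % cache_interval == 0:
            feat = deep_block(x)      # full forward pass
            deep_calls += 1
        x = shallow_block(x, feat)    # reuse cached deep features
    return x, deep_calls

_, deep_calls = sample(np.ones(4))
print(deep_calls)  # 2 full deep passes instead of 10
```

With `steps=10` and `cache_interval=5`, the deep layers run only at steps 0 and 5, which is where the speedup comes from; the interval trades speed against fidelity.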
DemoFusion: Democratising High-Resolution Image Generation With No $$$
Paper: https://arxiv.org/abs/2311.16973
Code: https://github.com/PRIS-CV/DemoFusion
Desigen: A Pipeline for Controllable Design Template Generation
Paper: https://arxiv.org/abs/2403.09093
Code: https://github.com/whaohan/desigen
DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model
Paper:
Code: https://github.com/OpenGVLab/DiffAgent
Diffusion-driven GAN Inversion for Multi-Modal Facial Image Generation
DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models
Paper: https://arxiv.org/abs/2402.19481
Code: https://github.com/mit-han-lab/distrifuser
Diversity-aware Channel Pruning for StyleGAN Compression
Paper: https://arxiv.org/abs/2403.13548
Code: https://github.com/jiwoogit/DCP-GAN
Discriminative Probing and Tuning for Text-to-Image Generation
Paper: https://www.arxiv.org/abs/2403.04321
Code: https://github.com/LgQu/DPT-T2I
Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation
Paper:
Code: https://github.com/haofengl/DragNoise
DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization
Paper: https://arxiv.org/abs/2402.09812
Code: https://github.com/KU-CVLAB/DreamMatcher
Dynamic Prompt Optimizing for Text-to-Image Generation
ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations
Paper: https://arxiv.org/abs/2312.04655
Code: https://github.com/eclipse-t2i/eclipse-inference
Efficient Dataset Distillation via Minimax Diffusion
Paper: https://arxiv.org/abs/2311.15529
Code: https://github.com/vimar-gu/MinimaxDiffusion
ElasticDiffusion: Training-free Arbitrary Size Image Generation
Paper: https://arxiv.org/abs/2311.18822
Code: https://github.com/MoayedHajiAli/ElasticDiffusion-official
EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models
Paper: https://arxiv.org/abs/2401.04608
Code: https://github.com/JingyuanYY/EmoGen
Enabling Multi-Concept Fusion in Text-to-Image Models
Paper:
Code:
Exact Fusion via Feature Distribution Matching for Few-shot Image Generation
Paper:
Code:
FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation
Paper: https://arxiv.org/abs/2403.06775
Code:
Fast ODE-based Sampling for Diffusion Models in Around 5 Steps
Paper: https://arxiv.org/abs/2312.00094
Code: https://github.com/zju-pi/diff-sampler
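Few-step samplers in this family treat denoising as integrating a probability-flow ODE with a handful of large steps. A minimal DDIM-style deterministic sketch, assuming a dummy `eps_model` in place of a trained noise predictor and a simple linear beta schedule:

```python
import numpy as np

def eps_model(x, t):
    # Stand-in for a trained noise-prediction network (hypothetical).
    return 0.1 * x

def ddim_sample(x, n_steps=5, T=1000):
    """Deterministic DDIM-style sampling with only n_steps solver steps."""
    betas = np.linspace(1e-4, 2e-2, T)
    alpha_bar = np.cumprod(1.0 - betas)
    ts = np.linspace(T - 1, 0, n_steps + 1).astype(int)
    for t, t_prev in zip(ts[:-1], ts[1:]):
        a_t, a_p = alpha_bar[t], alpha_bar[t_prev]
        eps = eps_model(x, t)
        # Predict the clean sample, then jump directly to the earlier step.
        x0 = (x - np.sqrt(1 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_p) * x0 + np.sqrt(1 - a_p) * eps
    return x

out = ddim_sample(np.ones(4))
```

The papers above improve on this baseline by choosing better step locations and higher-order updates; the skeleton of "predict x0, jump to the next timestep" stays the same.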
FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition
Paper: https://arxiv.org/abs/2312.07536
Code: https://github.com/genforce/freecontrol
FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition
Paper:
Code: https://github.com/aim-uofa/FreeCustom
Generalizable Tumor Synthesis
Paper: https://www.cs.jhu.edu/~alanlab/Pubs24/chen2024towards.pdf
Code: https://github.com/MrGiovanni/DiffTumor
Generative Unlearning for Any Identity
Paper:
Code: https://github.com/JJuOn/GUIDE
HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances
Paper: https://arxiv.org/abs/2403.01693
High-fidelity Person-centric Subject-to-Image Synthesis
Paper: https://arxiv.org/abs/2311.10329
Code: https://github.com/CodeGoat24/Face-diffuser
InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning
Paper: https://arxiv.org/abs/2304.03411
InstanceDiffusion: Instance-level Control for Image Generation
Paper: https://arxiv.org/abs/2402.03290
Code: https://github.com/frank-xwang/InstanceDiffusion
Instruct-Imagen: Image Generation with Multi-modal Instruction
Paper: https://arxiv.org/abs/2401.01952
Code:
Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models
Paper: https://arxiv.org/abs/2306.00973
Code: https://github.com/haoningwu3639/StoryGen
InteractDiffusion: Interaction-Control for Text-to-Image Diffusion Model
Paper: https://arxiv.org/abs/2312.05849
Code: https://github.com/jiuntian/interactdiffusion
Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models
Paper: https://arxiv.org/abs/2308.15692
Inversion-Free Image Editing with Natural Language
Paper: https://arxiv.org/abs/2312.04965
Code: https://github.com/sled-group/InfEdit
JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation
LAKE-RED: Camouflaged Images Generation by Latent Background Knowledge Retrieval-Augmented Diffusion
Paper:
Code: https://github.com/PanchengZhao/LAKE-RED
Learned representation-guided diffusion models for large-image generation
Paper: https://arxiv.org/abs/2312.07330
Learning Continuous 3D Words for Text-to-Image Generation
Paper: https://arxiv.org/abs/2402.08654
Code: https://github.com/ttchengab/continuous_3d_words_code/
Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation
Paper: https://arxiv.org/abs/2311.15841
Code:
Learning Multi-dimensional Human Preference for Text-to-Image Generation
LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model
Paper:
Code: https://github.com/ewrfcas/LeftRefill
MACE: Mass Concept Erasure in Diffusion Models
Paper: https://arxiv.org/abs/2402.05408
Code: https://github.com/Shilin-LU/MACE
MarkovGen: Structured Prediction for Efficient Text-to-Image Generation
Paper: https://arxiv.org/abs/2308.10997
MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant
Paper: https://arxiv.org/abs/2403.04290
MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis
Paper: https://arxiv.org/abs/2402.05408
Code: https://github.com/limuloo/MIGC
MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation
On the Scalability of Diffusion-based Text-to-Image Generation
Personalized Residuals for Concept-Driven Text-to-Image Generation
PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding
Paper: https://arxiv.org/abs/2312.04461
Code: https://github.com/TencentARC/PhotoMaker
PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis
Paper:
Code: https://github.com/cszy98/PLACE
Prompt-Free Diffusion: Taking “Text” out of Text-to-Image Diffusion Models
Paper: https://arxiv.org/abs/2305.16223
Code: https://github.com/SHI-Labs/Prompt-Free-Diffusion
Readout Guidance: Learning Control from Diffusion Features
Paper: https://arxiv.org/abs/2312.02150
Code: https://github.com/google-research/readout_guidance
Relation Rectification in Diffusion Model
Paper:
Code: https://github.com/WUyinwei-hah/RRNet
Residual Denoising Diffusion Models
Paper: https://arxiv.org/abs/2308.13712
Code: https://github.com/nachifur/RDDM
Rethinking FID: Towards a Better Evaluation Metric for Image Generation
Paper: https://arxiv.org/abs/2401.09603
Code: https://github.com/google-research/google-research/tree/master/cmmd
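The CMMD metric proposed in "Rethinking FID" replaces FID's Gaussian assumption with a maximum mean discrepancy (MMD) between image embeddings. A toy sketch of an RBF-kernel MMD between two feature sets, assuming random stand-in features rather than the CLIP embeddings the paper uses:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    # Pairwise RBF kernel matrix between rows of a and b.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    # Squared MMD estimate: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)].
    return (rbf_kernel(x, x, sigma).mean()
            + rbf_kernel(y, y, sigma).mean()
            - 2.0 * rbf_kernel(x, y, sigma).mean())

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(64, 8))    # "reference" features
close = rng.normal(0.0, 1.0, size=(64, 8))   # same distribution
far = rng.normal(3.0, 1.0, size=(64, 8))     # shifted distribution

assert mmd2(real, close) < mmd2(real, far)   # matched sets score lower
```

Unlike FID, this estimator needs no covariance inversion and is unbiased-ish at small sample sizes, which is part of the paper's argument.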
Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation
Paper: https://arxiv.org/abs/2311.13602
Code: https://github.com/CyberAgentAILab/RALF
Rich Human Feedback for Text-to-Image Generation
Paper: https://arxiv.org/abs/2312.10240
SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation
Paper: https://arxiv.org/abs/2401.08053
Code:
Self-correcting LLM-controlled Diffusion Models
Paper: https://arxiv.org/abs/2311.16090
Code: https://github.com/tsunghan-wu/SLD
Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation
Paper: https://arxiv.org/abs/2311.17216
Shadow Generation for Composite Image Using Diffusion Model
Paper: https://arxiv.org/abs/2308.09972
Code: https://github.com/bcmi/Object-Shadow-Generation-Dataset-DESOBAv2
Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models
Paper: https://arxiv.org/abs/2312.04410
Code: https://github.com/SHI-Labs/Smooth-Diffusion
SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation
Paper: https://arxiv.org/abs/2312.16272
Code: https://github.com/Xiaojiu-z/SSR_Encoder
StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On
Paper: https://arxiv.org/abs/2312.01725
Code: https://github.com/rlawjdghek/StableVITON
Structure-Guided Adversarial Training of Diffusion Models
Paper: https://arxiv.org/abs/2402.17563
Code:
Style Aligned Image Generation via Shared Attention
Paper: https://arxiv.org/abs/2312.02133
Code: https://github.com/google/style-aligned/
SVGDreamer: Text Guided SVG Generation with Diffusion Model
Paper: https://arxiv.org/abs/2312.16476
Code: https://github.com/ximinng/SVGDreamer
Tailored Visions: Enhancing Text-to-Image Generation with Personalized Prompt Rewriting
Paper: https://arxiv.org/abs/2310.08129
Code: https://github.com/zzjchen/Tailored-Visions
Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models
Paper: https://arxiv.org/abs/2403.08381
Code: https://github.com/PangzeCheung/SingDiffusion
Taming Stable Diffusion for Text to 360° Panorama Image Generation
TextCraftor: Your Text Encoder Can be Image Quality Controller
Paper: https://arxiv.org/abs/2403.18978
Code:
Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation
Paper: https://arxiv.org/abs/2403.06247
TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models
Paper: https://arxiv.org/abs/2311.16503
Code: https://github.com/ModelTC/TFMQ-DM
TokenCompose: Grounding Diffusion with Token-level Supervision
Paper: https://arxiv.org/abs/2312.03626
Code: https://github.com/mlpc-ucsd/TokenCompose
Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation
Paper: https://arxiv.org/abs/2403.05239
Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning
UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs
Paper: https://arxiv.org/abs/2311.09257
UniGS: Unified Representation for Image Generation and Segmentation
Paper: https://arxiv.org/abs/2312.01985
Code: https://github.com/qqlu/Entity
ViewDiff: 3D-Consistent Image Generation with Text-To-Image Models
Paper: https://arxiv.org/abs/2403.01807
Code: https://github.com/facebookresearch/ViewDiff
When StyleGAN Meets Stable Diffusion: a 𝒲+ Adapter for Personalized Image Generation
Paper: https://arxiv.org/abs/2311.17461
Code: https://github.com/csxmli2016/w-plus-adapter
X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model
Paper: https://arxiv.org/abs/2312.02238
Code: https://github.com/showlab/X-Adapter
2. Image Editing
An Edit Friendly DDPM Noise Space: Inversion and Manipulations
Paper: https://arxiv.org/abs/2304.06140
Code: https://github.com/inbarhub/DDPM_inversion
Content-Style Decoupling for Unsupervised Makeup Transfer without Generating Pseudo Ground Truth
Paper:
Code: https://github.com/Snowfallingplum/CSD-MT
Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing
Paper: https://arxiv.org/abs/2311.18608
Code: https://github.com/HyelinNAM/ContrastiveDenoisingScore
DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations
Paper: https://arxiv.org/abs/2403.06951
Code: https://github.com/Tianhao-Qi/DEADiff_code
Deformable One-shot Face Stylization via DINO Semantic Guidance
Paper: https://arxiv.org/abs/2403.00459
Code: https://github.com/zichongc/DoesFS
DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection
Paper:
Code: https://github.com/HansSunY/DiffAM
DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing
Paper: https://arxiv.org/abs/2312.07409
Code: https://github.com/Kevin-thu/DiffMorpher
Diffusion Models Without Attention
Paper: https://arxiv.org/abs/2311.18257
Code:
Doubly Abductive Counterfactual Inference for Text-based Image Editing
Paper: https://arxiv.org/abs/2403.02981Code: https://github.com/xuesong39/DAC
Edit One for All: Interactive Batch Image Editing
Paper: https://arxiv.org/abs/2401.10219Code: https://github.com/thaoshibe/edit-one-for-all
Face2Diffusion for Fast and Editable Face Personalization
Paper: https://arxiv.org/abs/2403.05094Code: https://github.com/mapooon/Face2Diffusion
Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation
Paper: https://arxiv.org/abs/2312.10113Code: https://github.com/guoqincode/Focus-on-Your-Instruction
Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image
Paper: https://arxiv.org/abs/2403.09632Code: https://github.com/guoqincode/Focus-on-Your-Instruction
Inversion-Free Image Editing with Natural Language
Paper: hhttps://arxiv.org/abs/2312.04965Code: https://github.com/sled-group/InfEdit
PAIR-Diffusion: Object-Level Image Editing with Structure-and-Appearance Paired Diffusion Models
Paper: https://arxiv.org/abs/2303.17546Code: https://github.com/Picsart-AI-Research/PAIR-Diffusion
Person in Place: Generating Associative Skeleton-Guidance Maps for Human-Object Interaction Image Editing
Paper: https://arxiv.org/abs/2303.17546Code: https://github.com/YangChangHee/CVPR2024_Person-In-Place_RELEASE?tab=readme-ov-file
PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models
Paper: https://arxiv.org/abs/2312.13964Code: https://github.com/open-mmlab/PIA
FreeDrag: Feature Dragging for Reliable Point-based Image Editing
Paper: https://arxiv.org/abs/2307.04684Code: https://github.com/LPengYang/FreeDrag
RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization
Paper: https://arxiv.org/abs/2403.00483Code:
SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models
Paper: https://arxiv.org/abs/2312.06739Code: https://github.com/TencentARC/SmartEdit
Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer
Paper: https://arxiv.org/abs/2312.09008Code: https://github.com/jiwoogit/StyleID
SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting
Paper: https://arxiv.org/abs/2402.18848Code:
Text-Driven Image Editing via Learnable Regions
Paper: https://arxiv.org/abs/2311.16432Code: https://github.com/yuanze-lin/Learnable_Regions
UniHuman: A Unified Model For Editing Human Images in the Wild
Paper: https://arxiv.org/abs/2312.14985Code: https://github.com/NannanLi999/UniHuman
3. Video Generation/Video Synthesis
360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model
Paper: https://arxiv.org/abs/2401.06578
Code: https://github.com/Akaneqwq/360DVD
A Recipe for Scaling up Text-to-Video Generation with Text-free Videos
Paper: https://arxiv.org/abs/2312.15770
Code: https://tf-t2v.github.io/
Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model
Delving Deep into Diffusion Transformers for Image and Video Generation
DiffPerformer: Iterative Learning of Consistent Latent Guidance for Diffusion-based Human Video Generation
DisCo: Disentangled Control for Realistic Human Dance Generation
Paper: https://arxiv.org/abs/2307.00040
Code: https://github.com/Wangt-CN/DisCo
FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Generation
Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models
Hierarchical Patch-wise Diffusion Models for High-Resolution Video Generation
Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation
LAMP: Learn A Motion Pattern for Few-Shot Video Generation
Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation guided by the Characteristic Dance Primitives
Paper: https://arxiv.org/abs/2403.10518
Code: https://github.com/li-ronghui/LODGE
MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
Paper: https://arxiv.org/abs/2311.16498
Code: https://github.com/magic-research/magic-animate
Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework
Paper: https://arxiv.org/abs/2403.16510
Code: https://github.com/ICTMCG/Make-Your-Anchor
Make Your Dream A Vlog
Paper: https://arxiv.org/abs/2401.09414
Code: https://github.com/Vchitect/Vlogger
Make Pixels Dance: High-Dynamic Video Generation
MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation
Panacea: Panoramic and Controllable Video Generation for Autonomous Driving
Paper: https://arxiv.org/abs/2311.16813
Code: https://github.com/wenyuqing/panacea
PEEKABOO: Interactive Video Generation via Masked-Diffusion
Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
Paper: https://arxiv.org/abs/2308.13712
Code: https://github.com/yzxing87/Seeing-and-Hearing
SimDA: Simple Diffusion Adapter for Efficient Video Generation
Simple but Effective Text-to-Video Generation with Grid Diffusion Models
StyleCineGAN: Landscape Cinemagraph Generation using a Pre-trained StyleGAN
Paper: https://arxiv.org/abs/2403.14186
Code: https://github.com/jeolpyeoni/StyleCineGAN
SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis
Paper: https://arxiv.org/abs/2311.17590
Code: https://github.com/ZiqiaoPeng/SyncTalk
Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation
Paper: https://arxiv.org/abs/2212.11565
Code: https://github.com/showlab/Tune-A-Video
VideoBooth: Diffusion-based Video Generation with Image Prompts
Paper: https://arxiv.org/abs/2312.00777
Code: https://github.com/Vchitect/VideoBooth
VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models
Paper: https://arxiv.org/abs/2401.09047
Code: https://github.com/AILab-CVC/VideoCrafter
4. Video Editing
A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing
Paper: https://arxiv.org/abs/2312.05856
Code: https://github.com/STEM-Inv/stem-inv
CAMEL: Causal Motion Enhancement tailored for lifting text-driven video editing
Paper:
Code: https://github.com/zhangguiwei610/CAMEL
CCEdit: Creative and Controllable Video Editing via Diffusion Models
Paper: https://arxiv.org/abs/2309.16496
Code: https://github.com/RuoyuFeng/CCEdit
CoDeF: Content Deformation Fields for Temporally Consistent Video Processing
Paper: https://arxiv.org/abs/2308.07926
Code: https://github.com/qiuyu96/CoDeF
FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation
Paper:
Code: https://github.com/williamyang1991/FRESCO/tree/main
RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models
Paper: https://arxiv.org/abs/2312.04524
Code: https://github.com/rehg-lab/RAVE
VidToMe: Video Token Merging for Zero-Shot Video Editing
Paper: https://arxiv.org/abs/2312.10656
Code: https://github.com/lixirui142/VidToMe
VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models
Paper: https://arxiv.org/abs/2312.00845
Code: https://github.com/HyeonHo99/Video-Motion-Customization
5. 3D Generation/3D Synthesis
Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling
Paper: https://arxiv.org/abs/2311.16096
Code: https://github.com/lizhe00/AnimatableGaussians
BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation
Paper: https://arxiv.org/abs/2312.02136
Code: https://github.com/zqh0253/BerfScene
CAD: Photorealistic 3D Generation via Adversarial Distillation
Paper: https://arxiv.org/abs/2312.06663
Code: https://github.com/raywzy/CAD
CAGE: Controllable Articulation GEneration
Paper: https://arxiv.org/abs/2312.09570
Code: https://github.com/3dlg-hcvc/cage
CityDreamer: Compositional Generative Model of Unbounded 3D Cities
Paper: https://arxiv.org/abs/2309.00610
Code: https://github.com/hzxie/CityDreamer
Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior
Paper: https://arxiv.org/abs/2403.13667
Code: https://github.com/sail-sg/Consistent3D
ControlRoom3D: Room Generation using Semantic Controls
DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance
Paper: https://arxiv.org/abs/2401.09050
Code: https://github.com/Carmenw1203/DanceCamera3D-Official
DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis
Paper: https://arxiv.org/abs/2312.13016
Code: https://github.com/FreedomGu/DiffPortrait3D
DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis
Paper: https://arxiv.org/abs/2303.14207
Code: https://github.com/tangjiapeng/DiffuScene
Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features
Paper: https://arxiv.org/abs/2311.17024
Code: https://github.com/niladridutt/Diffusion-3D-Features
DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models
Paper: https://arxiv.org/abs/2304.00916
Code: https://github.com/yukangcao/DreamAvatar
DreamComposer: Controllable 3D Object Generation via Multi-View Conditions
Paper: https://arxiv.org/abs/2312.03611
Code: https://github.com/yhyang-myron/DreamComposer
DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior
Paper: https://arxiv.org/abs/2312.06439
Code: https://github.com/tyhuang0428/DreamControl
Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion
Paper: https://arxiv.org/abs/2312.04466
Code: https://github.com/kiranchhatre/amuse
EscherNet: A Generative Model for Scalable View Synthesis
Paper: https://arxiv.org/abs/2402.03908
Code:
GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models
Paper: https://arxiv.org/abs/2310.08529
Code: https://github.com/hustvl/GaussianDreamer
GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation
Paper: https://arxiv.org/abs/2401.04092
Code: https://github.com/3DTopia/GPTEval3D
Gaussian Shell Maps for Efficient 3D Human Generation
Paper: https://arxiv.org/abs/2311.17857
Code: https://github.com/computational-imaging/GSM
HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D
Paper: https://arxiv.org/abs/2312.15980
Code: https://github.com/byeongjun-park/HarmonyView
HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding
Holodeck: Language Guided Generation of 3D Embodied AI Environments
Paper: https://arxiv.org/abs/2312.09067
Code: https://github.com/allenai/Holodeck
HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation
Interactive3D: Create What You Want by Interactive 3D Generation
InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion
Paper: https://arxiv.org/abs/2403.17422
Code: https://github.com/jyunlee/InterHandGen
Intrinsic Image Diffusion for Single-view Material Estimation
Paper: https://arxiv.org/abs/2312.12274
Code: https://github.com/Peter-Kocsis/IntrinsicImageDiffusion
Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text
Paper: https://arxiv.org/abs/2403.16897
Code: https://github.com/junshutang/Make-It-Vivid
MoMask: Generative Masked Modeling of 3D Human Motions
Paper: https://arxiv.org/abs/2312.00063
Code: https://github.com/EricGuo5513/momask-codes
Editable Scene Simulation for Autonomous Driving via LLM-Agent Collaboration
Paper: https://arxiv.org/abs/2402.05746
Code: https://github.com/yifanlu0227/ChatSim
EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion
Paper: https://arxiv.org/abs/2312.06725
Code: https://github.com/huanngzh/EpiDiff
OED: Towards One-stage End-to-End Dynamic Scene Graph Generation
One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion
PEGASUS: Personalized Generative 3D Avatars with Composable Attributes
Paper: https://arxiv.org/abs/2402.10636
Code: https://github.com/snuvclab/pegasus
PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics
Paper: https://arxiv.org/abs/2311.12198
Code: https://github.com/XPandora/PhysGaussian
RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D
Paper: https://arxiv.org/abs/2311.16918
Code: https://github.com/modelscope/richdreamer
SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors
Paper: https://arxiv.org/abs/2311.17261
Code: https://github.com/daveredrum/SceneTex
SceneWiz3D: Towards Text-guided 3D Scene Composition
Paper: https://arxiv.org/abs/2312.08885
Code: https://github.com/zqh0253/SceneWiz3D
SemCity: Semantic Scene Generation with Triplane Diffusion
Paper: https://arxiv.org/abs/2403.07773
Code: https://github.com/zoomin-lee/SemCity
Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior
Paper: https://arxiv.org/abs/2312.06655
Code: https://github.com/liuff19/Sherpa3D
SIGNeRF: Scene Integrated Generation for Neural Radiance Fields
Single Mesh Diffusion Models with Field Latents for Texture Generation
SPAD: Spatially Aware Multiview Diffusers
Paper: https://arxiv.org/abs/2402.05235
Code: https://github.com/yashkant/spad
Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors
Paper: https://arxiv.org/abs/2312.04963
Code: https://github.com/BiDiff/bidiff
Text-to-3D using Gaussian Splatting
Paper: https://arxiv.org/abs/2309.16585
Code: https://github.com/gsgen3d/gsgen
Tiger: Time-Varying Denoising Model for 3D Point Cloud Generation with Diffusion Process
Paper:
Code: https://github.com/Zhiyuan-R/Tiger-Diffusion
Towards Realistic Scene Generation with LiDAR Diffusion Models
Paper:
Code: https://github.com/hancyran/LiDAR-Diffusion
ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models
Paper: https://arxiv.org/abs/2312.01305
Code: https://github.com/ubc-vision/vivid123
6. 3D Editing
GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting
Paper: https://arxiv.org/abs/2311.14521
Code: https://github.com/buaacyw/GaussianEditor
Makeup Prior Models for 3D Facial Makeup Estimation and Applications
Paper: https://arxiv.org/abs/2403.17761
Code: https://github.com/YangXingchao/makeup-priors
7. Multi-Modal Large Language Models
Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
Paper: https://arxiv.org/abs/2312.03818
Code: https://github.com/SunzeY/AlphaCLIP
Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters
Paper: https://arxiv.org/abs/2403.11549
Code: https://github.com/JiazuoYu/MoE-Adapters4CL
Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction
Paper: https://arxiv.org/abs/2403.18447
Code: https://github.com/InhwanBae/LMTrajectory
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Paper: https://arxiv.org/abs/2311.08046
Code: https://github.com/PKU-YuanGroup/Chat-UniVi
Compositional Chain-of-Thought Prompting for Large Multimodal Models
Paper: https://arxiv.org/abs/2311.17076
Code: https://github.com/chancharikmitra/CCoT
Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models
Paper: https://arxiv.org/abs/2403.17589
Code: https://github.com/YBZh/DMN
Efficient Stitchable Task Adaptation
Paper: https://arxiv.org/abs/2311.17352
Code: https://github.com/ziplab/Stitched_LLaMA
Efficient Test-Time Adaptation of Vision-Language Models
Paper: https://arxiv.org/abs/2403.18293
Code: https://github.com/kdiAAA/TDA
Generative Multimodal Models are In-Context Learners
Paper: https://arxiv.org/abs/2312.13286
Code: https://github.com/baaivision/Emu/tree/main/Emu2
GLaMM: Pixel Grounding Large Multimodal Model
Paper: https://arxiv.org/abs/2311.03356
Code: https://github.com/mbzuai-oryx/groundingLMM
GPT4Point: A Unified Framework for Point-Language Understanding and Generation
Paper: https://arxiv.org/abs/2312.02980
Code: https://github.com/Pointcept/GPT4Point
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Paper: https://arxiv.org/abs/2312.14238
Code: https://github.com/OpenGVLab/InternVL
Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation
Paper: https://arxiv.org/abs/2312.02439
Code: https://github.com/sail-sg/CLoT
LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge
Paper: https://arxiv.org/abs/2311.11860
Code: https://github.com/rshaojimmy/JiuTian
LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning
Paper: https://arxiv.org/abs/2311.18651
Code: https://github.com/Open3DA/LL3DA
Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding
Paper: https://arxiv.org/abs/2311.16922
Code: https://github.com/DAMO-NLP-SG/VCD
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
Paper: https://arxiv.org/abs/2311.17049
Code: https://github.com/apple/ml-mobileclip
MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric
Paper: https://arxiv.org/abs/2403.07839
Code:
OneLLM: One Framework to Align All Modalities with Language
Paper: https://arxiv.org/abs/2312.03700
Code: https://github.com/csuhan/OneLLM
One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
Paper: https://arxiv.org/abs/2403.01849
Code: https://github.com/TreeLLi/APT
OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation
Paper: https://arxiv.org/abs/2311.17911
Code: https://github.com/shikiw/OPERA
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Paper: https://arxiv.org/abs/2402.19479
Code: https://github.com/snap-research/Panda-70M
PixelLM: Pixel Reasoning with Large Multimodal Model
Paper: https://arxiv.org/abs/2312.02228
Code: https://github.com/MaverickRen/PixelLM
Prompt Highlighter: Interactive Control for Multi-Modal LLMs
Paper: https://arxiv.org/abs/2312.04302
Code: https://github.com/dvlab-research/Prompt-Highlighter
Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models
Paper: https://arxiv.org/abs/2311.06783
Code: https://github.com/Q-Future/Q-Instruct
SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models
Paper: https://arxiv.org/abs/2403.13263
Code: https://github.com/ivattyue/SC-Tune
SEED-Bench: Benchmarking Multimodal Large Language Models
Paper: https://arxiv.org/abs/2311.17092
Code: https://github.com/AILab-CVC/SEED-Bench
UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All
Paper: https://arxiv.org/abs/2403.12532
Code:
VBench: Comprehensive Benchmark Suite for Video Generative Models
Paper: https://arxiv.org/abs/2311.17982
Code: https://github.com/Vchitect/VBench
VideoChat: Chat-Centric Video Understanding
Paper: https://arxiv.org/abs/2305.06355
Code: https://github.com/OpenGVLab/Ask-Anything
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Paper: https://arxiv.org/abs/2312.00784
Code: https://github.com/mu-cai/ViP-LLaVA
ViT-Lens: Towards Omni-modal Representations
Paper: https://arxiv.org/abs/2308.10185
Code: https://github.com/TencentARC/ViT-Lens
8. Others
AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error
Paper: https://arxiv.org/abs/2401.17879
Code: https://github.com/jonasricker/aeroblade
Diff-BGM: A Diffusion Model for Video Background Music Generation
EvalCrafter: Benchmarking and Evaluating Large Video Generation Models
Paper: https://arxiv.org/abs/2310.11440
Code: https://github.com/evalcrafter/EvalCrafter
Continuously updated~
References
CVPR 2024 Papers and Open-Source Projects Collection (Papers with Code)
Related Collections
Awesome-AIGC-Research-Groups
Awesome-Low-Level-Vision-Research-Groups
Awesome-CVPR2024-CVPR2021-CVPR2020-Low-Level-Vision
Awesome-ECCV2020-Low-Level-Vision