Diffusion Models are Secretly Zero-Shot 3DGS Harmonizers

CVLAB, EPFL *Equal contribution
Transactions on Machine Learning Research 2026

Abstract

Gaussian Splatting has become a popular technique for various 3D Computer Vision tasks, including novel view synthesis, scene reconstruction, and dynamic scene rendering. However, the challenge of natural-looking object insertion, where the object's appearance seamlessly matches the scene, remains unsolved. In this work, we propose a method, dubbed D3DR, for inserting a 3DGS-parametrized object into a 3DGS scene while correcting its lighting, shadows, and other visual artifacts to ensure consistency. We reveal a hidden ability of diffusion models trained on large real-world datasets to implicitly understand correct scene lighting, and leverage it in our pipeline. After inserting the object, we optimize a diffusion-based Delta Denoising Score (DDS)-inspired objective to adjust its 3D Gaussian parameters for proper lighting correction. We introduce a novel diffusion personalization technique that preserves object geometry and texture across diverse lighting conditions, and utilize it to achieve consistent identity matching between original and inserted objects. Finally, we demonstrate the effectiveness of the method by comparing it to existing approaches, achieving 2.0 dB PSNR improvements in relighting quality.

Overview of the task: our method inserts a 3DGS object at a specified location in a 3DGS scene and then adjusts the object's appearance to match the scene's lighting. The result is a new 3DGS scene that contains both the input scene and the object with realistic lighting.

Qualitative Results

360° views of our relighting results for two scenes: Chair in Living Room and Office.

Comparisons

We compare our method (D3DR) against three baselines on synthetic and real-world scenes: Copy-Paste (direct background substitution), Instruct-GS2GS (iGS2GS; Gaussian splatting with diffusion-based editing), and Latent Bridge Matching (LBM; diffusion bridge matching in latent space). Our approach produces more consistent relighting, preserving scene-specific shading, shadows, and material appearance under diverse illumination conditions.

Quantitative Results

We evaluate D3DR against five baselines across both synthetic and real-world datasets. Baselines include Copy-Paste (direct placement without harmonization), TIP-Editor (SDS-based 3DGS object generation), Relightable 3D Gaussians (R3DG) (physics-based relighting via BRDF decomposition), Latent Bridge Matching (LBM) (state-of-the-art 2D harmonization lifted to 3D), and Instruct-GS2GS (InstructPix2Pix-based 3DGS editing). Bold values indicate the best result per metric.

Table 1 — Comparison with Baselines

PSNRpart: PSNR over object pixels only
PSNRcropped: PSNR within the object bounding box
SSIMcropped: SSIM within the object bounding box
CTIS: CLIP text–image similarity
DTIS: CLIP transformation alignment

| Dataset | Metric | D3DR (Ours) | Copy-Paste | LBM | TIP-Editor | R3DG | Instruct-GS2GS |
|---|---|---|---|---|---|---|---|
| Synthetic | PSNRpart ↑ | 11.966 | 6.519 | 10.075 | 6.960 | 8.598 | 6.892 |
| Synthetic | PSNRcropped ↑ | 18.039 | 13.032 | 16.271 | 12.502 | 14.454 | 13.360 |
| Synthetic | SSIMcropped ↑ | 0.640 | 0.582 | 0.638 | 0.439 | 0.449 | 0.526 |
| Synthetic | CTIS ↑ | 0.646 | 0.642 | 0.643 | 0.619 | 0.639 | 0.644 |
| Synthetic | DTIS ↑ | 0.529 | 0.529 | 0.526 | 0.507 | 0.527 | 0.529 |
| Real | CTIS ↑ | 0.643 | 0.638 | 0.638 | 0.625 | 0.623 | 0.641 |
| Real | DTIS ↑ | 0.510 | 0.505 | 0.506 | 0.497 | 0.501 | 0.510 |

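D3DR achieves the best PSNRpart, PSNRcropped, and SSIMcropped on the synthetic dataset and the best CTIS on both datasets. On DTIS it ties for the best score at the reported precision (with Copy-Paste and Instruct-GS2GS on synthetic data, and with Instruct-GS2GS on real data). PSNR and SSIM are not reported for real-world data due to inaccuracies in ground-truth camera poses. (↑) higher is better.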
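
To make the difference between the two PSNR variants concrete, here is a minimal NumPy sketch. It assumes images are float arrays in [0, 1] and that a boolean object mask is available per view; the function names (`psnr`, `bbox_from_mask`, `psnr_part`, `psnr_cropped`) are hypothetical and not taken from the authors' code, and the actual evaluation script may differ in details such as the peak value or box definition.

```python
import numpy as np

def psnr(pred, target, mask=None, max_val=1.0):
    """PSNR in dB between two images; an optional boolean HxW mask selects pixels."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    sq_err = (pred - target) ** 2
    if mask is not None:
        # Boolean indexing keeps only the selected pixels (all channels).
        sq_err = sq_err[np.asarray(mask, dtype=bool)]
    mse = sq_err.mean()
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(max_val ** 2 / mse))

def bbox_from_mask(mask):
    """Tight axis-aligned box (y0, y1, x0, x1), end-exclusive, around True pixels."""
    ys, xs = np.nonzero(mask)
    return int(ys.min()), int(ys.max()) + 1, int(xs.min()), int(xs.max()) + 1

def psnr_part(pred, target, obj_mask):
    """PSNR over object pixels only (assumed meaning of PSNRpart)."""
    return psnr(pred, target, mask=obj_mask)

def psnr_cropped(pred, target, obj_mask):
    """PSNR over every pixel inside the object's bounding box (assumed PSNRcropped)."""
    y0, y1, x0, x1 = bbox_from_mask(obj_mask)
    return psnr(pred[y0:y1, x0:x1], target[y0:y1, x0:x1])
```

Because the cropped variant also averages over non-object pixels inside the box, it is generally higher than the object-only variant when the error is concentrated on the object, which matches the ordering of the two rows in Table 1.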

Table 2 — Efficiency Comparison

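| Metric | D3DR (Ours) | Copy-Paste | LBM | TIP-Editor | R3DG | Instruct-GS2GS |
|---|---|---|---|---|---|---|
| Training Time (min) ↓ | 40 | 0 | 24 | 140 | 185 | 37 |
| Storage (GB) ↓ | 0.076 | 0.076 | 0.076 | 0.097 | 0.955 | 0.076 |
| NGaussians (×10^6) ↓ | 0.330 | 0.330 | 0.330 | 1.870 | 1.970 | 0.330 |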

D3DR trains more than 3× faster than TIP-Editor and R3DG while using less storage and five to six times fewer Gaussians. Copy-Paste requires no training but performs no harmonization. (↓) lower is better.

Table 3 — Shadow & Background Metrics (Synthetic Dataset)

PSNRshadow: PSNR on pixels near the object (within 1.20× bounding box), object excluded
PSNRbg: PSNR on background pixels
SSIMshadow: SSIM on the shadow region
SSIMbg: SSIM on background pixels

| Metric | D3DR (Ours) | Copy-Paste | LBM | TIP-Editor | R3DG | Instruct-GS2GS |
|---|---|---|---|---|---|---|
| PSNRshadow | 19.993 | 19.023 | 19.544 | 11.899 | 14.899 | 17.923 |
| PSNRbg | 23.383 | 24.169 | 23.533 | 16.786 | 15.232 | 20.238 |
| SSIMshadow | 0.890 | 0.924 | 0.894 | 0.768 | 0.738 | 0.802 |
| SSIMbg | 0.910 | 0.948 | 0.902 | 0.815 | 0.679 | 0.780 |

D3DR achieves the best PSNRshadow, indicating accurate shadow generation. Copy-Paste scores highest on both background metrics, and also on SSIMshadow, because it makes no modifications to the scene. (↑) higher is better.
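
The shadow region described above (pixels within the 1.20× bounding box, object excluded) could be built roughly as follows. This is a sketch under stated assumptions: the box is the tight 2D image-space box around the object mask, it is scaled about its center along both axes, and it is clipped to the image. The helper name `shadow_region_mask` is hypothetical, and the authors' exact construction may differ.

```python
import numpy as np

def shadow_region_mask(obj_mask, scale=1.20):
    """Boolean mask of pixels inside the enlarged object box but outside the object."""
    obj_mask = np.asarray(obj_mask, dtype=bool)
    h, w = obj_mask.shape

    # Tight, end-exclusive bounding box of the object.
    ys, xs = np.nonzero(obj_mask)
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1

    # Scale the box about its center (assumption) and clip to the image.
    cy, cx = (y0 + y1) / 2.0, (x0 + x1) / 2.0
    half_h = scale * (y1 - y0) / 2.0
    half_w = scale * (x1 - x0) / 2.0
    ey0 = max(int(np.floor(cy - half_h)), 0)
    ey1 = min(int(np.ceil(cy + half_h)), h)
    ex0 = max(int(np.floor(cx - half_w)), 0)
    ex1 = min(int(np.ceil(cx + half_w)), w)

    region = np.zeros_like(obj_mask)
    region[ey0:ey1, ex0:ex1] = True
    return region & ~obj_mask  # exclude the object itself
```

The resulting mask can then be used to restrict a per-pixel error, for example by averaging the squared error only where the mask is True before converting to PSNR.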

📌 All PSNR and SSIM metrics are averaged across all scenes within each dataset. CLIP-based metrics (CTIS, DTIS) are normalized to [0, 1].
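
As an illustration of how CLIP-based scores of this kind are commonly computed, the sketch below works on precomputed embedding vectors. Everything here is an assumption rather than the authors' implementation: the page does not state how the normalization to [0, 1] is done, so the affine map (c + 1) / 2 of the cosine similarity is only one plausible choice, and treating DTIS as a directional similarity between image-space and text-space changes is likewise a guess. All function names are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def to_unit_interval(c):
    """Map a cosine similarity in [-1, 1] to [0, 1] (assumed normalization)."""
    return 0.5 * (c + 1.0)

def text_image_similarity(img_emb, txt_emb):
    """CTIS-like score: similarity between an image and a text embedding."""
    return to_unit_interval(cosine(img_emb, txt_emb))

def directional_similarity(img_src, img_edit, txt_src, txt_edit):
    """DTIS-like score: alignment of the image change with the text change."""
    d_img = np.asarray(img_edit, dtype=np.float64) - np.asarray(img_src, dtype=np.float64)
    d_txt = np.asarray(txt_edit, dtype=np.float64) - np.asarray(txt_src, dtype=np.float64)
    return to_unit_interval(cosine(d_img, d_txt))
```

Under this mapping a score of 0.5 corresponds to orthogonal embeddings, so the narrow spread of CTIS and DTIS values in Table 1 should be read with that baseline in mind if the same normalization applies.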

BibTeX

@article{
  skorokhodov2026diffusion,
  title={Diffusion Models are Secretly Zero-Shot 3{DGS} Harmonizers},
  author={Vsevolod Skorokhodov and Nikita Durasov and Pascal Fua},
  journal={Transactions on Machine Learning Research},
  issn={2835-8856},
  year={2026},
  url={https://openreview.net/forum?id=1jjIitxVmM},
  note={}
}