Inpainting, Restoration, Editing

A comprehensive overview of image manipulation techniques

Last Updated: January 17, 2026

Introduction

In the context of diffusion models, Inpainting, Restoration, and Editing are three overlapping domains that share the same underlying generative “engine” but differ significantly in their goals, constraints, and mathematical implementations.

While all three tasks leverage the powerful denoising capabilities of diffusion models, they each address distinct challenges:

  • Inpainting focuses on filling “holes” or replacing specific regions
  • Restoration aims to remove degradations and recover clean images
  • Editing seeks to modify semantic content or style according to user instructions

This blog post explores how diffusion models unify these tasks while highlighting their unique technical approaches and mathematical foundations.

Table of Contents

  1. The Core Definitions
  2. Technical Differences
  3. The Connections
  4. Summary Comparison
  5. Applications
  6. Future Directions

The Core Definitions

The three domains can be distinguished by their primary goals, user inputs, and evaluation metrics:

| Feature | Inpainting | Restoration | Editing |
| --- | --- | --- | --- |
| Primary Goal | Filling “holes” or replacing specific regions | Removing degradations (noise, blur, scratches) | Modifying semantic content or style |
| User Input | A binary mask + (often) a text prompt | A degraded image (blurry, noisy, low-res) | A text instruction or prompt change |
| Key Metric | Local coherence and edge blending | Fidelity to the original “clean” version | Instruction following while preserving identity |

Technical Differences

While all three tasks leverage diffusion models, they employ different architectural modifications and mathematical frameworks to achieve their specific goals.

Inpainting: Localized Generation

Inpainting focuses on generating new pixels for a specific area while keeping the rest of the image (the “background”) untouched.

Specialized Architectures: Many models (like SD-Inpainting) use a modified U-Net with 5 additional input channels: 4 for the encoded masked image and 1 for the mask itself, for a total of nine channels instead of the usual four. This allows the model to “see” exactly where the hole is and what context surrounds it.
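
To make the channel layout concrete, here is a minimal PyTorch sketch of how such a nine-channel input could be assembled. The tensor shapes follow Stable Diffusion’s 64×64, 4-channel latent space, and the random tensors stand in for real VAE encodings; this is an illustration, not any model’s actual preprocessing code.

```python
import torch

# Sketch: the 9-channel input an inpainting-specialized U-Net receives.
noisy_latent  = torch.randn(1, 4, 64, 64)   # current noisy latent x_t
masked_latent = torch.randn(1, 4, 64, 64)   # VAE encoding of the image with the hole blanked out
mask          = torch.zeros(1, 1, 64, 64)   # 1 inside the hole, 0 elsewhere (resized to latent resolution)
mask[:, :, 16:48, 16:48] = 1.0

# 4 (noisy latent) + 4 (masked image) + 1 (mask) = 9 input channels.
unet_input = torch.cat([noisy_latent, masked_latent, mask], dim=1)
print(unet_input.shape)  # torch.Size([1, 9, 64, 64])
```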

Zero-Shot (Blending): Techniques like Blended Diffusion don’t require a special model. Instead, at each denoising step, they keep the model’s output only inside the masked area and “stitch” it onto a correspondingly noised copy of the original image in the unmasked area. This approach enables inpainting with any pre-trained diffusion model without fine-tuning.
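
A minimal sketch of that stitching step is shown below, assuming a standard DDPM-style schedule where `alpha_bar_t` is the cumulative noise coefficient at step t; the function name and arguments are illustrative, not from any particular library.

```python
import torch

def blend_step(x_t_generated, x0_original, mask, alpha_bar_t):
    """Blended-Diffusion-style stitching at one denoising step (a sketch).

    x_t_generated: the model's latent at step t (free to generate everywhere)
    x0_original  : the clean original image or latent
    mask         : 1 where new content is generated, 0 where the original is kept
    alpha_bar_t  : cumulative noise-schedule coefficient at step t
    """
    # Forward-diffuse the original so both pieces sit at the same noise level.
    noise = torch.randn_like(x0_original)
    x_t_known = (alpha_bar_t ** 0.5) * x0_original + ((1 - alpha_bar_t) ** 0.5) * noise
    # Keep the generated content inside the mask, the (noised) original outside.
    return mask * x_t_generated + (1 - mask) * x_t_known
```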

Restoration: Solving Inverse Problems

Restoration is often treated as a linear inverse problem ($y = Ax + n$), where $y$ is the degraded image, $A$ is a known degradation operator (blurring, downsampling, masking), $n$ is noise, and $x$ is the clean image you want to find.
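
To make the notation concrete, here is a small PyTorch sketch of the forward model for 4× super-resolution, where $A$ is an average-pooling downsampler; the operator and noise level are illustrative choices.

```python
import torch
import torch.nn.functional as F

# y = A x + n for 4x super-resolution: A downsamples, n is mild sensor noise.
x = torch.rand(1, 3, 256, 256)          # unknown clean image we want to recover
y = F.avg_pool2d(x, kernel_size=4)      # A x : the low-resolution measurement
y = y + 0.01 * torch.randn_like(y)      # + n : small additive noise
print(y.shape)                          # torch.Size([1, 3, 64, 64])
```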

Mathematical Approach: Methods like DDRM (Denoising Diffusion Restoration Models) or DDNM leverage pre-trained diffusion models by ensuring the generated result remains mathematically consistent with the degraded input (using techniques like Range-Null space decomposition). This ensures that the restored image, when degraded again, matches the original degraded input.
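
The sketch below shows the consistency projection for the special case where $A$ simply masks out pixels (so $A$ coincides with its own pseudo-inverse); the actual DDRM/DDNM methods handle general linear operators via an SVD of $A$, which is omitted here.

```python
import torch

def project_data_consistent(x_gen, y, mask):
    """Range-null space style projection for a pixel-masking operator A (a sketch).

    x_gen: the diffusion model's current estimate of the clean image
    y    : degraded observation y = A x + n (here: an image with missing pixels)
    mask : 1 where pixels were observed, 0 where they were lost
    """
    # A^+ y restores the observed (range-space) pixels exactly;
    # (I - A^+ A) x_gen keeps the model's guess only where nothing was observed.
    return mask * y + (1 - mask) * x_gen
```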

Low-Level Focus: Restoration prioritizes recovering lost data (like sharpness in super-resolution) rather than creating entirely new objects. The goal is fidelity to what the image “should have been” before degradation, not creative generation.

Editing: Semantic Transformation

Editing is the broadest category. It aims to change the “meaning” of an image (e.g., “make the cat a tiger”).

Mask-Free Editing: Unlike inpainting, modern editing (e.g., InstructPix2Pix, Prompt-to-Prompt) often doesn’t use a mask. Instead, it manipulates cross-attention maps. By freezing the attention maps of the original image, the model can change the “subject” while keeping the “composition” (pose, lighting, background) identical.
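
A toy sketch of the attention-injection idea follows (single head, projection layers omitted); `frozen_attn` is the attention map cached from the source-prompt pass and is an illustrative name, not an API from any library.

```python
import torch

def cross_attention(q, k, v, frozen_attn=None):
    """Prompt-to-Prompt-style cross-attention injection (toy sketch).

    q           : image-token queries
    k, v        : text-token keys and values for the *edited* prompt
    frozen_attn : attention map saved from the source prompt's pass; if given,
                  it overrides the new map so spatial composition is preserved.
    """
    attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    if frozen_attn is not None:
        attn = frozen_attn  # re-use the source layout; new values carry the new content
    return attn @ v
```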

Inversion: For real images, editing typically starts with DDIM Inversion, which “reverses” the image back into a specific noise latent so that modifications can be applied during the re-generation process. This allows precise control over which parts of the image change while preserving others.
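
One deterministic DDIM inversion step looks roughly like this (eta = 0, with `alpha_bar` denoting the cumulative schedule coefficients; the variable names are illustrative).

```python
import torch

def ddim_inversion_step(x_t, eps_pred, alpha_bar_t, alpha_bar_next):
    """One DDIM inversion step (a sketch): map x_t to the *noisier* x_{t+1}.

    Running these steps from the clean image up to pure noise yields a latent
    that, when denoised again with the same model, reconstructs the original.
    """
    # Predict x_0 from the current sample and the model's noise estimate.
    x0_pred = (x_t - ((1 - alpha_bar_t) ** 0.5) * eps_pred) / (alpha_bar_t ** 0.5)
    # Re-noise deterministically to the next (higher) noise level.
    return (alpha_bar_next ** 0.5) * x0_pred + ((1 - alpha_bar_next) ** 0.5) * eps_pred
```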

The Connections

The three tasks are deeply intertwined and often use each other as building blocks:

1. Inpainting as a Tool for Editing

Most complex editing workflows use inpainting as the final step. For example, DiffEdit automatically generates a mask by comparing the original and new prompts, then uses inpainting to swap the object within that mask. This approach combines the semantic understanding of editing with the spatial precision of inpainting.
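 
The mask-estimation idea can be sketched as follows: the same noised image is denoised under both prompts, and the mask is taken wherever the two noise predictions disagree most. The threshold value below is an illustrative choice.

```python
import torch

def diffedit_mask(eps_source, eps_target, threshold=0.5):
    """DiffEdit-style automatic mask estimation (a sketch).

    eps_source, eps_target: noise predictions for the same noised image,
    conditioned on the original prompt and the edited prompt respectively.
    """
    diff = (eps_source - eps_target).abs().mean(dim=1, keepdim=True)  # per-pixel disagreement
    diff = diff / (diff.max() + 1e-8)                                 # normalize to [0, 1]
    return (diff > threshold).float()                                 # binary mask of what should change
```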

2. Inpainting as a Subset of Restoration

Technically, inpainting is a form of restoration where the “degradation” is a 100% loss of pixels in a specific area. Restoration models (like DDRM) can often perform inpainting, colorization, and super-resolution using the same mathematical framework. The key insight is that all these tasks can be formulated as inverse problems with different degradation operators $A$.
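
As an illustration of that insight, only the operator $A$ changes from task to task while the rest of the framework stays fixed; the three toy operators below (names are ours) correspond to inpainting, super-resolution, and colorization.

```python
import torch
import torch.nn.functional as F

# Three degradation operators A, one per task; the restoration machinery is shared.
def A_inpaint(x, mask):            # inpainting: lose the pixels under the mask
    return x * mask

def A_downsample(x, factor=4):     # super-resolution: observe only a low-res version
    return F.avg_pool2d(x, kernel_size=factor)

def A_grayscale(x):                # colorization: observe only the luminance channel
    return x.mean(dim=1, keepdim=True)
```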

3. Restoration as a Foundation for Generation

Face restoration (e.g., CodeFormer or GFPGAN) is frequently used as a post-processing step for both inpainting and editing to fix the “uncanny” or distorted faces that diffusion models sometimes produce. This demonstrates how restoration techniques enhance the quality of generative tasks.

Summary Comparison

The three domains can be succinctly characterized as:

  • Inpainting is about where to change (defined by a mask).
  • Restoration is about what to fix (defined by the degradation).
  • Editing is about what to become (defined by the instruction).

Despite their differences, all three leverage the same powerful generative capabilities of diffusion models, unified by the shared goal of producing high-quality, realistic images that meet specific user requirements.

Applications

These techniques find applications in:

  • Photography: Remove unwanted objects, restore old photos
  • Film and media: Visual effects and post-production
  • Medical imaging: Enhance diagnostic images
  • Forensics: Restore and analyze evidence
  • Art and design: Creative image manipulation

Future Directions

The field continues to evolve with:

  • Real-time processing: Faster inference for interactive applications
  • Better control: More precise manipulation capabilities
  • Multimodal understanding: Integration of text, images, and other modalities
  • Efficiency: Smaller models with comparable performance

References

This section will be populated with relevant citations as the blog post is developed.