Technical Architecture

How it works.

Our enterprise-grade background removal engine combines Dichotomous Image Segmentation (DIS) with Vision Transformer-based alpha matting to achieve state-of-the-art edge accuracy at production speed.

01 — Model Architecture

Hybrid DIS + Matting network.

We combine two complementary state-of-the-art approaches into a unified pipeline that delivers both robust object detection and fine-grained alpha channel resolution.

Input Image (any resolution)
1 Coarse Segmentation — IS-Net (DIS-v5.0)
  • Intermediate supervision with GT Encoder
  • Multi-scale feature extraction
  • Output: Binary mask + auto-generated trimap
2 Alpha Matting — ViTMatte
  • Vision Transformer backbone
  • Detail Capture Module (DCM) for fine edges
  • Hybrid attention mechanism
  • Input: Original image + trimap from Stage 1
  • Output: High-precision alpha matte
3 Post-Processing
  • Guided filter edge refinement
  • Adaptive edge feathering
  • Background color decontamination
  • Alpha compositing & output encoding
Output: RGBA PNG / WebP
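
To make the final "alpha compositing" step concrete, here is a minimal sketch of how an alpha matte can be attached to an image and blended over a new background using the standard "over" formula, out = alpha * foreground + (1 - alpha) * background. It assumes NumPy, and the function names `to_rgba` and `composite_over` are illustrative only, not part of our API.

```python
import numpy as np

def to_rgba(rgb: np.ndarray, alpha: np.ndarray) -> np.ndarray:
    """Attach a float alpha matte in [0, 1] to a uint8 RGB image -> uint8 RGBA."""
    a8 = np.clip(np.rint(alpha * 255.0), 0, 255).astype(np.uint8)
    return np.dstack([rgb, a8])

def composite_over(rgba: np.ndarray, background_rgb) -> np.ndarray:
    """Blend an RGBA cutout over a solid background color (the 'over' operator)."""
    alpha = rgba[..., 3:4].astype(np.float32) / 255.0
    fg = rgba[..., :3].astype(np.float32)
    bg = np.asarray(background_rgb, dtype=np.float32)  # broadcasts over H x W
    out = alpha * fg + (1.0 - alpha) * bg
    return np.rint(out).astype(np.uint8)

# Usage: a 2x2 orange image with one transparent and one half-transparent pixel.
rgb = np.full((2, 2, 3), (200, 100, 0), dtype=np.uint8)
alpha = np.array([[1.0, 0.0], [0.5, 1.0]], dtype=np.float32)
cutout = to_rgba(rgb, alpha)
on_white = composite_over(cutout, (255, 255, 255))
```

Fully opaque pixels keep the subject's color, fully transparent pixels take the background color, and partially transparent pixels land in between.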

Stage 1: IS-Net dichotomous segmentation.

IS-Net introduces intermediate supervision through a GT (ground truth) encoder that guides the segmentation network toward structurally accurate predictions. Unlike standard salient object detection (SOD) models, it is trained on the DIS5K dataset, which contains high-resolution images with intricate object boundaries.

The model generates a binary segmentation mask that is then automatically converted into a trimap: a three-region map that identifies definite foreground, definite background, and the critical "uncertainty zone" where alpha matting will resolve the fine details.
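
One common way to derive a trimap from a binary mask is to erode the mask to get definite foreground and dilate it to get the outer limit of the uncertainty zone. The sketch below assumes that approach, NumPy, a square structuring element, and the 0 / 128 / 255 label convention; the function names and the `radius` parameter are hypothetical, and the production pipeline may differ.

```python
import numpy as np

def _dilate(mask: np.ndarray, radius: int) -> np.ndarray:
    """Binary dilation of a boolean mask with a (2r+1) x (2r+1) square."""
    padded = np.pad(mask, radius, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    h, w = mask.shape
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def mask_to_trimap(mask: np.ndarray, radius: int = 2) -> np.ndarray:
    """Return a uint8 trimap: 0 = background, 128 = unknown, 255 = foreground."""
    fg = mask.astype(bool)
    # Erosion is the complement of the dilated complement. Pixels outside the
    # image count as foreground here, so the image border does not erode the mask.
    sure_fg = ~_dilate(~fg, radius)
    maybe_fg = _dilate(fg, radius)
    trimap = np.zeros(fg.shape, dtype=np.uint8)
    trimap[maybe_fg] = 128   # band around the mask edge
    trimap[sure_fg] = 255    # interior that survives erosion
    return trimap

# Usage: a 6x6 square object inside a 12x12 frame.
mask = np.zeros((12, 12), dtype=np.uint8)
mask[3:9, 3:9] = 1
trimap = mask_to_trimap(mask, radius=2)
```

A larger `radius` widens the uncertainty zone, which hands more pixels to the matting stage at the cost of more work for it.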

Stage 2: ViTMatte alpha matting.

ViTMatte leverages a plain Vision Transformer (ViT) backbone with a lightweight Detail Capture Module (DCM) to produce high-quality alpha mattes. The DCM injects fine-grained spatial details back into the ViT's feature maps, which is crucial for resolving complex boundaries like hair strands, fur, and semi-transparent materials.

The hybrid attention mechanism combines global context from the transformer with local detail from convolutional layers, enabling the model to handle both large structural boundaries and fine-detail edges in a single forward pass.

02 — Edge Cases

Handling the hard cases.

Enterprise-grade background removal means handling cases that break simpler tools — low-contrast boundaries, motion blur, transparency, and complex textures.

Low-contrast boundaries

Multi-scale feature extraction with contrast-adaptive normalization lets the network detect boundaries independently of local contrast ratios.

Hair & fur detail

Trimap-guided matting with learned affinity propagation resolves individual strands without the "halo" artifacts common in threshold-based tools.

Semi-transparent objects

An alpha-aware loss function, combined with training data augmented with glass, smoke, and veil examples, produces true partial transparency rather than binary cutoffs.
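
As an illustration of what an alpha-aware objective can look like, the sketch below computes a mean absolute error between predicted and reference alpha values, restricted to the trimap's unknown region. This is a common matting loss and only an assumption about the general idea; the function name, the 128 label for "unknown", and the choice of L1 are hypothetical and not a description of our training code.

```python
import numpy as np

def unknown_region_l1(pred_alpha: np.ndarray,
                      gt_alpha: np.ndarray,
                      trimap: np.ndarray,
                      unknown_value: int = 128) -> float:
    """Mean |pred - gt| over pixels the trimap marks as unknown."""
    unknown = trimap == unknown_value
    if not unknown.any():
        return 0.0  # nothing to score when the trimap has no unknown pixels
    return float(np.abs(pred_alpha - gt_alpha)[unknown].mean())

# Usage: the middle pixel is 50% transparent in the reference matte.
gt = np.array([[1.0, 0.5, 0.0]])
trimap = np.array([[255, 128, 0]], dtype=np.uint8)
binary_pred = np.array([[1.0, 1.0, 0.0]])   # hard cutoff, no partial alpha
loss_binary = unknown_region_l1(binary_pred, gt, trimap)
loss_exact = unknown_region_l1(gt, gt, trimap)
```

A prediction that snaps the semi-transparent pixel to fully opaque is penalized, while one that reproduces the partial alpha is not, which is the behavior that discourages binary cutoffs.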

Motion blur

Deblurring pre-processing with an adaptive Wiener filter stabilizes boundary detection before segmentation, maintaining accuracy on action shots.
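
For readers unfamiliar with Wiener deblurring, here is a simplified frequency-domain sketch. It assumes a known horizontal motion-blur kernel and a fixed noise-to-signal constant `nsr`, so it is the classical, non-adaptive form rather than the adaptive variant described above; `motion_kernel` and `wiener_deblur` are illustrative names.

```python
import numpy as np

def motion_kernel(length: int, shape: tuple) -> np.ndarray:
    """Horizontal motion-blur PSF, zero-padded to `shape`, summing to 1."""
    psf = np.zeros(shape, dtype=np.float64)
    psf[0, :length] = 1.0 / length
    return psf

def wiener_deblur(blurred: np.ndarray, psf: np.ndarray, nsr: float = 1e-2) -> np.ndarray:
    """Wiener deconvolution: F_hat = conj(H) / (|H|^2 + nsr) * G."""
    H = np.fft.fft2(psf)
    G = np.fft.fft2(blurred)
    W = np.conj(H) / (np.abs(H) ** 2 + nsr)
    return np.real(np.fft.ifft2(W * G))

# Usage: blur a synthetic image with circular convolution, then restore it.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
psf = motion_kernel(5, image.shape)
blurred = np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(psf)))
restored = wiener_deblur(blurred, psf)
```

The `nsr` term keeps the filter from amplifying frequencies the blur has nearly erased. An adaptive variant would estimate that term from the image instead of fixing it.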

03 — Performance

Under 2 seconds. Every time.

Speed without sacrifice. Our optimization stack ensures enterprise-grade latency without compromising edge quality.

Model Optimization
  • INT8 quantization for the segmentation model
  • FP16 precision for alpha matting (preserves accuracy)
  • ONNX Runtime with TensorRT execution provider
Infrastructure
  • Dynamic batching with adaptive resolution tiling
  • CDN-edge inference nodes (<100 ms network latency)
  • WebAssembly fallback for client-side processing
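
The precision trade-off behind the INT8 / FP16 split can be seen in a small numerical sketch. It assumes symmetric per-tensor INT8 quantization, which is one common scheme and not necessarily the one used in production, and compares its round-trip error on alpha-like values in [0, 1] with a plain FP16 cast. All function names are illustrative.

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization; returns (int8 values, scale)."""
    scale = float(np.max(np.abs(x))) / 127.0 or 1.0  # avoid a zero scale
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Map INT8 values back to float32."""
    return q.astype(np.float32) * scale

def roundtrip_fp16(x: np.ndarray) -> np.ndarray:
    """Cast to half precision and back to float32."""
    return x.astype(np.float16).astype(np.float32)

# Usage: compare worst-case round-trip error on values spanning [0, 1].
values = np.linspace(0.0, 1.0, 1001, dtype=np.float32)
q, scale = quantize_int8(values)
int8_err = float(np.max(np.abs(dequantize_int8(q, scale) - values)))
fp16_err = float(np.max(np.abs(roundtrip_fp16(values) - values)))
```

On this range the INT8 round trip is limited to 255 distinct levels, while FP16 resolves far finer steps, which is consistent with keeping the alpha matting stage in FP16.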

See it in action.

Try our background removal engine free — no sign-up required.
