Technical Architecture

How it works.

Our enterprise-grade background removal engine combines Dichotomous Image Segmentation (DIS) with Vision Transformer-based alpha matting to achieve state-of-the-art edge accuracy at production speed.

01 — Model Architecture

Hybrid DIS + Matting network.

We combine two complementary state-of-the-art approaches into a unified pipeline that delivers both robust object detection and fine-grained alpha channel resolution.

Input Image (any resolution)

1 Coarse Segmentation — IS-Net (DIS-v5.0)

• Intermediate supervision with GT Encoder
• Multi-scale feature extraction
• Output: Binary mask + auto-generated trimap

2 Alpha Matting — ViTMatte

• Vision Transformer backbone
• Detail Capture Module (DCM) for fine edges
• Hybrid attention mechanism
• Input: Original image + trimap from Stage 1
• Output: High-precision alpha matte

3 Post-Processing

• Guided filter edge refinement
• Adaptive edge feathering
• Background color decontamination
• Alpha compositing & output encoding

Output: RGBA PNG / WebP
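
As an illustration of the color decontamination and alpha compositing steps listed under Post-Processing, here is a minimal, dependency-free sketch. It assumes the standard compositing equation I = alpha * F + (1 - alpha) * B, color channels in the range 0 to 1, and a known or estimated background color. The function names `decontaminate` and `composite` are hypothetical and are not taken from the engine itself.

```python
def decontaminate(pixel, alpha, bg_estimate):
    """Estimate the true foreground color of one pixel.

    Inverts the compositing equation I = alpha*F + (1 - alpha)*B for F.
    `pixel` and `bg_estimate` are (r, g, b) tuples with channels in [0, 1].
    """
    if alpha <= 0.0:
        # Fully transparent pixel: its color never shows, so return black.
        return (0.0, 0.0, 0.0)
    return tuple(
        # Clamp to [0, 1] because a noisy background estimate can overshoot.
        min(1.0, max(0.0, (i - (1.0 - alpha) * b) / alpha))
        for i, b in zip(pixel, bg_estimate)
    )


def composite(foreground, alpha, new_bg):
    """Blend a foreground color over a new background with the given alpha."""
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(foreground, new_bg))
```

For example, a half-transparent red pixel photographed against green is observed as (0.5, 0.5, 0.0). Decontaminating it with alpha 0.5 recovers pure red, which can then be composited over any new background without a green fringe.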

Stage 1: IS-Net Dichotomous Segmentation

IS-Net introduces intermediate supervision through a GT (Ground Truth) Encoder that guides the segmentation network toward structurally accurate predictions. Unlike standard salient object detection (SOD) models, IS-Net is trained on the DIS5K dataset, which contains high-resolution images with intricate object boundaries.

The model generates a robust binary segmentation mask which is then automatically converted into a trimap — a three-region map that identifies definite foreground, definite background, and the critical "uncertainty zone" where alpha matting will resolve the fine details.
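
The mask-to-trimap conversion described above can be sketched as follows. This is a minimal illustration, not the engine's actual implementation: it assumes the common approach of marking a pixel as uncertain whenever its neighborhood contains both foreground and background, and it uses the conventional 0 / 128 / 255 trimap encoding. The function name `mask_to_trimap` and the `radius` parameter are hypothetical.

```python
FOREGROUND, UNKNOWN, BACKGROUND = 255, 128, 0


def mask_to_trimap(mask, radius=1):
    """Convert a binary mask (rows of 0/1 values) into a trimap.

    A pixel becomes UNKNOWN when its (2*radius + 1)-square neighborhood
    contains both foreground and background pixels, which creates a band
    of uncertainty around every mask boundary.
    """
    height, width = len(mask), len(mask[0])
    trimap = []
    for y in range(height):
        row = []
        for x in range(width):
            # Gather neighborhood values, clipping the window at the border.
            window = [
                mask[ny][nx]
                for ny in range(max(0, y - radius), min(height, y + radius + 1))
                for nx in range(max(0, x - radius), min(width, x + radius + 1))
            ]
            if all(window):
                row.append(FOREGROUND)   # every neighbor is foreground
            elif not any(window):
                row.append(BACKGROUND)   # every neighbor is background
            else:
                row.append(UNKNOWN)      # mixed: left for the matting stage
        trimap.append(row)
    return trimap
```

A larger `radius` widens the uncertainty band, which gives the matting stage more room to resolve hair or fur at the cost of more pixels to estimate.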

Stage 2: ViTMatte Alpha Matting

ViTMatte leverages a plain Vision Transformer (ViT) backbone with a lightweight Detail Capture Module (DCM) to produce high-quality alpha mattes. The DCM injects fine-grained spatial details back into the ViT's feature maps, which is crucial for resolving complex boundaries like hair strands, fur, and semi-transparent materials.

The hybrid attention mechanism mixes windowed and global attention inside the transformer to capture context at moderate cost, while convolutional layers contribute local detail. Together they let the model handle both large structural boundaries and fine-detail edges in a single forward pass.

02 — Edge Cases

Handling the hard cases.

Enterprise-grade background removal means handling cases that break simpler tools — low-contrast boundaries, motion blur, transparency, and complex textures.

Low-contrast boundaries

Multi-scale feature extraction is paired with contrast-adaptive normalization, so the network learns to detect boundaries independently of local contrast.

Hair & fur detail

Trimap-guided matting with learned affinity propagation resolves individual strands without the "halo" artifacts common in threshold-based tools.

Semi-transparent objects

An alpha-aware loss function, combined with training-data augmentation covering glass, smoke, and veils, produces true partial transparency rather than binary cutoffs.
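
The exact loss is not specified here. As one plausible illustration, the sketch below computes a mean absolute error on alpha values restricted to the trimap's uncertain region, a formulation commonly used when training matting models. The function name `unknown_region_l1` and the 128 encoding for uncertain pixels are assumptions.

```python
def unknown_region_l1(pred_alpha, true_alpha, trimap, unknown=128):
    """Mean absolute alpha error over pixels the trimap marks as uncertain.

    All three arguments are equally sized lists of rows. Alpha values lie
    in [0, 1]; trimap values equal to `unknown` select the pixels to score.
    """
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred_alpha, true_alpha, trimap):
        for p, t, m in zip(p_row, t_row, m_row):
            if m == unknown:
                total += abs(p - t)
                count += 1
    # With no uncertain pixels there is nothing to penalize.
    return total / count if count else 0.0
```

Under a loss of this kind, predicting a hard 1.0 where the true alpha is 0.5 costs 0.5 per pixel, so a model is pushed toward partial transparency instead of a binary cutoff.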

Motion blur

Deblurring pre-processing with an adaptive Wiener filter stabilizes boundary detection before segmentation, maintaining accuracy on action shots.

03 — Performance

Under 2 seconds. Every time.

Speed without sacrifice. Our optimization stack ensures enterprise-grade latency without compromising edge quality.

Model Optimization

→ INT8 quantization for segmentation model
→ FP16 precision for alpha matting (preserves accuracy)
→ ONNX Runtime with TensorRT execution provider
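
To show what INT8 quantization does to a model's weights, here is a minimal sketch of symmetric per-tensor quantization in plain Python. It illustrates the arithmetic only. Production toolchains add calibration and per-channel scales, and the scheme used by this engine is not stated, so treat the function names and the symmetric scheme as assumptions.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to integers in [-127, 127].

    Returns the quantized integers and the scale needed to map them back.
    """
    max_abs = max(abs(v) for v in values)
    # Avoid division by zero for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Map quantized integers back to approximate float values."""
    return [q * scale for q in quantized]
```

The round-trip error per value is at most half of `scale`. That is usually acceptable for a coarse segmentation mask, while the finer gradations of an alpha matte are a reason to keep the matting stage in FP16.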

Infrastructure

→ Dynamic batching with adaptive resolution tiling
→ CDN-edge inference nodes (< 100 ms network latency)
→ WebAssembly fallback for client-side processing
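
Resolution tiling can be sketched as splitting a large image into overlapping boxes that are processed separately and blended at the seams. The example below uses a fixed tile size for simplicity, so it does not reproduce the adaptive part of the scheme. The function name `tile_boxes` and the default tile and overlap sizes are hypothetical.

```python
def tile_boxes(width, height, tile=1024, overlap=64):
    """Return (left, top, right, bottom) boxes that together cover the image.

    Neighboring tiles share at least `overlap` pixels so seams can be blended.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than tile")
    step = tile - overlap

    def starts(length):
        # An image no larger than one tile needs a single tile at the origin.
        if length <= tile:
            return [0]
        positions = list(range(0, length - tile, step))
        positions.append(length - tile)  # last tile sits flush with the edge
        return positions

    return [
        (x, y, min(x + tile, width), min(y + tile, height))
        for y in starts(height)
        for x in starts(width)
    ]
```

For a 2000 x 1000 image with the defaults, this yields three tiles along the width and one along the height, with the final tile ending exactly at the right edge.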

See it in action.

Try our background removal engine free — no sign-up required.
