How it works.
Our enterprise-grade background removal engine combines Dichotomous Image Segmentation (DIS) with Vision Transformer-based alpha matting to achieve state-of-the-art edge accuracy at production speed.
Hybrid DIS + Matting network.
The pipeline chains two complementary models: a segmentation stage for robust object detection and a matting stage for fine-grained alpha channel resolution.
Stage 1 (IS-Net):
- Intermediate supervision with GT Encoder
- Multi-scale feature extraction
- Output: Binary mask + auto-generated trimap
Stage 2 (ViTMatte):
- Vision Transformer backbone
- Detail Capture Module (DCM) for fine edges
- Hybrid attention mechanism
- Input: Original image + trimap from Stage 1
- Output: High-precision alpha matte
Refinement and output:
- Guided filter edge refinement
- Adaptive edge feathering
- Background color decontamination
- Alpha compositing & output encoding
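To make the last two steps concrete, here is a minimal sketch of alpha compositing and its inverse, which is one common way to approach color decontamination. It assumes the background color behind an edge pixel is already known or estimated. The function names and the `eps` cutoff are illustrative and are not taken from the engine itself.

```python
from typing import Tuple

Color = Tuple[float, float, float]  # RGB, each channel in [0, 1]


def composite(fg: Color, bg: Color, alpha: float) -> Color:
    """Standard alpha compositing: C = alpha * F + (1 - alpha) * B."""
    return tuple(alpha * f + (1.0 - alpha) * b for f, b in zip(fg, bg))


def decontaminate(observed: Color, bg: Color, alpha: float, eps: float = 1e-3) -> Color:
    """Invert the compositing equation to estimate the foreground color.

    F = (C - (1 - alpha) * B) / alpha, clamped to [0, 1].
    For near-zero alpha the pixel is almost pure background, so the
    observed color is returned unchanged to avoid dividing by ~0.
    """
    if alpha < eps:
        return observed
    return tuple(
        min(1.0, max(0.0, (c - (1.0 - alpha) * b) / alpha))
        for c, b in zip(observed, bg)
    )


# A half-transparent dark hair pixel photographed against a green backdrop.
hair = (0.20, 0.10, 0.05)
green = (0.0, 1.0, 0.0)
observed = composite(hair, green, alpha=0.5)  # carries green spill
recovered = decontaminate(observed, green, alpha=0.5)

# Re-composite the cleaned foreground onto a white background.
on_white = composite(recovered, (1.0, 1.0, 1.0), alpha=0.5)
```

Without the decontamination step, the green spill in `observed` would carry over to the new background and show up as a colored fringe around hair and other soft edges.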
Stage 1: IS-Net dichotomous segmentation.
IS-Net introduces intermediate supervision through a novel GT (ground truth) encoder that guides the segmentation network toward structurally accurate predictions. Unlike standard salient object detection (SOD) models, IS-Net is trained on the DIS5K dataset, which contains high-resolution images with intricate object boundaries.
The model generates a robust binary segmentation mask which is then automatically converted into a trimap — a three-region map that identifies definite foreground, definite background, and the critical "uncertainty zone" where alpha matting will resolve the fine details.
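The exact mask-to-trimap conversion is not described here. A common approach is to erode and dilate the binary mask and mark the band between the two results as unknown. The sketch below assumes that approach, a square window, and the conventional 0 / 128 / 255 trimap encoding; `mask_to_trimap` and `radius` are illustrative names.

```python
from typing import List

Mask = List[List[int]]  # 1 = foreground, 0 = background

# Conventional trimap encoding (an assumption, not confirmed by the source).
FOREGROUND, UNKNOWN, BACKGROUND = 255, 128, 0


def _window_values(mask: Mask, y: int, x: int, radius: int) -> List[int]:
    """Mask values in the square window around (y, x), clipped at the borders."""
    h, w = len(mask), len(mask[0])
    return [
        mask[j][i]
        for j in range(max(0, y - radius), min(h, y + radius + 1))
        for i in range(max(0, x - radius), min(w, x + radius + 1))
    ]


def mask_to_trimap(mask: Mask, radius: int = 1) -> Mask:
    """Label each pixel as definite foreground, definite background, or unknown.

    A pixel is definite foreground if its whole window is foreground (erosion),
    definite background if its whole window is background (outside the dilation),
    and unknown otherwise, i.e. within `radius` pixels of the mask boundary.
    """
    trimap = []
    for y in range(len(mask)):
        row = []
        for x in range(len(mask[0])):
            window = _window_values(mask, y, x, radius)
            if all(window):
                row.append(FOREGROUND)
            elif not any(window):
                row.append(BACKGROUND)
            else:
                row.append(UNKNOWN)
        trimap.append(row)
    return trimap


# One image row crossing an object edge.
row_mask = [[0, 0, 0, 1, 1, 1, 1]]
row_trimap = mask_to_trimap(row_mask, radius=1)
```

A larger `radius` widens the uncertainty zone. That gives the matting stage more room to resolve hair or fur, at the cost of more pixels to estimate. A production version would use vectorized morphology from an image library rather than Python loops.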
Stage 2: ViTMatte alpha matting.
ViTMatte leverages a plain Vision Transformer (ViT) backbone with a lightweight Detail Capture Module (DCM) to produce high-quality alpha mattes. The DCM injects fine-grained spatial details back into the ViT's feature maps, which is crucial for resolving complex boundaries like hair strands, fur, and semi-transparent materials.
The hybrid attention mechanism combines global context from the transformer with local detail from convolutional layers, enabling the model to handle both large structural boundaries and fine-detail edges in a single forward pass.
Handling the hard cases.
Enterprise-grade background removal means handling cases that break simpler tools — low-contrast boundaries, motion blur, transparency, and complex textures.
Low-contrast boundaries
Multi-scale feature extraction with contrast-adaptive normalization. The network learns to detect boundaries independently of local contrast ratios.
Hair & fur detail
Trimap-guided matting with learned affinity propagation resolves individual strands without the "halo" artifacts common in threshold-based tools.
Semi-transparent objects
An alpha-aware loss function, combined with training-data augmentation on glass, smoke, and veils, produces true partial transparency rather than binary cutoffs.
Motion blur
Deblurring pre-processing with an adaptive Wiener filter stabilizes boundary detection before segmentation, maintaining accuracy on action shots.
Under 2 seconds. Every time.
Speed without sacrifice. Our optimization stack ensures enterprise-grade latency without compromising edge quality.
- INT8 quantization for segmentation model
- FP16 precision for alpha matting (preserves accuracy)
- ONNX Runtime with TensorRT execution provider
- Dynamic batching with adaptive resolution tiling
- CDN-edge inference nodes (<100 ms network latency)
- WebAssembly fallback for client-side processing
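As a rough illustration of the precision trade-off in the first two items, the sketch below applies symmetric per-tensor INT8 quantization to a handful of values and measures the round-trip error. It is a simplified stand-in, not the quantization scheme the engine uses. Real deployments would rely on the quantization tooling of the inference runtime, and the sample `weights` are made up.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the INT8 codes."""
    return [q * scale for q in quantized]


# Hypothetical weight values, chosen only to show the rounding behavior.
weights = [-1.27, -0.4, 0.0, 0.013, 0.9, 1.27]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)

# The round-trip error is bounded by half a quantization step (scale / 2).
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Small values such as `0.013` lose the most relative precision, because every value snaps to the nearest multiple of `scale`. That kind of rounding matters little for a binary mask but can be visible in a continuous alpha matte, which is consistent with keeping the matting stage at FP16.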