DETR: End-to-End Object Detection with Transformers
Introduction
Traditional object detection methods rely on multiple hand-designed components, such as anchor generation and non-maximum suppression, which explicitly encode prior knowledge about the task. DETR (DEtection TRansformer) revolutionizes this approach by treating object detection as a direct set prediction problem, eliminating the need for these manual steps.
DETR leverages the global reasoning capability of attention mechanisms to detect large objects effectively. However, it struggles with small objects, where capturing finer local details is essential.
Architecture
DETR processes an image through a CNN backbone to extract a feature map, which is then flattened into a sequence of feature vectors. Since Transformers are permutation-invariant and do not inherently understand spatial relationships, positional encodings composed of sinusoids at different frequencies are added to retain 2D structure. The encoded features pass through a Transformer Encoder, and a fixed set of object queries is fed into the Transformer Decoder, which attends to the encoder output. Finally, a feed-forward network (FFN) predicts object classes and bounding box coordinates, and the Hungarian algorithm matches predictions to ground-truth boxes during training.
CNN Backbone
DETR uses a ResNet backbone to extract a feature map with 2048 channels from the input image, reducing its resolution to 1/32 of the original height and width. These features are then flattened into the sequence of embeddings that serves as input to the Transformer.
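To make the shape bookkeeping concrete, here is a minimal sketch using torchvision's ResNet-50; truncating the network via children() is one common way to drop the classification head, not necessarily how the reference implementation does it:

```python
import torch
from torch import nn
from torchvision.models import resnet50

# Keep everything up to the last residual stage; drop the avgpool and fc head.
backbone = nn.Sequential(*list(resnet50(weights="IMAGENET1K_V1").children())[:-2])

x = torch.randn(1, 3, 800, 1066)   # one image, shortest side 800
features = backbone(x)             # (1, 2048, 25, 34): 2048 channels at ~1/32 resolution
```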
Transformer Encoder
To process the high-level image features, DETR first applies a 1 x 1 convolution, reducing the channel dimension from C = 2048 to a smaller dimension d. Since the Transformer Encoder expects a sequence, the spatial dimensions are then flattened, yielding a d x HW feature map. Each encoder layer follows the standard Transformer architecture, consisting of multi-head self-attention and a feed-forward network (FFN). Because Transformers are permutation-invariant, fixed positional encodings are added at every attention layer to retain spatial structure.
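A sketch of the projection-and-flatten step feeding a standard PyTorch encoder. The sinusoidal encoding below is a simplified 1D stand-in for DETR's 2D row/column encoding, and the layer sizes follow the paper's defaults (d = 256, 8 heads, 6 layers):

```python
import math
import torch
from torch import nn

d_model = 256
input_proj = nn.Conv2d(2048, d_model, kernel_size=1)          # the 1 x 1 conv: C -> d
encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, dim_feedforward=2048)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

f = torch.randn(1, 2048, 25, 34)                   # backbone output (B, C, H, W)
src = input_proj(f).flatten(2).permute(2, 0, 1)    # (HW, B, d): the flattened sequence

# Fixed sinusoidal positional encoding over the flattened positions. DETR's actual
# encoding is 2D (separate sinusoids for rows and columns); this 1D version is a
# simplification.
position = torch.arange(src.size(0), dtype=torch.float32).unsqueeze(1)   # (HW, 1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
pe = torch.zeros(src.size(0), 1, d_model)
pe[:, 0, 0::2] = torch.sin(position * div_term)
pe[:, 0, 1::2] = torch.cos(position * div_term)

memory = encoder(src + pe)                         # (HW, B, d): encoded features
```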
Transformer Decoder
In DETR, N object queries are learned during training and remain fixed at inference. With N = 100, the model can detect at most 100 objects per image. These queries serve as input embeddings and are transformed into output embeddings by the decoder. Since Transformers are permutation-invariant, the object queries are themselves learned positional encodings, ensuring distinct predictions.
Each decoder layer applies self-attention and encoder-decoder attention, enabling DETR to globally reason about all objects simultaneously, capturing pairwise relationships while leveraging the entire image as context.
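A minimal decoder-side sketch with learned object queries. Note one simplification: DETR feeds zeros as the initial decoder input and adds the query embeddings inside every attention layer, whereas here the queries are passed once as the target sequence:

```python
import torch
from torch import nn

d_model, num_queries = 256, 100
query_embed = nn.Embedding(num_queries, d_model)    # N learned object queries
decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, dim_feedforward=2048)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)

memory = torch.randn(850, 1, d_model)               # encoder output (HW, B, d)
tgt = query_embed.weight.unsqueeze(1)               # (N, B=1, d): same queries for every image
hs = decoder(tgt, memory)                           # (N, B, d): one output embedding per slot
```

Because no causal mask is applied, all N object slots are decoded in parallel, matching DETR's non-autoregressive decoding.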
Feed Forward Network
DETR's final predictions are generated by a 3-layer perceptron with ReLU activations and hidden dimension d, followed by a linear projection layer. The FFN outputs the normalized center coordinates, height, and width of the bounding box relative to the input image, while a softmax over the linear projection assigns a class label.
Given that DETR predicts a fixed-size set of N bounding boxes — often exceeding the actual number of objects in an image — an extra special class label (∅) is included to indicate slots where no object is detected. This label functions similarly to the “background” class in traditional object detection methods.
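A sketch of both prediction heads, with the extra logit reserved for the ∅ label; the class count of 91 follows COCO's category id space and is only illustrative:

```python
import torch
from torch import nn

d_model, num_classes = 256, 91          # 91 follows COCO's category id space (illustrative)

class BoxMLP(nn.Module):
    """3-layer perceptron with ReLU; a sigmoid keeps (cx, cy, w, h) normalized in [0, 1]."""
    def __init__(self, d):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, d), nn.ReLU(),
            nn.Linear(d, 4),
        )

    def forward(self, x):
        return self.layers(x).sigmoid()

class_head = nn.Linear(d_model, num_classes + 1)   # extra logit for the "no object" (∅) slot
bbox_head = BoxMLP(d_model)

hs = torch.randn(100, 1, d_model)                  # decoder output (N, B, d)
class_logits = class_head(hs)                      # softmax over these yields class probabilities
boxes = bbox_head(hs)                              # normalized (cx, cy, w, h) per query
```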
Auxiliary decoding losses
To improve training stability and help DETR predict the correct number of objects per class, auxiliary losses are added after each decoder layer. Specifically, prediction FFNs and Hungarian Loss are applied at every decoder layer, with all FFNs sharing parameters for consistency. Additionally, a shared layer normalization is used to standardize inputs across different decoder layers, ensuring consistent learning and better convergence.
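A rough sketch of how the shared heads and shared layer norm would be applied per layer. Collecting intermediate decoder activations is an assumption (nn.TransformerDecoder does not expose them, so the reference code uses a custom decoder loop), and hungarian_loss is a placeholder for the loss described in the next section:

```python
import torch
from torch import nn

d_model = 256
shared_norm = nn.LayerNorm(d_model)        # shared layer norm across decoder layers
class_head = nn.Linear(d_model, 92)        # shared prediction heads (simplified)
bbox_head = nn.Linear(d_model, 4)          # stand-in for the 3-layer box MLP

def hungarian_loss(class_logits, boxes):
    # Placeholder for the matching-based loss described in the Loss Function section.
    return class_logits.mean() + boxes.mean()

# Stand-in per-layer activations; the real model would return these from the decoder.
per_layer_outputs = [torch.randn(100, 1, d_model) for _ in range(6)]

total_loss = sum(
    hungarian_loss(class_head(shared_norm(hs)), bbox_head(shared_norm(hs)))
    for hs in per_layer_outputs            # same shared heads and norm at every layer
)
```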
Loss Function
Since DETR produces a fixed set of N predictions per image with N greater than the number of objects per image, it requires a way to match predictions to ground truth objects. This is achieved using the Hungarian algorithm, which finds an optimal bipartite matching between predicted and ground truth objects based on a matching cost.
The matching cost considers both class prediction and bounding box similarity. Predictions are assigned to ground truth objects in a one-to-one manner, unlike traditional detectors that use heuristics for anchor-based assignments. Once matching is determined, the Hungarian Loss is computed as a combination of:
- Negative Log Likelihood for class prediction.
- Bounding Box Regression Loss: a combination of the L1 loss between predicted and ground-truth box coordinates and the Generalized IoU (GIoU) loss, which is scale-invariant.
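In the paper's notation, the matching first searches for the permutation $\hat{\sigma}$ of the $N$ prediction slots with the lowest total matching cost, and the Hungarian Loss is then computed over the matched pairs:

$$\hat{\sigma} = \underset{\sigma \in \mathfrak{S}_N}{\arg\min} \sum_{i=1}^{N} \mathcal{L}_{\text{match}}\big(y_i, \hat{y}_{\sigma(i)}\big)$$

$$\mathcal{L}_{\text{Hungarian}}(y, \hat{y}) = \sum_{i=1}^{N} \Big[ -\log \hat{p}_{\hat{\sigma}(i)}(c_i) + \mathbb{1}_{\{c_i \neq \varnothing\}} \, \mathcal{L}_{\text{box}}\big(b_i, \hat{b}_{\hat{\sigma}(i)}\big) \Big]$$

$$\mathcal{L}_{\text{box}}\big(b_i, \hat{b}_{\sigma(i)}\big) = \lambda_{\text{iou}} \, \mathcal{L}_{\text{iou}}\big(b_i, \hat{b}_{\sigma(i)}\big) + \lambda_{L1} \, \big\lVert b_i - \hat{b}_{\sigma(i)} \big\rVert_1$$

where $\hat{p}_{\sigma(i)}(c_i)$ is the predicted probability of class $c_i$ and $b_i$ is the ground-truth box.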
To handle class imbalance, the log-probability term for empty (∅) slots is down-weighted by a factor of 10, similar to how Faster R-CNN balances positive and negative proposals. Using probabilities instead of log-probabilities in the matching cost ensures consistency with the bounding box loss, leading to better empirical performance.
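A minimal single-image sketch of the matching step using SciPy's Hungarian solver. The GIoU term is omitted for brevity and only the L1 weight is shown (the paper sets $\lambda_{L1} = 5$ and $\lambda_{\text{iou}} = 2$):

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes, l1_weight=5.0):
    """Match N predictions to M ground-truth objects for one image.
    pred_probs: (N, num_classes+1) softmax outputs; pred_boxes: (N, 4);
    gt_labels: (M,) class ids; gt_boxes: (M, 4) normalized (cx, cy, w, h)."""
    cost_class = -pred_probs[:, gt_labels]              # probabilities, not log-probs
    cost_l1 = torch.cdist(pred_boxes, gt_boxes, p=1)    # pairwise L1 box distance
    # The full DETR cost also includes a generalized IoU term (omitted here).
    cost = cost_class + l1_weight * cost_l1             # (N, M) cost matrix
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return pred_idx, gt_idx                             # one-to-one assignment
```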
Training
DETR is trained using the AdamW optimizer with an initial learning rate of $10^{-4}$ for the Transformer and $10^{-5}$ for the backbone, along with a weight decay of $10^{-4}$. The Transformer weights are initialized using Xavier initialization, while the backbone is initialized with an ImageNet-pretrained ResNet model with frozen batch normalization layers. Two backbones are used: ResNet-50 (DETR) and ResNet-101 (DETR-R101).
Data augmentation techniques include scale augmentation (resizing the shortest image side between 480 and 800 pixels, and the longest up to 1333 pixels) and random cropping, which improves performance by approximately 1 AP. The Transformer uses a default dropout of 0.1 during training.
Training occurs over 300 epochs with a learning rate drop by a factor of 10 after 200 epochs. The baseline model is trained on 16 V100 GPUs with a batch size of 64 and takes about three days. A longer 500-epoch schedule with a learning rate drop at 400 epochs adds 1.5 AP compared to the shorter schedule.
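A sketch of this optimization setup with the two learning-rate groups; the stand-in model and the "backbone" parameter-naming convention are assumptions for illustration:

```python
import torch
from torch import nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

# Stand-in model: any module whose backbone parameters are named "backbone.*".
model = nn.ModuleDict({
    "backbone": nn.Conv2d(3, 64, 3),           # placeholder for the ResNet
    "transformer": nn.Linear(256, 256),        # placeholder for encoder/decoder/heads
})

backbone_params = [p for n, p in model.named_parameters() if n.startswith("backbone")]
other_params = [p for n, p in model.named_parameters() if not n.startswith("backbone")]

optimizer = AdamW(
    [{"params": other_params, "lr": 1e-4},     # Transformer: 1e-4
     {"params": backbone_params, "lr": 1e-5}], # backbone: 1e-5
    weight_decay=1e-4,
)
scheduler = StepLR(optimizer, step_size=200, gamma=0.1)  # drop LR 10x at epoch 200
```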
Visualization of Attention Mechanisms
- Object Queries: The visualization below shows the center of each bounding box predicted on the COCO validation set by the first 20 learned object queries. The points are color-coded: green represents small boxes, red indicates large horizontal boxes, and blue corresponds to large vertical boxes. Each slot specializes in specific regions and object sizes. Notably, most slots predict large, image-wide bounding boxes, a pattern frequently observed in the COCO dataset.
- Encoder Self Attention: The encoder self-attention mechanism focuses on a set of reference points, allowing it to effectively separate individual object instances. This enables DETR to distinguish different objects within the image by attending to their unique spatial and contextual features. The predictions shown are made using the baseline DETR model on a validation set image, demonstrating its ability to capture object boundaries and relationships through self-attention.
- Decoder Self Attention: The visualization shows decoder attention for each predicted object using the DETR-DC5 model. The decoder tends to focus on key object extremities such as legs and heads.
DETR for Panoptic Segmentation
Panoptic segmentation is an extension of object detection that classifies every pixel in an image, distinguishing between foreground thing objects (e.g., people, cars) and background stuff (e.g., sky, road). DETR can be adapted for panoptic segmentation by incorporating a segmentation head that predicts binary masks for detected objects. The segmentation head consists of:
- Multi-head attention mechanism: Generates per-instance feature maps.
- FPN-style CNN: Upscales feature maps to full resolution.
- Pixel-wise argmax operation: Assigns each pixel to a detected object.
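A toy sketch of the final merge step, assuming per-slot mask logits have already been upscaled to full resolution:

```python
import torch

# Toy merge step: given one full-resolution mask logit map per slot, assign each
# pixel to the slot with the highest logit. Shapes here are illustrative.
num_slots, H, W = 100, 480, 640
mask_logits = torch.randn(num_slots, H, W)   # per-slot (thing or stuff) mask logits
panoptic_map = mask_logits.argmax(dim=0)     # (H, W): winning slot id per pixel
```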
DETR's panoptic segmentation extends the Hungarian Loss with a mask prediction loss, computed as a combination of the DICE/F-1 loss and the focal loss.
Conclusion
DETR has redefined object detection by leveraging transformers to eliminate the need for traditional hand-crafted components like anchors and non-maximum suppression. By formulating object detection as a set prediction problem, DETR provides a simple yet effective end-to-end framework capable of directly predicting object classes and bounding boxes.
As transformer architectures continue to evolve, DETR remains a foundational model that has set the stage for the next generation of object detection methods.