In computer vision, object detection is an essential problem. Many architectures tackle it: dense (fully connected) networks, CNNs, Transformers, and more. Today, we'll take a deep dive into them.
Transformer Based
The anchor-based approach of YOLO can improve speed, but it needs NMS and other post-processing to filter out noisy detections, which wastes time on tuning. A Transformer architecture instead allows end-to-end training and inference.
Architecture
Output:
The classification logits for each query, which predict the probability of each class for the detected objects, including a "no-object" class
The coordinates of the predicted bounding boxes normalized to the size of the image
aux_outputs: Additional outputs from each transformer decoder layer, used when auxiliary losses are activated
MLP (Multi-Layer Perceptron): A simple fully connected network used within the DETR model to process features (e.g., to compute bounding box coordinates).
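The bounding-box head can be sketched as a small MLP with a sigmoid output so the coordinates come out normalized. A minimal NumPy version (the layer sizes here are illustrative, not DETR's actual configuration):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_bbox_head(x, weights):
    """Tiny MLP: hidden ReLU layers, sigmoid output -> normalized (cx, cy, w, h)."""
    *hidden, last = weights
    for W, b in hidden:
        x = relu(x @ W + b)
    W, b = last
    return sigmoid(x @ W + b)  # each coordinate lands in [0, 1]

rng = np.random.default_rng(0)
d = 8  # illustrative feature dim; DETR uses a larger hidden size
weights = [(rng.normal(size=(d, d)), np.zeros(d)),
           (rng.normal(size=(d, d)), np.zeros(d)),
           (rng.normal(size=(d, 4)), np.zeros(4))]
boxes = mlp_bbox_head(rng.normal(size=(3, d)), weights)  # 3 queries -> 3 boxes
print(boxes.shape)  # (3, 4)
```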
QKV:
Q: learned object-query embeddings that represent the objects to detect.
K and V: image features projected by linear layers; the keys capture the correlation between each query and positions in the image.
V provides the information used to update the query state across the decoder layers of the transformer.
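The Q/K/V interaction above can be sketched with plain scaled dot-product attention: object queries attend over projected image features, and the attention-weighted values become the updated query states. Dimensions here are illustrative, not DETR's real ones:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
d = 16            # embedding dim (illustrative)
n_queries = 5     # object queries (DETR uses e.g. 100)
n_pixels = 49     # flattened feature-map positions

queries = rng.normal(size=(n_queries, d))   # Q: learned object-query embeddings
feats = rng.normal(size=(n_pixels, d))      # encoder output per image position
Wk, Wv = rng.normal(size=(d, d)), rng.normal(size=(d, d))

K = feats @ Wk    # keys: where in the image each query should look
V = feats @ Wv    # values: the information used to update each query
attn = softmax(queries @ K.T / np.sqrt(d))  # (n_queries, n_pixels)
updated_queries = attn @ V                  # new query states
print(updated_queries.shape)  # (5, 16)
```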
Metrics/Loss
SetCriterion: This class calculates several losses for training the DETR model. It uses:
A Hungarian matcher to associate predicted boxes and classes with ground truth boxes and classes.
Loss calculations for classification (loss_ce), bounding box regression (loss_bbox), GIoU (loss_giou), and optionally, mask losses (loss_mask, loss_dice) if segmentation masks are being predicted.
Cardinality loss (loss_cardinality) which measures the error in the number of predicted objects versus the actual number of objects.
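The matching step can be sketched as finding the prediction-to-target assignment with minimal total cost. The brute force below stands in for `scipy.optimize.linear_sum_assignment`, which DETR actually uses; the cost values are made up for illustration:

```python
from itertools import permutations

def hungarian_match(cost):
    """Find the assignment of predictions to targets with minimal total cost.
    Brute force over permutations; DETR uses scipy's linear_sum_assignment."""
    n = len(cost)  # assumes a square cost matrix: n predictions, n targets
    best, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best:
            best, best_perm = total, perm
    return best_perm, best

# cost[i][j]: classification + L1-box + GIoU cost of matching prediction i to target j
cost = [[0.9, 0.1, 0.8],
        [0.2, 0.7, 0.6],
        [0.5, 0.9, 0.1]]
perm, total = hungarian_match(cost)
print(perm)  # (1, 0, 2): prediction 0 -> target 1, 1 -> 0, 2 -> 2
```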
CNN
Architecture
Uses the convolution technique, scanning the image with kernels to learn features.
In YOLO, an anchor-based technique stabilizes learning: the model classifies each anchor box and regresses it to maximize IoU with the ground truth (i.e., minimize the IoU loss).
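The anchor-to-ground-truth overlap is measured with IoU. A minimal implementation for axis-aligned boxes in (x1, y1, x2, y2) format:

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1/7 ≈ 0.1428...
```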
Bottle-neck block
Spatial Pyramid Pooling - Fast
SiLU
Batch Normalization
Max Pooling layers
Assorted hyperparameters
IoU thresholds and loss functions
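The SPPF block listed above chains three max pools and concatenates their outputs with the input. A toy single-channel NumPy sketch (real SPPF runs per channel and fuses the result with a 1×1 conv):

```python
import numpy as np

def maxpool_same(x, k):
    """Stride-1 max pooling with 'same' padding over an (H, W) feature map."""
    p = k // 2
    padded = np.pad(x, p, mode="constant", constant_values=-np.inf)
    h, w = x.shape
    out = np.empty_like(x)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].max()
    return out

def sppf(x, k=5):
    """SPPF: three chained max pools, concatenated with the input."""
    p1 = maxpool_same(x, k)
    p2 = maxpool_same(p1, k)   # chaining two 5x5 pools ~ one 9x9 pool, but cheaper
    p3 = maxpool_same(p2, k)
    return np.stack([x, p1, p2, p3])  # 4x the channels before the fusing 1x1 conv

feat = np.random.default_rng(2).normal(size=(8, 8))
print(sppf(feat).shape)  # (4, 8, 8)
```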
Attention Mechanism
Data Augmentation
Mosaic and MixUp augmentation are used to avoid overfitting and improve accuracy on the test set.
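Mosaic can be sketched as pasting four images into the quadrants of one canvas. This is a simplified version: images are assumed pre-resized to a common size, while the real augmentation also jitters the center point and remaps the bounding boxes:

```python
import numpy as np

def mosaic(imgs, size):
    """Mosaic augmentation (simplified): paste 4 images of shape
    (size, size, 3) into the quadrants of a 2x-sized canvas."""
    canvas = np.zeros((2 * size, 2 * size, 3), dtype=imgs[0].dtype)
    corners = [(0, 0), (0, size), (size, 0), (size, size)]
    for img, (y, x) in zip(imgs, corners):
        canvas[y:y + size, x:x + size] = img
    return canvas

rng = np.random.default_rng(3)
imgs = [rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8) for _ in range(4)]
print(mosaic(imgs, 32).shape)  # (64, 64, 3)
```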
Loss
Varifocal loss
BCE loss
CE loss
Bbox loss
RotatedBboxLoss
KeypointLoss
Initialization techniques
Weight initialization sets the bias of the detection conv to -1, so that most anchor outputs overlap the ground truth from the start.
The bias of the classification conv is set to -3 for accuracy from the start: everything is initially predicted as background, and the model mostly learns from the positive samples.
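The effect of those bias values is easy to see through the sigmoid: a bias of -3 makes every anchor start out predicting "background" with low confidence, so early gradients come mostly from the positive (object) anchors:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Initial outputs implied by the bias initialization above:
print(round(sigmoid(-3), 3))  # 0.047 -> initial class confidence (near background)
print(round(sigmoid(-1), 3))  # 0.269 -> initial detection-branch output
```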
Mask R-CNN/Faster R-CNN
Architecture
DETR and Faster R-CNN have a similar number of parameters. Faster R-CNN achieves better mAP than DETR on small objects, while DETR performs better on normal-sized objects.
The "R" stands for the Region Proposal Network (RPN), which proposes candidate regions that are then classified and refined by box regression.
The mask branch is a segmentation head fed from RoIPool features. We fine-tune the mask branch in a first phase, and fine-tune object detection after that.
Using Feature Pyramid Networks (FPNs).
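The FPN top-down pathway can be sketched as: start from the coarsest backbone level, upsample ×2, and add the lateral feature at each finer level. The 1×1 lateral convs and 3×3 output convs are omitted in this single-channel toy:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbor 2x upsampling of an (H, W) map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def fpn_top_down(features):
    """FPN top-down pathway: coarsest level first, then upsample-and-add
    each finer lateral feature (1x1 lateral convs omitted here)."""
    features = list(features)        # ordered finest -> coarsest
    out = [features[-1]]
    for lateral in reversed(features[:-1]):
        out.append(lateral + upsample2x(out[-1]))
    return out[::-1]                 # finest -> coarsest again

rng = np.random.default_rng(4)
c3, c4, c5 = (rng.normal(size=(s, s)) for s in (16, 8, 4))
p3, p4, p5 = fpn_top_down([c3, c4, c5])
print(p3.shape, p4.shape, p5.shape)  # (16, 16) (8, 8) (4, 4)
```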
Loss
Classification Loss: BCE or CE (multi-label or single-label).
Mask Loss: Binary Cross-Entropy for every pixel.
BBox Loss: Smooth L1 Loss (regression loss)
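Smooth L1 behaves quadratically near zero (for stable gradients on small errors) and linearly for large errors (to limit the influence of outliers). A minimal scalar version:

```python
def smooth_l1(pred, target, beta=1.0):
    """Smooth L1: quadratic for |error| < beta, linear beyond it."""
    d = abs(pred - target)
    return 0.5 * d * d / beta if d < beta else d - 0.5 * beta

print(smooth_l1(0.5, 0.0))  # 0.125 (quadratic region)
print(smooth_l1(3.0, 0.0))  # 2.5   (linear region)
```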
Originally published May 21, 2021
Latest update April 22, 2021