Object Detection

This package lists contributed object detection models.

Faster R-CNN

class pl_bolts.models.detection.faster_rcnn.faster_rcnn_module.FasterRCNN(learning_rate=0.0001, num_classes=91, backbone=None, fpn=True, pretrained=False, pretrained_backbone=True, trainable_backbone_layers=3, **kwargs)

Bases: pytorch_lightning.

PyTorch Lightning implementation of Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.

Paper authors: Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun

Model implemented by:

Teddy Koker <https://github.com/teddykoker>

During training, the model expects both the input tensors and the targets (a list of dictionaries), each containing:

boxes (FloatTensor[N, 4]): the ground-truth boxes in [x1, y1, x2, y2] format
labels (Int64Tensor[N]): the class label for each ground-truth box
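As a rough illustration of this target format, the sketch below builds the targets for a batch of two images with plain torch tensors. The image sizes, box coordinates, and class indices are made-up values, not taken from any dataset.

```python
import torch

# Two hypothetical RGB images, each shaped [channels, height, width].
images = [torch.rand(3, 300, 400), torch.rand(3, 256, 256)]

# One dictionary per image, in the same order as the images.
targets = [
    {
        # Two ground-truth boxes in [x1, y1, x2, y2] pixel coordinates.
        "boxes": torch.tensor(
            [[10.0, 20.0, 110.0, 220.0], [50.0, 60.0, 150.0, 160.0]]
        ),
        # One int64 class label per box.
        "labels": torch.tensor([1, 7], dtype=torch.int64),
    },
    {
        "boxes": torch.tensor([[30.0, 40.0, 200.0, 180.0]]),
        "labels": torch.tensor([3], dtype=torch.int64),
    },
]

# Sanity checks: boxes are [N, 4] and labels are [N] for every image.
for target in targets:
    n = target["boxes"].shape[0]
    assert target["boxes"].shape == (n, 4)
    assert target["labels"].shape == (n,)
```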

CLI command:

# PascalVOC
python faster_rcnn_module.py --gpus 1 --pretrained True

Parameters

learning_rate (float) – the learning rate
num_classes (int) – number of detection classes (including background)
backbone (Union[str, Module, None]) – Pretrained backbone CNN architecture or torch.nn.Module instance.
fpn (bool) – If True, creates a Feature Pyramid Network on top of ResNet-based CNNs.
pretrained (bool) – if true, returns a model pre-trained on COCO train2017
pretrained_backbone (bool) – if true, returns a model with backbone pre-trained on ImageNet
trainable_backbone_layers (int) – number of trainable ResNet layers starting from final block

RetinaNet

class pl_bolts.models.detection.retinanet.retinanet_module.RetinaNet(learning_rate=0.0001, num_classes=91, backbone=None, fpn=True, pretrained=False, pretrained_backbone=True, trainable_backbone_layers=3, **kwargs)

Bases: pytorch_lightning.

PyTorch Lightning implementation of RetinaNet.

Paper: Focal Loss for Dense Object Detection.

Paper authors: Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, Piotr Dollár

Model implemented by:

Aditya Oke <https://github.com/oke-aditya>

During training, the model expects both the input tensors and the targets (a list of dictionaries), each containing:

boxes (FloatTensor[N, 4]): the ground-truth boxes in [x1, y1, x2, y2] format
labels (Int64Tensor[N]): the class label for each ground-truth box

CLI command:

# PascalVOC using LightningCLI
python retinanet_module.py --trainer.gpus 1 --model.pretrained True

Parameters

learning_rate (float) – the learning rate
num_classes (int) – number of detection classes (including background)
backbone (Optional[str]) – Pretrained backbone CNN architecture.
fpn (bool) – If True, creates a Feature Pyramid Network on top of ResNet-based CNNs.
pretrained (bool) – if true, returns a model pre-trained on COCO train2017
pretrained_backbone (bool) – if true, returns a model with backbone pre-trained on ImageNet
trainable_backbone_layers (int) – number of trainable ResNet layers starting from final block

YOLO

class pl_bolts.models.detection.yolo.yolo_module.YOLO(network, optimizer=torch.optim.SGD, optimizer_params={'lr': 0.001, 'momentum': 0.9, 'weight_decay': 0.0005}, lr_scheduler=<class 'pl_bolts.optimizers.lr_scheduler.LinearWarmupCosineAnnealingLR'>, lr_scheduler_params={'max_epochs': 300, 'warmup_epochs': 1, 'warmup_start_lr': 0.0}, confidence_threshold=0.2, nms_threshold=0.45, max_predictions_per_image=-1)

Bases: pytorch_lightning.

PyTorch Lightning implementation of YOLOv3 and YOLOv4.

YOLOv3 paper: Joseph Redmon and Ali Farhadi

YOLOv4 paper: Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao

Implementation: Seppo Enarvi

The network architecture can be read from a Darknet configuration file using the YOLOConfiguration class, or created by some other means, and provided as a list of PyTorch modules.

The input from the data loader is expected to be a list of images. Each image is a tensor with shape [channels, height, width]. The images from a single batch will be stacked into a single tensor, so the sizes have to match. Different batches can have different image sizes, as long as the size is divisible by the ratio in which the network downsamples the input.
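The divisibility requirement can be checked before batching. The sketch below assumes an overall downsampling ratio of 32, which is typical of YOLOv3 and YOLOv4 configurations. The actual ratio depends on the network that is loaded, and both helper functions are hypothetical rather than part of the library.

```python
def is_valid_input_size(height: int, width: int, stride: int = 32) -> bool:
    """Return True if both dimensions are divisible by the downsampling ratio."""
    return height % stride == 0 and width % stride == 0


def round_up_to_stride(size: int, stride: int = 32) -> int:
    """Round a dimension up to the nearest multiple of the downsampling ratio."""
    return ((size + stride - 1) // stride) * stride


print(is_valid_input_size(416, 416))  # True
print(is_valid_input_size(500, 375))  # False
print(round_up_to_stride(500), round_up_to_stride(375))  # 512 384
```

Images whose sizes fail the check could be resized or padded to the rounded-up dimensions before they are stacked into a batch.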

During training, the model expects both the input tensors and a list of targets. Each target is a dictionary containing:

boxes (FloatTensor[N, 4]): the ground-truth boxes in (x1, y1, x2, y2) format
labels (Int64Tensor[N]): the class label for each ground-truth box

The forward() method returns all predictions from all detection layers in all images in one tensor with shape [images, predictors, classes + 5]. The coordinates are scaled to the input image size. During training, it also returns a dictionary containing the classification, box overlap, and confidence losses.
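To make the last dimension of the forward() output concrete, the sketch below splits one prediction row into its parts. The text above only states that each row holds classes + 5 values. The ordering used here (four box coordinates, then one confidence, then the class scores) is an assumption, and split_prediction is a hypothetical helper written with plain Python lists.

```python
from typing import List, Tuple


def split_prediction(row: List[float]) -> Tuple[List[float], float, List[float]]:
    """Split one row of length num_classes + 5 into box, confidence, class scores."""
    box = row[:4]           # assumed layout: (x1, y1, x2, y2)
    confidence = row[4]     # assumed layout: one confidence value
    class_scores = row[5:]  # assumed layout: one score per class
    return box, confidence, class_scores


# A made-up prediction for a 3-class model: 4 + 1 + 3 = 8 values.
row = [12.0, 30.0, 96.0, 140.0, 0.9, 0.1, 0.7, 0.2]
box, confidence, class_scores = split_prediction(row)

# The predicted label is the index of the highest class score.
best_class = max(range(len(class_scores)), key=class_scores.__getitem__)
print(box, confidence, best_class)  # [12.0, 30.0, 96.0, 140.0] 0.9 1
```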

During inference, the model requires only the input tensors. The infer() method filters and processes the predictions. The processed output includes the following tensors:

boxes (FloatTensor[N, 4]): predicted bounding box (x1, y1, x2, y2) coordinates in image space
scores (FloatTensor[N]): detection confidences
labels (Int64Tensor[N]): the predicted class label for each detection

Weights can be loaded from a Darknet model file using load_darknet_weights().

CLI command:

# PascalVOC
wget https://raw.githubusercontent.com/AlexeyAB/darknet/master/cfg/yolov4-tiny-3l.cfg
python yolo_module.py --config yolov4-tiny-3l.cfg --data_dir . --gpus 8 --batch_size 8

Parameters

network (ModuleList) – A list of network modules. This can be obtained from a Darknet configuration using the get_network() method.
optimizer (Type[Optimizer]) – Which optimizer class to use for training.
optimizer_params (Dict[str, Any]) – Parameters to pass to the optimizer constructor.
lr_scheduler (Type[LRScheduler]) – Which learning rate scheduler class to use for training.
lr_scheduler_params (Dict[str, Any]) – Parameters to pass to the learning rate scheduler constructor.
confidence_threshold (float) – Postprocessing will remove bounding boxes whose confidence score is not higher than this threshold.
nms_threshold (float) – Non-maximum suppression will remove bounding boxes whose IoU with a higher confidence box is higher than this threshold, if the predicted categories are equal.
max_predictions_per_image (int) – If non-negative, keep at most this number of highest-confidence predictions per image.
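The three post-processing parameters interact as follows. The code is a minimal pure-Python sketch of the behaviour described above, not the library's implementation. The greedy suppression order and the point at which the per-image cap is applied (after suppression) are assumptions, and iou and postprocess are hypothetical helpers.

```python
from typing import List, Sequence

Box = Sequence[float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def postprocess(
    boxes: Sequence[Box],
    scores: Sequence[float],
    labels: Sequence[int],
    confidence_threshold: float = 0.2,
    nms_threshold: float = 0.45,
    max_predictions_per_image: int = -1,
) -> List[int]:
    """Return the indices of the kept detections, highest confidence first."""
    # 1. Drop detections whose confidence is not higher than the threshold.
    candidates = [i for i, s in enumerate(scores) if s > confidence_threshold]
    # 2. Visit the remaining detections in order of decreasing confidence.
    candidates.sort(key=lambda i: scores[i], reverse=True)
    kept: List[int] = []
    for i in candidates:
        # Suppress a box that overlaps a kept box of the same category too much.
        suppressed = any(
            labels[i] == labels[j] and iou(boxes[i], boxes[j]) > nms_threshold
            for j in kept
        )
        if not suppressed:
            kept.append(i)
    # 3. Optionally keep only the highest-confidence predictions.
    if max_predictions_per_image >= 0:
        kept = kept[:max_predictions_per_image]
    return kept


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (1, 1, 11, 11), (50, 50, 60, 60), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.6, 0.1]
labels = [0, 0, 1, 0, 0]
print(postprocess(boxes, scores, labels))  # [0, 2, 3]
print(postprocess(boxes, scores, labels, max_predictions_per_image=2))  # [0, 2]
```

In this example, detection 1 is suppressed because it overlaps the higher-confidence detection 0 of the same category, detection 2 survives because its category differs, and detection 4 falls below the confidence threshold.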

configure_optimizers()

Constructs the optimizer and learning rate scheduler.

Return type: Tuple[List, List]

forward(images, targets=None)

Runs a forward pass through the network (all layers listed in self.network), and if training targets are provided, computes the losses from the detection layers.

Detections are concatenated from the detection layers. Each image will produce N * num_anchors * grid_height * grid_width detections, where N depends on the number of detection layers. For one detection layer N = 1, and each additional detection layer increases N by a number that depends on the size of the feature map on that layer. For example, if the feature map is twice as wide and high as the grid, the layer will add four times as many detections.
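The count can be worked through for a concrete case. The sketch below assumes three detection layers with downsampling ratios of 32, 16, and 8 and three anchors per grid cell, as in common YOLOv3 configurations. These values are assumptions about the loaded network, and num_detections is a hypothetical helper.

```python
from typing import Sequence


def num_detections(
    height: int,
    width: int,
    strides: Sequence[int] = (32, 16, 8),
    num_anchors: int = 3,
) -> int:
    """Total number of predictions per image over all detection layers."""
    total = 0
    for stride in strides:
        grid_height = height // stride
        grid_width = width // stride
        total += num_anchors * grid_height * grid_width
    return total


# One detection layer on a 13 x 13 grid: 3 * 13 * 13 = 507, so N = 1.
print(num_detections(416, 416, strides=(32,)))  # 507
# Three layers: 507 + 2028 + 8112 = 10647, so N = 1 + 4 + 16 = 21.
print(num_detections(416, 416))  # 10647
```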

Parameters

images (Tensor) – Images to be processed. Tensor of size [batch_size, num_channels, height, width].
targets (Optional[List[Dict[str, Tensor]]]) – If set, computes losses from detection layers against these targets. A list of dictionaries, one for each image.

Returns

Detections, and if targets were provided, a dictionary of losses. Detections are shaped [batch_size, num_predictors, num_classes + 5], where num_predictors is the total number of cells in all detection layers times the number of boxes predicted by one cell. The predicted box coordinates are in (x1, y1, x2, y2) format and scaled to the input image size.

Return type

detections (Tensor), losses (Dict[str, Tensor])

infer(image)

Feeds an image to the network and returns the detected bounding boxes, confidence scores, and class labels.

Parameters: image (Tensor) – An input image, a tensor of uint8 values sized [channels, height, width].
Returns: A matrix of detected bounding box (x1, y1, x2, y2) coordinates, a vector of confidences for the bounding box detections, and a vector of predicted class labels.
Return type: boxes (Tensor), confidences (Tensor), labels (Tensor)

load_darknet_weights(weight_file)

Loads weights to layer modules from a pretrained Darknet model.

One may want to continue training from the pretrained weights on a dataset with a different number of object categories. The number of kernels in the convolutional layers just before each detection layer depends on the number of output classes. The Darknet solution is to truncate the weight file and stop reading weights at the first incompatible layer. For this reason, when the weight file ends, the function silently leaves the remaining layers unchanged.

Parameters: weight_file – A file object containing model weights in the Darknet binary format.
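The truncation behaviour described for load_darknet_weights() can be illustrated with a toy reader. This is a sketch of the idea only, not the library's parser: it assumes raw little-endian float32 values, ignores any file header, and represents each layer as a plain list. load_truncated_weights is a hypothetical helper.

```python
import io
import struct
from typing import BinaryIO, List


def load_truncated_weights(layers: List[List[float]], weight_file: BinaryIO) -> int:
    """Fill layers in order from float32 values and stop silently at end of file.

    Returns the number of layers that were overwritten.
    """
    loaded = 0
    for weights in layers:
        num_bytes = 4 * len(weights)
        data = weight_file.read(num_bytes)
        if len(data) < num_bytes:
            # The file ended, so this layer and all later ones keep their values.
            break
        weights[:] = struct.unpack(f"<{len(weights)}f", data)
        loaded += 1
    return loaded


# A made-up model with three "layers"; the file holds weights for the first two.
layers = [[0.0, 0.0], [0.0, 0.0, 0.0], [9.0, 9.0]]
weight_file = io.BytesIO(struct.pack("<5f", 1.0, 2.0, 3.0, 4.0, 5.0))
print(load_truncated_weights(layers, weight_file))  # 2
print(layers)  # [[1.0, 2.0], [3.0, 4.0, 5.0], [9.0, 9.0]]
```

The last layer keeps its initial values, which corresponds to leaving the class-dependent layers untouched so that they can be trained for a new set of categories.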

test_step(batch, batch_idx)

Evaluates a batch of data from the test set.

Parameters

batch (Tuple[List[Tensor], List[Dict[str, Tensor]]]) – A tuple of images and targets. Images is a list of 3-dimensional tensors. Targets is a list of dictionaries that contain ground-truth boxes, labels, etc.
batch_idx (int) – The index of this batch.

training_step(batch, batch_idx)

Computes the training loss.

Parameters

batch (Tuple[List[Tensor], List[Dict[str, Tensor]]]) – A tuple of images and targets. Images is a list of 3-dimensional tensors. Targets is a list of dictionaries that contain ground-truth boxes, labels, etc.
batch_idx (int) – The index of this batch.

Return type

Dict[str, Tensor]

Returns

A dictionary that includes the training loss in ‘loss’.
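The batch structure expected by training_step() (a list of images paired with a list of target dictionaries) is commonly produced by a custom collate function. The sketch below is one such function under the assumption that the dataset yields (image, target) pairs. The name collate_fn and the placeholder samples are illustrative only.

```python
from typing import Any, Dict, List, Sequence, Tuple


def collate_fn(
    samples: Sequence[Tuple[Any, Dict[str, Any]]],
) -> Tuple[List[Any], List[Dict[str, Any]]]:
    """Turn a sequence of (image, target) pairs into (images, targets) lists."""
    images, targets = zip(*samples)
    return list(images), list(targets)


# Placeholder strings stand in for image tensors to keep the sketch light.
samples = [
    ("image_0", {"boxes": [[0, 0, 10, 10]], "labels": [1]}),
    ("image_1", {"boxes": [], "labels": []}),
]
images, targets = collate_fn(samples)
print(images)            # ['image_0', 'image_1']
print(len(targets))      # 2
```

A function like this can be passed as the collate_fn argument of torch.utils.data.DataLoader, so that each batch arrives as a tuple of two lists rather than as stacked tensors.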

validation_step(batch, batch_idx)

Evaluates a batch of data from the validation set.

Parameters

batch (Tuple[List[Tensor], List[Dict[str, Tensor]]]) – A tuple of images and targets. Images is a list of 3-dimensional tensors. Targets is a list of dictionaries that contain ground-truth boxes, labels, etc.
batch_idx (int) – The index of this batch.