The starter code can be found at:
We recommend editing and running your code in Google Colab, although you are welcome to
use your local machine instead.
Problem 7.1 Object Detection
In this problem set, we will implement a single-stage object detector based on YOLO v1.
Unlike the more performant R-CNN models, single-stage detectors simply predict bounding
boxes and classes without explicitly cropping region proposals out of the image or feature map.
This makes them significantly faster to run, and simpler to implement.
We’ve given you the code for the object detection system, but we’ve left a few key functions
unimplemented. Your task is to 1) understand the model and code and 2) fill in these missing
pieces. Consequently, this problem set will require you to read significantly more code than in
previous problem sets. However, the amount of code you actually write will be comparable to
prior problem sets.
We’ll train and evaluate our detector on the PASCAL VOC dataset, a standard dataset for
object detection tasks. The full dataset contains a total of 11K training/validation images
with 27K labeled objects, spanning 20 classes (Figure 1).
Below, we outline the steps in the object detection pipeline and the modules that you will be
implementing. The instructions here are not exhaustive, and you should refer to the comments
in the provided notebook for further implementation details. Also, we encourage you not to
jump straight to the part of the notebook that requires you to write code. Instead, we
recommend first reading the comments in the code we’ve provided and working out the
purpose of each function from its inputs and outputs.
(a) We will use MobileNetV2 as our backbone network for image feature extraction. This
is a simple convolutional network intended for efficient computation. To speed up training,
we’ve pretrained the network to solve ImageNet classification. This is already implemented for
you. (0 points)
(b) After passing the input image through the backbone network, we have a convolutional
feature map of shape (D, 7, 7), which we interpret as a 7×7 grid of D-dimensional features. At
each cell in this grid, we’ll predict a set of A bounding boxes.
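To make the grid interpretation concrete, the following minimal sketch enumerates the cell centers of a 7×7 grid in pure Python. It assumes coordinates are measured in grid-cell units, with the center of the cell at (row, col) located at (col + 0.5, row + 0.5). The function name `grid_centers` and the value A = 2 are illustrative only, and the notebook's own convention may differ.

```python
def grid_centers(height=7, width=7):
    """Return the (x, y) center of every grid cell in row-major order.

    Assumption: coordinates are in grid-cell units, so the top-left cell
    spans [0, 1] x [0, 1] and has its center at (0.5, 0.5).
    """
    return [(col + 0.5, row + 0.5)
            for row in range(height)
            for col in range(width)]


centers = grid_centers()

# Hypothetical number of boxes predicted per cell; the notebook sets the real value.
A = 2
num_boxes = len(centers) * A  # one set of A boxes for each of the 49 cells
```

With these assumed values, the detector would emit 7 × 7 × A boxes per image, each of which is later scored and filtered.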
The format of these bounding boxes is as follows: consider a grid cell with center $(x_c^g, y_c^g)$. The
prediction network, which you will fill in later, predicts offsets $(t_x, t_y, t_w, t_h)$ with respect
to this center. Applying this transformation gives the bounding box, or proposal, with
center, width, and height $(x_c^p, y_c^p, w^p, h^p)$. To convert the offsets to the actual bounding
box parameters, read the instructions in the notebook to implement the GenerateProposal
function. (2 points)
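As a rough illustration of what such a conversion can look like, the sketch below applies offsets to a single grid-cell center in pure Python. The parameterization is an assumption made for this example: the center offsets are added to the grid-cell center, and the predicted width and height are used directly, all in grid-cell units. The notebook defines the transformation you should actually implement in GenerateProposal, and it may differ. The helper names `apply_offsets` and `to_corners` are hypothetical.

```python
def apply_offsets(center, offsets):
    """Turn one grid-cell center and one offset tuple into a proposal.

    Assumed parameterization (check the notebook for the real one):
      proposal center = grid center + (t_x, t_y)
      proposal size   = (t_w, t_h)
    """
    x_g, y_g = center
    t_x, t_y, t_w, t_h = offsets
    return (x_g + t_x, y_g + t_y, t_w, t_h)


def to_corners(box):
    """Convert (x_center, y_center, width, height) to (x1, y1, x2, y2)."""
    x_c, y_c, w, h = box
    return (x_c - w / 2, y_c - h / 2, x_c + w / 2, y_c + h / 2)


# Example: the center cell of a 7x7 grid, with made-up offsets.
proposal = apply_offsets((3.5, 3.5), (0.25, -0.5, 2.0, 4.0))
corners = to_corners(proposal)
```

The corner form is convenient later in the pipeline, since overlap measures such as intersection over union are usually computed from box corners rather than from centers and sizes.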