
Computer Vision Assignment Help | EECS 442 Problem Set 8: Representation Learning

This is a Python computer vision problem set from a US course.

The starter code can be found at:

https://colab.research.google.com/drive/1C3B4Wf6Wqlp7FMr7fVL6ORaeQXcpeSjA?usp=sharing

We recommend editing and running your code in Google Colab, although you are welcome to
use your local machine instead.

Problem 8.1 Autoencoders (5 pts)

We’ll start by implementing a simple self-supervised learning method: an autoencoder. The
autoencoder is composed of an encoder and a decoder. The encoder often compresses the
original data with a funnel-like architecture, i.e., it throws away redundant information by
reducing the layer sizes gradually. The final output size of the encoder is a bottleneck that is
much smaller than the size of the original data. The decoder will use this limited amount of
information to reconstruct the original data. If the reconstruction is successful, the encoder
has arguably captured a useful, concise representation of the original data.

Such representations could help with downstream tasks such as object recognition, semantic
segmentation, etc. Here, to test the usefulness of the representation, we’ll train the encoders
on the STL-10 dataset, which is designed to evaluate unsupervised learning algorithms. This
dataset contains 100,000 unlabeled images, 5,000 labeled training images, and 8,000 labeled test
images. To keep training time short, we’ll use 10,000 unlabeled images to learn representations.

We’ll then use the feature representation that we learned to train an object recognition model
(a simple linear classifier) on the 5,000 labeled training images. If the learned representations
are useful, we should obtain a performance improvement over only using the small, labeled
training set.
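
For reference, STL-10 can be loaded through torchvision. Below is a minimal sketch of fetching the splits described above; the batch size and the bare ToTensor transform are our own assumptions, since the starter notebook defines its own data pipeline.

```python
import torch
from torchvision import datasets, transforms

transform = transforms.ToTensor()  # STL-10 images are 96x96 RGB

# 100,000 unlabeled images; we subsample 10,000 to keep pretraining short.
unlabeled = datasets.STL10(root='./data', split='unlabeled', download=True,
                           transform=transform)
pretrain_set = torch.utils.data.Subset(unlabeled, range(10000))

train_set = datasets.STL10(root='./data', split='train', download=True,
                           transform=transform)  # 5,000 labeled images
test_set = datasets.STL10(root='./data', split='test', download=True,
                          transform=transform)   # 8,000 labeled images

pretrain_loader = torch.utils.data.DataLoader(pretrain_set, batch_size=128,
                                              shuffle=True)
```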

1. We will build a small convolutional autoencoder and train it on the STL-10 dataset. The conv layers in the autoencoder all have kernel size = 4×4, stride = 2, padding = 1 (a sketch of one possible layer stack follows this list).

2. With the trained autoencoder, we freeze the parameters of the encoder and train a linear classifier on the autoencoder representations, i.e., the output of the encoder. You will compare the accuracy of this linear classifier with that of two others: one trained jointly with the encoder, and one trained on top of a randomly initialized encoder. Confirm that the unsupervised pretraining improves classification accuracy compared to the random baseline. Method I should achieve about 30% accuracy on the test set, Method II should achieve above 40% accuracy, and Method III should perform the worst of the three. (A sketch of the freezing setup appears after the function list below.)

List of functions/classes to implement:

1. class Encoder (1 pt)
2. class Decoder (1 pt)
3. def train_ae (1 pt)
4. def train_classifier, and set the supervised parameter for the three methods (1 pt)
5. Report results in the Report results section at the end of the notebook (1 pt)
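
As a rough illustration of how the supervised parameter might distinguish the three methods (the helper below is hypothetical; only the supervised flag is from the notebook):

```python
import torch
import torch.nn as nn

def build_linear_probe(encoder, feat_dim=128 * 6 * 6, num_classes=10,
                       supervised=False):
    """Hypothetical helper: attach a linear head to an encoder.

    supervised=False freezes the encoder so only the linear head is
    trained (used with both the pretrained and the random encoder);
    supervised=True lets gradients flow into the encoder (joint training).
    """
    if not supervised:
        for p in encoder.parameters():
            p.requires_grad = False  # freeze encoder weights
    model = nn.Sequential(encoder, nn.Flatten(),
                          nn.Linear(feat_dim, num_classes))
    # Optimize only the parameters that still require gradients.
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=1e-3)
    return model, optimizer
```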

Problem 8.2 Contrastive Multiview Coding (5 pts)

We covered contrastive learning (CL) in lecture 15. CL is a self-supervised learning approach [1, 2, 3] that avoids the need to explicitly generate images. Here, we’ll implement a
recent contrastive learning method, Contrastive Multiview Coding (CMC) [2]. We’ll learn a

vector representation for images: in this representation, two artificially corrupted versions of the same image should have a large dot product, while two different images should have a small one. In CMC, these corruptions are views of an image that contain
complementary information. For example, in this problem set, our views will be luminance (i.e.
grayscale intensity) and chromaticity (i.e. color) in the Lab color space. A good representation
should create similar vectors for these two views (i.e. that have a large dot product), and they
should therefore contain the information that is shared between the views. We’ll minimize the
loss:

$$\mathcal{L}_{\text{contrast}} = -\,\mathbb{E}\left[\log \frac{h_\theta(v_1, v_2)}{h_\theta(v_1, v_2) + \sum_{j=1}^{k} h_\theta(v_1, v_2^{j})}\right]$$

where $v_1$ and $v_2$ are two different views of the data and $k$ is the number of negative samples.
The function hθ measures the similarity between the representations of the two views, and is
implemented using a neural network:

$$h_\theta(v_1, v_2) = \exp\left(\frac{f_{\theta_1}(v_1) \cdot f_{\theta_2}(v_2)}{\lVert f_{\theta_1}(v_1) \rVert \,\lVert f_{\theta_2}(v_2) \rVert \cdot \tau}\right)$$

where $f_{\theta_1}$ and $f_{\theta_2}$ are encoders for extracting representations from view 1 and view 2, respectively.
The constant $\tau$ is a temperature hyperparameter that controls the range of the values being exponentiated.
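
To see how this objective translates to code, here is a minimal sketch of the contrastive loss in PyTorch, treating the other items in a batch as the k negatives (so k = N − 1). The function name and the in-batch negative sampling are our own assumptions, not the notebook's required interface.

```python
import torch
import torch.nn.functional as F

def cmc_loss(z1, z2, tau=0.07):
    """z1, z2: (N, D) features of views 1 and 2 from the two encoders."""
    z1 = F.normalize(z1, dim=1)  # normalize so dot products are cosine similarities
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau   # (N, N); exp(logits) corresponds to h_theta
    labels = torch.arange(z1.size(0), device=z1.device)
    # cross_entropy computes -log(exp(positive) / sum(exp(all))), i.e. the
    # contrastive loss with the matching pair as positive, the rest as negatives.
    return F.cross_entropy(logits, labels)
```

In practice the loss is often symmetrized by averaging cmc_loss(z1, z2) and cmc_loss(z2, z1), which matches the two-view setup described above.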
