The starter code can be found at:
We recommend editing and running your code in Google Colab, although you are welcome to use your local machine instead.
We’ll start by implementing a simple self-supervised learning method: an autoencoder. The autoencoder is composed of an encoder and a decoder. The encoder often compresses the original data with a funnel-like architecture, i.e., it throws away redundant information by reducing the layer sizes gradually. The final output size of the encoder is a bottleneck that is much smaller than the size of the original data. The decoder will use this limited amount of information to reconstruct the original data. If the reconstruction is successful, the encoder has arguably captured a useful, concise representation of the original data.
Such representations could help with downstream tasks such as object recognition and semantic segmentation. Here, to test the usefulness of the representation, we’ll train the encoders on the STL-10 dataset, which is designed to evaluate unsupervised learning algorithms. This dataset contains 100,000 unlabeled images, 5,000 labeled training images, and 8,000 labeled test images. To keep training time short, we’ll use 10,000 unlabeled images to learn representations.
We’ll then use the learned feature representation to train an object recognition model (a simple linear classifier) on the 5,000 labeled training images. If the learned representations are useful, we should obtain a performance improvement over using only the small labeled training set.
1. We will build a small convolutional autoencoder and train it on the STL-10 dataset. The conv layers in the autoencoder all have kernel size = 4×4, stride = 2, padding = 1.
2. With the trained autoencoder, we freeze the parameters of the encoder and train a linear classifier on the autoencoder representations, i.e., the output of the encoder (Method I). You will compare the accuracy of this linear classifier with two other linear classifiers: one trained together with the encoder (Method II), and one trained on top of a randomly initialized encoder (Method III). Confirm that the unsupervised pretraining improves the classification accuracy compared to the random baseline. Method I should achieve about 30% accuracy on the test set. Method II should achieve above 40% accuracy. Method III should perform the worst of the three.
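The layer settings above (kernel size 4, stride 2, padding 1) halve the spatial resolution at every convolution, and a transposed convolution with the same settings doubles it. The following is a minimal sketch of that arithmetic in plain Python. The 96×96 input size (the native STL-10 resolution) and the choice of four layers per side are assumptions for illustration, not requirements of the starter code, and the helper names `conv_out` and `deconv_out` are hypothetical.

```python
def conv_out(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a strided convolution (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


# Assumption: 96x96 inputs and four layers in both the encoder and the decoder.
encoder_sizes = [96]
for _ in range(4):
    encoder_sizes.append(conv_out(encoder_sizes[-1]))

decoder_sizes = [encoder_sizes[-1]]
for _ in range(4):
    decoder_sizes.append(deconv_out(decoder_sizes[-1]))

print(encoder_sizes)  # [96, 48, 24, 12, 6]
print(decoder_sizes)  # [6, 12, 24, 48, 96]
```

Checking the sizes this way before writing the Encoder and Decoder classes makes it easier to confirm that the decoder output matches the input resolution.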
List of functions/classes to implement:
1. class Encoder (1 pt)
2. class Decoder (1 pt)
3. def train_ae (1 pt)
4. def train_classifier, and set the supervised parameter in the three methods (1 pt)
5. Report results in the Report results section at the end of the notebook (1 pt)
We covered contrastive learning (CL) in lecture 15. CL is an approach to self-supervised learning [1, 2, 3] that avoids the need to explicitly generate images. Here, we’ll implement a recent contrastive learning method, Contrastive Multiview Coding (CMC). We’ll learn a vector representation for images: in this representation, two artificially corrupted versions of the same image should have a large dot product, while corrupted versions of two different images should have a small dot product. In CMC, these corruptions are views of an image that contain complementary information. For example, in this problem set, our views will be luminance (i.e., grayscale intensity) and chromaticity (i.e., color) in the Lab color space. A good representation should produce similar vectors for these two views (i.e., vectors with a large dot product), and these vectors should therefore capture the information that is shared between the views. We’ll minimize the loss:
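$$\mathcal{L}_{\text{contrast}} = -\mathbb{E}\left[\log \frac{h_\theta(\{v_1, v_2\})}{h_\theta(\{v_1, v_2\}) + \sum_{j=1}^{k} h_\theta(\{v_1, v_2^{j}\})}\right]$$
where $v_1$ and $v_2$ are two different views of the data, the $v_2^{j}$ are views of other images that serve as negative samples, and $k$ is the number of negative samples.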
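The function $h_\theta$ measures the similarity between the representations of the two views, and is implemented using a neural network:
$$h_\theta(\{v_1, v_2\}) = \exp\left(\frac{f_{\theta_1}(v_1) \cdot f_{\theta_2}(v_2)}{\lVert f_{\theta_1}(v_1) \rVert \, \lVert f_{\theta_2}(v_2) \rVert} \cdot \frac{1}{\tau}\right)$$
where $f_{\theta_1}$ and $f_{\theta_2}$ are encoders for extracting representations from view 1 and view 2, respectively.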
The constant τ is the temperature hyperparameter for controlling the range of the numbers that are exponentiated.
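To make the roles of the similarity score, the negatives, and the temperature concrete, here is a minimal plain-Python sketch of a contrastive loss for a single anchor. It assumes the score is the exponentiated cosine similarity divided by τ and that the loss is the negative log of the positive score over the sum of the positive and k negative scores. The function names, the toy vectors, and the default `tau=0.07` are illustrative assumptions, not part of the starter code.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product of the two vectors after normalization."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, tau=0.07):
    """-log( h(anchor, positive) / sum of h(anchor, c) over positive and negatives )."""
    # Work with log-scores (cosine / tau) instead of exp(...) to avoid overflow.
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, neg) / tau for neg in negatives]
    # Numerically stable log-sum-exp over the k + 1 candidates.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_denominator - logits[0]


# Toy embeddings (assumed): view-1 anchor, its matching view-2 vector, two negatives.
anchor = [1.0, 0.0]
matching = [0.9, 0.1]
others = [[0.0, 1.0], [-1.0, 0.0]]

aligned = contrastive_loss(anchor, matching, others)
mismatched = contrastive_loss(anchor, others[0], [matching, others[1]])
print(aligned < mismatched)  # True
```

A smaller τ sharpens the distribution over candidates, so the loss penalizes a hard negative more strongly. When every candidate is equally similar to the anchor, the loss reduces to log(k + 1).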