Machine Learning for Visual Understanding (Old Version)

Summary

This course covers mathematical modeling and machine learning techniques for analyzing visual (and other multimedia) data. Specifically, it focuses on fundamental machine learning and recent deep learning methods that are widely used in visual data analysis, and discusses how these methods are applied to solve various problems with visual data. The course consists of lectures, practice sessions, and a team project. Topics include:
- Review of machine learning and neural networks
- Convolutional neural networks (CNNs)
- Recurrent neural networks (RNNs)
- Image problems (image classification, object detection, segmentation)
- Video problems (video classification, action recognition, temporal localization, tracking)
- Multi-modal data analysis (visual-audio-text)
- Generative modeling, and more

Note that this course was refreshed in 2023. Students are encouraged to use the new version unless they need the English lecture recordings.

Logistics

Textbooks:
- P: "Probabilistic Machine Learning: An Introduction (2nd Ed.)" by Kevin Murphy, 2021, MIT Press. (Official online copy)
- D: "Deep Learning" by Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016, MIT Press. (Official online copy)
- Additional reading materials and papers are linked below.
Prerequisites:
- Intermediate+ Python programming: you should be able to turn your ideas into working Python code.
- Machine learning basics: you have taken this course or an equivalent.
- Basic calculus, linear algebra, data structures, and algorithms
Note: Lecture recordings may be set to private during the semester. Please contact the instructor to request access.

Lectures

Topic	Korean	English	Reading
1. Course Introduction: About the course; Introduction to computer vision			D 1.2; Blog (1, 2)
2. First Approaches for Image Classification: Nearest neighbors approach; Linear classifier			Lecture note; P 16.1
3. Loss Functions and Optimization: Logistic regression (softmax classifier); Loss functions; Basic optimization (gradient descent, SGD)			Lecture note; P 10.2 - 10.3, 8.1 - 8.2, 8.4; D 4.3, 5.9, 8.1 - 8.3
4. Neural Networks Basics & Backpropagation: Neural networks; Backpropagation			Lecture note; P 13.1 - 13.3; D 6.1 - 6.5
5. Convolutional Neural Networks: Convolutional layer; Stride, padding, 1x1 conv			Lecture note; P 14.1 - 14.3, 15.3; D 9.1 - 9.7
6. Training Neural Networks I: Activation functions; Data preprocessing; Data augmentation; Weight initialization			Lecture note; P 13.4, 19.1, 20.1; D 5.8.1, 6.2 - 6.3, 7.4
7. Training Neural Networks II: Regularization for neural nets; Optimization beyond SGD; Learning rate scheduling; Batch normalization			Lecture note; P 8.3, 11.3, 11.5, 13.5; D 8.1 - 8.3, 7.1 - 7.3, 7.8, 7.12; Batch norm
8. Transfer Learning, CNN Case Studies: Transfer learning; CNN case studies (AlexNet, ZFNet, VGG, GoogLeNet, ResNet, Inception-v2,3,4)			Lecture note; P 19.2; AlexNet, VGG, GoogLeNet, ResNet, Inception v2,3, Inception v4
9. Object Detection: Proposal-based (R-CNN, Fast R-CNN, Faster R-CNN, R-FCN); Proposal-free (YOLO, SSD)			P 14.4.2; R-CNN, Fast R-CNN, Faster R-CNN, R-FCN, YOLO, SSD
10. Segmentation: Semantic segmentation (DeconvNet); Instance segmentation (Mask R-CNN)			P 14.4.4; DeconvNet, U-Net, Mask R-CNN
11. Video Classification I: Video understanding tasks; Challenges with video understanding; Temporal pooling; Action recognition models (two-stream approaches, optical flow)			Temporal pooling, Two-stream model, Two-stream fusion
12. Video Classification II: 3D convolutional models (C3D, R3D, R(2+1)D, T3D); Combining 3D-conv and two-stream ideas (I3D, S3D); More recent convolutional models (SlowFast, X3D)			C3D, R3D (ArXiv, CVPR), I3D, S3D, SlowFast, X3D
13. Recurrent Neural Networks: RNN basics; LSTM/GRU			P 15.1 - 15.2; D 10.1 - 10.7, 10.10 - 10.11
14. RNN-based Video Models: Spatio-temporal modeling (LRCN, Beyond Short Snippets, ConvLSTM, ConvGRU); Attention mechanism; Attention-based video models (MultiLSTM, Visual Attention); YouTube-8M			P 15.4; LRCN, BSS, ConvLSTM, ConvGRU, MultiLSTM, Visual Attention, Attention blogpost
15. Metric Learning: Learning to rank; Triplet loss, hard negative mining (FaceNet, CDML, GCML); Contrastive learning (NCE, SimCLR)			P 16.2; FaceNet, CDML, GCML, Contrastive, NCE, SimCLR
16. Transformers: Word embeddings; Transformer, BERT; ViT, ViViT, TimeSformer (in English version only)			P 19.5.2, 15.5; Transformer (blog), BERT, ViT, ViViT, TimeSformer
17. Multimodal Representation Learning: Image-text models (ViLBERT, VL-BERT); Video-text models (VideoBERT, CBT, Hammer, HERO)			VL-BERT, ViLBERT, VideoBERT, CBT, Hammer, HERO
18. Image/Video Captioning: Image captioning (LRCN, NCE-based, Show-Attend-Tell with spatial attention); Video captioning (temporal attention)			LRCN, NCE, SAT, Video captioning
19. Generative Models I: PixelRNN, PixelCNN, Super-resolution; Autoencoders, Denoising Autoencoders (DAE); Variational Autoencoders (VAE)			P 20.3; D 14.1 - 14.9; PixelRNN/CNN, Super Resolution, DAE, VAE
20. Generative Models II: Generative Adversarial Networks (GAN); Deep Convolutional GAN (DCGAN); Wasserstein GAN, Gradient Penalty; GANs for image-to-image translation (Pix2pix, CycleGAN, DiscoGAN, StarGAN, StyleGAN)			GAN, DCGAN, WGAN, WGAN-GP, Pix2pix, CycleGAN, DiscoGAN, StarGAN, StyleGAN