Machine Learning

Definition:

Learn and improve from “experience” without emperically programmed
Most central problem is generalization, how to generalize the data learned to the data that the model never seen before
- principle of induction
Data set
- comes from a set of samples
- Each sample is a pair of input-output data
  - input $X$ : a set of features (independent variables)
  - output $Y$ : a label, a set of features (dependent variable)
  - Test set: final exam to test the ability of model to generalize
    - if test given is too different from what was taught, the algorithm can’t generalize pass its experience, it is a bad test
    - also a bad test if the test is exact same as what was taught, it doesn’t generalize at all
    - → a good test should test the ability to generalize
Target function $f : X \to Y$
- unknown
Function hypotheses $H = {h ∣ h : X \to Y}$
- a set of possible functions that can map input to output
- We aim to find $h \in H$ such best approximates $f$
Loss function: $l (y, f (x))$
- determines how bad is the predicted label compared to the true label
- ex: in classification, $l (y, f (x)) = 0$ if $y = f (x)$ , 1 if $y \neq = f (x)$
Data generating distribution: $D (x, y)$
- Probability density function over pair $x, y$
Expected loss: $E_{(x, y) \sim D} [l (y, f (x))] = \sum_{(x, y)} D (x, y) l (y, f (x))$
- We wish to minimize expected loss
- as we dont know $D$ but only samples of $D$ , we assume Law of large number, therefore, the sample has same distribution as D
- $\to E_{(x, y) \sim D} [l (y, f (x))] \approx \frac{1}{N} \sum_{n = 1}^{N} l (y^{(n)}, f (x^{(n)}))$

Two approaches in learning:

Eager learning:
- ex: Decision trees
- Learn/train: induce from an abstract model from the data
- Test/Predict/Classify: apply learned model to new data
Lazy learning:
- ex: Nearest neighbor
- Learn: store data in memory
- Test/Predict/Classify: compare new data to stored data
- Properties:
  - retians all information seen in training
  - complex hypothesis space
  - classification can be slow
To predict the target/label using features: Supervised Learning
To find interesting patterns in data: Unsupervised learning
Semi-supervised learning
Reinforcement Learning
Deep Learning

Inductive Bias:

What the machine learns might differ from what we intended
- ex: recognizing US vs Russian tanks model performs bad due to the quality of input pictures
Due to not enough variety of data to narrow down the relevant concept to learn

Fitting

Training Error over training data, $error_{train} (h)$
True Error over training data, $error_{true} (h)$
Overfitting
- if $error_{train} (h) < error_{test} (h)$
  - test error as we cant know the true error
- Amount of overfitting = $error_{true} (h) - error_{test} (h)$

Decision Tree

Classification Model and Nearest Neighbor for Classification

Perceptron Training

Linear Classifier

HMM

Model evaluation

Practical Consideration:

Hypothesis Testing

Accuracy score of once cant tell if a model is better than another, perhaps due to chance

Debugging:

Check implementation is correct
- measure loss rather than accuracy
- replace it with another easier dataset (toy)
Is data too noisy?
is learning problem too hard?

Strategies for isolating causes of errors:

Is representation adequate?
- Can you learn if you add a cheating feature that perfectly correlates with correct class?
Train/test mismatch?
- Try re-selecting train/test by shuffling training data and test together
Do you have enough data? Try training on 80% of the training set, how much does it hurt performance?

Hyperparameter tuning with validation set vs cross-validation

Cross-validation: loop through every hyperparameter setting for a hyperparameter and try which one leads to least error
N-fold cross validation:
- Instead of a single test-training split, split data into N equal-sized parts
- Train and test N different classifiers and report average & std.deviation accuracy

Model evaluation

Formalizing errors:
- $error (f) = [error (f) - min_{f^{*} \in F} error (f^{*})] + [min_{f^{*} \in F} error (f^{*})]$
  - Estimation error (variance) + approximation error (bias)
    - estimation error = error of chosen f to optimal $f^{*}$
    - approximation error = error of $f^{*}$ of the hypothesis chosen
  - F is set of all possible classifiers of a hypothesis
  - $f^{*}$ is optimal classifier

Bias/variance trade-off

if the learning algorithm is too flexible, it may fit each training dataset differently, high variance
- only one of bias or variance can be reduced at once and the optimum point should be chosen

Class imbalance problem

How to deal with:
1. Choose the right metrics: choose the one that is more relevant and important (ex: recall on cancer prediction is more important)
  - Symmetric vs asymmetric metrics
2. Data-level methods: resampling
  - Undersampling: remove samples from the majority class
    - can cause loss of information
    - ex: Tomek Links, find pairs of close samples of opposite classes and remove the majority class in each pair
      - Make decision boundary clear but make model underfit (learn less)
      - only work with low-dimension data
  - Oversampling: add more examples to the minority class
    - can cause overfitting
    - ex: SMOTE
      - Synthesize samples of minority class as convex (linear) combinations of existing points and their nearest neighbors of same class
      - only work with low-dimension data
3. Algorithm-level methods:
  - Instead of naive loss where all samples contribute equally to the loss, $L (X, θ) = \sum_{x} L (x, θ)$
  - Idea: training samples we care about should contribute more to the loss
  - Ex:
    - Cost-sensitive learning
      - let $C_{i, j}$ be cost of class i but classified as class j
      - The loss caused by instance x of class i will become the weighted average of all possible classifications of instance x: $L (x, θ) = \sum_{j} C_{i, j} P (j ∣ x, θ) = C_{i 0} P (y = 0∣ x, θ) + C_{i 1} P (y = 1∣ x, θ)$
    - class-balanced loss
      - Give more weight to rare classes - then you incentivize the model to learn to classify them better.
      - $L (X, θ) = \sum_{i} W_{y_{i}} L (x_{i}, θ)$
      - $W_{c} = \frac{N}{number of samples of class C}$
    - focal loss
      - Give more weight to the examples that the model is having difficulty with.
        
        downweighs well-classified samples
      - estimates probability of the model for class $y = 1 : p_{1} = {p 1 - p if y = 1 otherwise$
      - Cross-entropy: $CE (p_{t}) = - lo g (p_{t})$
      - Focal loss: $F L (p_{t}) = - (1 - p_{t})^{γ} lo g (p_{t})$

A probabilistic view:

Bayes theorem
Joint Distribution
Bayes Optimal Classifier
- Assume we know the data generating function, $D$
- We define Bayes Optimal Classifier as $f^{BO} (\overset{x}{^}) = ar g max_{\overset{y}{^}} D (\overset{x}{^}, \overset{y}{^})$
- Theorem: Of all the classifiers, the Bayes Optimal Classifier achieve the smallest zero/one loss
Training = estimating $D$ from the finite training set
- typically estimate $D$ as parameters of a probability distributions
- assumption: independent and identically distributed
- with maximum likelihood
- If $X_{i} ⊥ X_{j} ∣ Y, P (X_{1}, X_{2}, ..., X_{d}) = \prod_{i = 1}^{d} P (X_{i} ∣ Y)$
- Naive Bayes classifier

Logistic regression:

Binary classification:
- $P (Y^{(i)} = 1 ∣ X^{(i)}, θ) = g (< θ, X^{(i)} >) P (Y^{(i)} = 0 ∣ X^{(i)}, θ) = 1 - g (< θ, X^{(i)} >)$
- Predicted label $i$ of $y$ is $g (< θ, X^{(i)} >)$ which is a function that takes in a set of parameters $θ$ and sample $i$ of input $X$
We let $g (x)$ be a Sigmoid Function as it range from 0 to 1
Logistic regression with Maximum Likelihood Estimation:
- $θ max i = 1 \prod N P (Y^{(i)} ∣ X^{(i)}, θ) = θ max i = 1 \prod N g (< θ, X^{(i)} >)^{Y^{(i)}} (1 - g (< θ, X^{(i)} >))^{1 - Y^{(i)}}$
  - the parameter that make the product of observed samples as likely as possible
- taking $lo g$
- $max_{θ} \sum_{i = 1}^{N} Cross-entropy loss function Y^{(i)} lo g g (< θ, X^{(i)} >) + (1 - Y^{(i)}) lo g (1 - g (< θ, X^{(i)} >))$
  - Solve this by Gradient Descent and Properties
- Let $ℓ (θ) = y lo g (g (θ^{T} x)) + (1 - y) lo g (1 - g (θ^{T} x))$
- $\frac{\partial}{\partial θ _{j}} ℓ (θ) = \frac{\partial}{\partial θ _{i}} [y lo g (g (θ^{T} x)) + (1 - y) lo g (1 - g (θ^{T} x))]$
- $\frac{\partial}{\partial θ _{j}} ℓ (θ) = (y \frac{1}{g ( θ ^{T} x )} - (1 - y) \frac{1}{1 - g ( θ ^{T} x )}) \frac{\partial}{\partial θ _{j}} g (θ^{T} x) = (y \frac{1}{g ( θ ^{T} x )} - (1 - y) \frac{1}{1 - g ( θ ^{T} x )}) g (θ^{T} x) (1 - g (θ^{T} x)) \frac{\partial}{\partial θ _{j}} θ^{T} x = (y (1 - g (θ^{T} x)) - (1 - y) g (θ^{T} x)) x_{j} = (y - h_{θ} (x)) x_{j}$
  - Solve this by Gradient Descent and Properties, Stochastic gradient descent $θ_{k + 1} = θ_{k} + η (Y^{i} - g (< θ, X^{i} >)) X^{(i)}$

Multiclass classification:

Classification Model
Reductions for Multi-class Classification
- Given
  1. An input space $X$ and number of classes $K$
  2. An unknown distribution $D$ over $X \times [K]$
- Learning: Find a function $f$ minimizing $E_{(x, y) \sim D} [f (x) \neq = y]$
- In most task, this works for $K < 100$ , for larger $K$ , frame problem differently
- Reduction 1: One-versus-All
  - Train K number of binary classifiers, each classifier classifies whether a sample is in that class or not
  - At test time
    - If only one classifier predicts positive, predict that class
    - if not, Break ties randomly
  - OneVersusAllTraing(D_multiclass, BinaryTraing)
```
  for i = 1 to K:
    D_bin <- relabel D_multiclass so class i is positive and !i is negative
    f_i <- BinaryTrain(D_bin)
 endfor
  return f_1, f_2,..,f_K
```
  - OneVersusAllTest(f_1, f_2,...,f_K, hat_x)
```
score <- <0,0,...,0> //initialize K-many scores to 0
for i = 1 to K:
    y <- f_i(hat_x)
    score_i <- score_i + y
end for
return argmax_k(score_k)
```
- Reduction 2: All-versus-All
  - Train $K (K - 1) /2$ number of binary classifiers, each classifier classifies whether a sample is in that class i or class j for every pair of class
Recall that:
- $X^{(i)} \to function (example: linear) θ^{T} x \to g (.) (sigmoid) \to g (θ^{T} X^{(i)})$
- $g (θ^{T} X^{(i)} \approx 1 \to Y^{(i)} = 1$
- $g (θ^{T} X^{(i)} \approx 0 \to Y^{(i)} = 0$
- Let $θ_{0}$ be array of $0$
- $P (Y = 1 ∣ X, θ) = g (θ^{t} X) = \frac{1}{1 + e ^{- θ^{T} X}} = \frac{e ^{θ^{T} X}}{1 + e ^{θ^{T} X}} = \frac{e ^{θ^{T} X}}{e ^{θ_{0}^{t} X} + e ^{θ^{T} X}}$
- $P (Y = 0 ∣ X, θ) = 1 - g (θ^{t} X) = \frac{1}{1 + e ^{θ^{T} X}} = \frac{e ^{θ_{0}^{T} X}}{e ^{θ_{0}^{t} X} + e ^{θ^{T} X}}$
- → softmax function
This can be extended to work with multl-label classification:
Recall the binary case

θ max i = 1 \sum N Y^{(i)} lo g g (< θ, X^{(i)} >) + (1 - Y^{(i)}) lo g (1 - g (< θ, X^{(i)} >))

Multi-label case

θ max all samples \sum 1 {Y^{(i)} = label} lo g (corresponding confidence prob.)

Neural Network

PCA

better representations of our data point
- to visualize better, remove noise, less resource requirements, better classification
Sample variance of data project on vector $v$ is $\sum_{i = 1}^{n} (x_{i}^{T} v)^{2} = (X v)^{T} (X v) = v^{T} X^{T} X v$
Objective: Finding vector that maximize sample variance of projected data → Lagrangian folds constants into objective:
- $ar g max_{v} v^{T} X^{T} X v - λ (v^{T} v - 1)$
- solution: $v$ such $X^{T} X v = λ v$
The eigenvalue $λ$ denotes the amount of variability captured along dimension $v$
- sample variance of projection $v^{T} X^{T} X v = λ$
if we rank eigenvalues from large to small
- the 1st principle component of $X^{T} X$ is associated with largest eigenvalue
- the 2nd principle component of $X^{T} X$ is associated with 2nd largest eigenvalue
It also minimizes construction error: $\frac{1}{n} \sum_{i = 1}^{n} ∣∣ x_{i} - (v^{T} x_{i}) v ∣ ∣^{2}$
limit to only linear projection and only based on covariance

Autoencoder

Its non-linear dimensionality reduction method
If autoencoder doesnt have activation function, it is works same as PCA
$f (x) = s (w x + b) = z$
- $s$ is activation function such as sigmoid
- $z$ is the latent representation
$g (z) = s (w_{g}^{T} z + b_{g}) = \overset{x}{^}$
Then $h (x) = g (f (x)) = \overset{x}{^}$
pretraining helps the model start with weights that have already been optimized for general patterns, improving learning efficiency and potentially leading to better performance
Pretraining process with autoencoders
- Pretraining step: train a sequence of shallow autoencoders, greedily one layer at a time, using unsupervised data
- Fine-tuning step 1: train the last layer using supervised data
- Fine-tuning step 2: use backpropagation to fine-tune the entire network using supervised data.
- Why does this work?: it is easier to train one layer at a time and can utilize unlabeled data

Kernel methods/tricks:

Feature mapping:
- Add new dimensions to the vector so that its linearly separable
- example: $x = {x_{1}, x_{2}} \to z = {x_{1}^{2}, 2 x_{1} x_{2} . x_{2}^{2}} = ϕ (x)$
- Cons: more expensive to train, require more training examples
Kernel methods:
- rewrite linear models so that the mapping never needs to be explicitly computed, only depend on the dot products between 2 examples
  - $x$ and $z \to L (ϕ (x)^{T} ϕ (z))$
- Replace dot product $ϕ (x)^{T} ϕ (z)$ by $k (x, z)$
- example: $k (x, z) = (x^{T} z)^{2}$ is as same as $ϕ (x) = {x_{1}^{2}, 2 x_{1} x_{2}, x_{2}^{2}}$
Kernels: Formally defined
- each kernel $k$ has an associated feature mapping $ϕ$ that takes input $x \in χ$ (input space) and maps to $F$ (feture space)
- Kernel k takes 2 inputs and gives their similarity in $F$ space, $k : χ \times χ \to R, k (x, z) = ϕ (x)^{T} ϕ (z)$
- $F$ needs to be s vector space with a dot product defined on it, also called Hilbert Space
Mercer’s condition
- Not all function can be used as a kernal function, it must satisfy Mercer’s condition
- For $k$ to be a kernel function:
  - There must exist a Hilbert Space $F$ for which $k$ defines a dot product
  - The above is true if $K$ is a positive definite function
  - $\int d x \int d z f (x) k (x, z) f (z) > 0 (\forall f \in L_{2})$
    - For all square integrable functions $f$
Constructing combinations of kernels:
- Let $k_{1}, k_{2}$ be two kernel functions then the following are also kernel fns:
  - $k (x, z) = k_{1} (x, z) + k_{2} (x, z)$ : direct sum
  - $k (x, z) = α k_{1} (x, z)$ : scalar product
  - $k (x, z) = k_{1} (x, z) * k_{2} (x, z)$ : direct sum
- Commonly used kernel fns:
  - Linear (trivial) kernel: $k (x, z) = x^{T} z$ (mapping fn $ϕ$ is identity - no mapping)
  - Quadratic kernel: $k (x, z) = (x^{T} z)^{2}$ or $(1 + x^{T} z)^{2}$
  - Polynomial kernel (of degree d): $k (x, z) = (x^{T} z)^{d}$ or $(1 + x^{T} z)^{d}$
  - Radial basis function (RBF) kernel: $k (x, z) = exp [- γ ∣∣ x - z ∣ ∣^{2}]$
Kernelizing the perceptron:
- Naive approach: train a perceptron in the new feature space
  - $a \leftarrow w . ϕ (x) + b$
  - if $y a \leq 0$ :
    - $w \leftarrow w + y ϕ (x)$
    - $b \leftarrow b + y$
- Perceptron representer theorem: During a run of the perceptron algorithm, the weight vector w can always be represented as a linear combination of the expanded training data
  - $w = \sum_{n} a_{n} ϕ (x_{n})$
- We can use the perceptron representer theorem to compute activation as a dot product between examples:
  - $w . ϕ (x) + b = (\sum_{n} a_{n} ϕ (x_{n})) . ϕ (x) + b = \sum_{n} a_{n} [ϕ (x_{n}) . ϕ (x)] + b$

Support vector machines

In the dataset that is linearly separable, there are many hyperplane that helps us classify the dataset. But what is the “best” hyperplane?
SVM: a hyperplane based linear classifier defined by $w, b$
Assume that the data is linearly separable
- The problem reduced to
  - $min_{w, b} ∣∣ w ∣ ∣^{2}$
  - s.t $y_{i} (w^{T} x_{i} + b) \geq 1, i = 1, 2, \dots, n$
- Large margin ←> small $∣∣ w ∣∣$ → good generalization
- How to solve?
  - Lagranian → achieve dual problem
  - Solve the maximum of dual function in dual problem
  - According to KKT condition, the $λ$ in the lagranian is mostly 0.
If the data is not linearly separable:
- → add the slack variable
- $min_{w, b} ∣∣ w ∣ ∣^{2} + C \sum_{i} ϵ_{i}$
- s.t $y_{i} (w^{T} x_{i} + b) \geq 1 - ϵ_{i}, ϵ_{i} > 0, i = 1, 2, \dots, n$
- small $C$ → large margin and more misclassified samples
- large $C$ → small margin and less misclassified samples
Soft SVM:
- $ϵ_{i} = ma x (1 - y_{n} (w^{T} x_{n}), 0)$

Self-supervised learning:

A version of unsupervised learning where data provides the supervision.
- A type of representation learning.
In general, withhold some part of the data and the task a neural network to predict it from the remaining parts.
- Self-supervised learning tasks are also known as pretext tasks.
Details decide what pretext task the network tries to solve
Depending on the quality of the pretext task, good semantic features can be obtained without actual labels.
Cognitive principles:
- reconstruct from corrupted (or partial) version
  - Denoising Autoencoder
  - in-painting
  - Colorization, Split-brain encoder (split image into 2, each generate the other half)
- visual common sense tasks
  - relative patch prediction
  - jigsaw puzzles
  - rotation
- contrastive learning
  - word2vec
  - contrastive predictive coding (CPC)
  - Instance discrimination
- In general, considering spatial and temporal dimensions
  - Predict any part of the input from any other part
  - predict future from the past/recent past
  - preduct past from present
  - predict top from bottom
  - predict the occluded from the visible
  - → pretend there is a part of the input you dont know and predict that

Generative Modeling:

Simple generative models:
- histograms
  - If we know $p_{k}$ for each data point of a history
  - We can sample by having cumulative distribution from $F_{i} = p_{1} + ... + p_{i}$ for all $i$
  - Then return the smallest $i$ for $r an d (0, 1) < F_{i}$
  - → new datapoint is created
  - → but it comes too expensive in higher dimetion
  - → training in data distribution, poor generalization
Parameterized distribution: Likelihood-based Generative Model
- approximate the function with parameters
- Learn $θ$ so that $p_{θ} (x) \approx p_{d a t a} (x)$
- There will be many choices for model design, each with different tradeoffs and different compatibility criteria
- Set up model class: a set of parameterized distribution $p_{θ}$
- Search/optimization problem $ar g min_{θ} loss (θ, x^{1}, x^{2}, ...)$
- the loss fn and search problem must work with large dataset and yield $θ$ that $p_{θ}$ matches $p_{d a t a}$ → loss is small enough
- Maximum likelihood, $ar g min)_{θ} loss (θ, x^{(1)}, x^{(2)}, ...) = \frac{1}{n} \sum_{i = 1}^{n} - lo g p_{θ} (x^{(1)})$
- → enough dataabd model family is expressive enought → solving it will yield parameters
- Solve with Stochastic gradient descent $ar g min_{θ} E_{x \sim \overset{p}{^}_{d a t a}} [- lo g p_{θ} (x)]$
- Equivalent to minimizing KL divergence between the empirical data distribution and the model
  - .

StrixTheKiet Notes

Explorer

Machine Learning

Definition:

Two approaches in learning:

Inductive Bias:

Fitting

Decision Tree

Classification Model and Nearest Neighbor for Classification

Perceptron Training

Linear Classifier

HMM

Model evaluation

Practical Consideration:

Hypothesis Testing

Debugging:

Strategies for isolating causes of errors:

Hyperparameter tuning with validation set vs cross-validation

Model evaluation

Bias/variance trade-off

Class imbalance problem

A probabilistic view:

Logistic regression:

Multiclass classification:

Neural Network

PCA

Autoencoder

Kernel methods/tricks:

Support vector machines

Self-supervised learning:

Generative Modeling:

Graph View

Table of Contents

Backlinks