Undersampling: remove samples from the majority class
can cause loss of information
ex: Tomek Links, find pairs of close samples of opposite classes and remove the majority-class sample in each pair (see the sketch after this list)
Makes the decision boundary clearer, but the model may underfit (it has less data to learn from)
only works well with low-dimensional data
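A minimal sketch of the Tomek-link idea in numpy, assuming a small dataset so the full O(n²) distance matrix fits in memory; the function name tomek_link_undersample and the majority_label argument are illustrative, not a library API:

```python
import numpy as np

def tomek_link_undersample(X, y, majority_label=0):
    """Tomek-link undersampling sketch: a pair (a, b) is a Tomek link if they
    have opposite labels and each is the other's nearest neighbour; the
    majority-class member of every such pair is removed."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    nn = dists.argmin(axis=1)                      # each point's nearest neighbour
    to_drop = set()
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j]:            # mutual nearest neighbours, opposite classes
            to_drop.add(i if y[i] == majority_label else j)
    keep = np.array([i for i in range(len(X)) if i not in to_drop])
    return X[keep], y[keep]

# Toy usage: class 0 is the majority class.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.12, 0.0], [2.0, 2.0], [2.1, 2.0]])
y = np.array([0, 0, 1, 0, 0])
X_res, y_res = tomek_link_undersample(X, y, majority_label=0)
```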
Oversampling: add more examples to the minority class
can cause overfitting
ex: SMOTE (Synthetic Minority Oversampling Technique)
Synthesizes minority-class samples as convex (linear) combinations of existing points and their nearest neighbors of the same class (see the sketch below)
only works well with low-dimensional data
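A minimal sketch of the SMOTE idea in numpy; the function smote_oversample and its n_new/k parameters are illustrative names, not the API of any particular library:

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """SMOTE sketch: each synthetic point is a convex combination
    x + u * (neighbour - x), with u ~ Uniform(0, 1), of a minority sample
    and one of its k nearest same-class neighbours."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    k = min(k, n - 1)
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]    # k nearest neighbours per point
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                          # random minority sample
        j = neighbours[i, rng.integers(k)]           # one of its neighbours
        u = rng.random()
        synthetic.append(X_min[i] + u * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

# Toy usage: 4 minority points, create 6 synthetic ones.
X_min = np.array([[0.0, 0.0], [0.2, 0.1], [0.1, 0.3], [0.3, 0.2]])
X_syn = smote_oversample(X_min, n_new=6, k=2, rng=0)
```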
Algorithm-level methods:
Instead of the naive loss where all samples contribute equally to the total, L(X, θ) = ∑_x L(x, θ)
Idea: training samples we care about should contribute more to the loss
Ex:
Cost-sensitive learning
let C_{i,j} be the cost of a sample of class i being classified as class j
The loss caused by an instance x of class i becomes the expected cost over all possible classifications of x: L(x, θ) = ∑_j C_{i,j} P(j ∣ x, θ) = C_{i,0} P(y = 0 ∣ x, θ) + C_{i,1} P(y = 1 ∣ x, θ)
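A small sketch of this expected-cost loss, assuming the model outputs a probability vector per sample; cost_sensitive_loss and the example cost matrix are illustrative:

```python
import numpy as np

def cost_sensitive_loss(probs, y_true, C):
    """Expected-cost loss: for a sample of true class i,
    L = sum_j C[i, j] * P(j | x)."""
    # probs: (n_samples, n_classes) predicted probabilities
    # C[i, j]: cost of a class-i sample being classified as class j
    per_sample = (C[y_true] * probs).sum(axis=1)
    return per_sample.mean()

# Toy binary example: misclassifying the rare class 1 costs 10x more.
C = np.array([[0.0, 1.0],
              [10.0, 0.0]])
probs = np.array([[0.9, 0.1], [0.3, 0.7]])
y_true = np.array([0, 1])
loss = cost_sensitive_loss(probs, y_true, C)   # (1*0.1 + 10*0.3) / 2
```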
class-balanced loss
Give more weight to rare classes - then you incentivize the model to learn to classify them better.
L(X, θ) = ∑_i W_{y_i} L(x_i, θ)
W_c = N / N_c, where N is the total number of samples and N_c is the number of samples of class c
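A small sketch of these inverse-frequency weights and the weighted loss; the helper names are illustrative:

```python
import numpy as np

def class_balanced_weights(y):
    """W_c = N / N_c: inverse-frequency weight for each class."""
    classes, counts = np.unique(y, return_counts=True)
    return dict(zip(classes, len(y) / counts))

def class_balanced_loss(per_sample_losses, y, weights):
    """L(X, θ) = sum_i W_{y_i} * L(x_i, θ)."""
    w = np.array([weights[label] for label in y])
    return (w * per_sample_losses).sum()

# Toy usage: 4 samples of class 0, 1 sample of class 1.
y = np.array([0, 0, 0, 0, 1])
weights = class_balanced_weights(y)            # {0: 1.25, 1: 5.0}
loss = class_balanced_loss(np.array([0.2, 0.1, 0.3, 0.2, 0.9]), y, weights)
```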
focal loss
Give more weight to the examples that the model is having difficulty with.
down-weights well-classified samples
let p be the model's estimated probability for class y = 1, and define p_t = p if y = 1, p_t = 1 − p otherwise
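A sketch of the focal loss using the standard form −(1 − p_t)^γ · log(p_t), i.e., a modulating factor on the cross-entropy term; γ = 2 is a common default, and the function name is illustrative:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0):
    """Focal loss sketch: FL(p_t) = -(1 - p_t)^gamma * log(p_t),
    where p_t = p if y == 1, else 1 - p."""
    p_t = np.where(y == 1, p, 1.0 - p)
    return -((1.0 - p_t) ** gamma) * np.log(p_t)

# A well-classified positive (p = 0.95) is strongly down-weighted
# compared with a hard one (p = 0.3).
p = np.array([0.95, 0.3])
y = np.array([1, 1])
losses = focal_loss(p, y)    # ~[0.00013, 0.59]
```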
Autoencoders:
If the autoencoder has no (non-linear) activation function, it learns the same linear subspace as PCA
f(x)=s(wx+b)=z
s is activation function such as sigmoid
z is the latent representation
g(z) = s(w_gᵀ z + b_g) = x̂
Then h(x) = g(f(x)) = x̂
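A minimal PyTorch sketch of this one-hidden-layer autoencoder (sigmoid encoder/decoder, squared-error reconstruction loss); the 784/32 dimensions and the dummy batch are placeholders:

```python
import torch
import torch.nn as nn

# f(x) = s(Wx + b) = z  (encoder),  g(z) = s(W_g^T z + b_g) = x_hat  (decoder)
class Autoencoder(nn.Module):
    def __init__(self, in_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, latent_dim), nn.Sigmoid())
        self.decoder = nn.Sequential(nn.Linear(latent_dim, in_dim), nn.Sigmoid())

    def forward(self, x):
        z = self.encoder(x)          # latent representation
        return self.decoder(z)       # reconstruction x_hat

model = Autoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

x = torch.rand(64, 784)              # dummy batch
opt.zero_grad()
loss = loss_fn(model(x), x)          # reconstruction error ||x - x_hat||^2
loss.backward()
opt.step()
```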
pretraining helps the model start with weights that have already been optimized for general patterns, improving learning efficiency and potentially leading to better performance
Pretraining process with autoencoders
Pretraining step: train a sequence of shallow autoencoders, greedily one layer at a time, using unsupervised data
Fine-tuning step 1: train the last layer using supervised data
Fine-tuning step 2: use backpropagation to fine-tune the entire network using supervised data.
Why does this work? It is easier to train one layer at a time, and the pretraining step can use unlabeled data
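A hedged PyTorch sketch of the greedy layer-wise procedure, assuming sigmoid activations and MSE reconstruction; layer sizes, epoch count and the classifier head are placeholders, and the fine-tuning steps are only indicated:

```python
import torch
import torch.nn as nn

def pretrain_layer(layer, X, epochs=10, lr=1e-3):
    """Pretraining step: train one shallow autoencoder (encoder = `layer`,
    plus a throwaway decoder) to reconstruct its own input."""
    decoder = nn.Linear(layer.out_features, layer.in_features)
    opt = torch.optim.Adam(list(layer.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        z = torch.sigmoid(layer(X))
        x_hat = torch.sigmoid(decoder(z))
        loss = nn.functional.mse_loss(x_hat, X)
        loss.backward()
        opt.step()
    return torch.sigmoid(layer(X)).detach()      # codes fed to the next layer

# Unlabeled data (dummy); sizes are placeholders.
X_unlab = torch.rand(256, 100)
layers = [nn.Linear(100, 64), nn.Linear(64, 32)]

# Greedy: one layer at a time, on unsupervised data.
inp = X_unlab
for layer in layers:
    inp = pretrain_layer(layer, inp)

# Fine-tuning: stack the pretrained layers, add a classifier head (10 classes
# here as a placeholder), train the head on labeled data, then backpropagate
# through the whole network (supervised steps not shown).
net = nn.Sequential(layers[0], nn.Sigmoid(), layers[1], nn.Sigmoid(), nn.Linear(32, 10))
```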
Kernel methods/tricks:
Feature mapping:
Add new dimensions to the feature vector so that the data becomes linearly separable
example: x = (x_1, x_2) → z = (x_1², √2·x_1·x_2, x_2²) = ϕ(x)
Cons: more expensive to train, requires more training examples (the dimensionality grows)
Kernel methods:
Rewrite linear models so that the mapping never needs to be explicitly computed; the model only depends on dot products between pairs of examples
Replace the dot product ϕ(x)ᵀϕ(z) by k(x, z)
example: k(x, z) = (xᵀz)² corresponds to the mapping ϕ(x) = (x_1², √2·x_1·x_2, x_2²)
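A quick numerical check that the kernel form and the explicit mapping give the same value for this degree-2 polynomial kernel; phi and k are just local helper names:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (2-D input)."""
    x1, x2 = x
    return np.array([x1**2, np.sqrt(2) * x1 * x2, x2**2])

def k(x, z):
    """Kernel form: k(x, z) = (x^T z)^2, no explicit mapping needed."""
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
print(phi(x) @ phi(z))   # 16.0 -- computed in the feature space
print(k(x, z))           # 16.0 -- same value, computed in the input space
```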
Kernels: Formally defined
each kernel k has an associated feature mapping ϕ that takes an input x ∈ χ (input space) and maps it to F (feature space)
The kernel k takes two inputs and gives their similarity in F, k: χ × χ → ℝ, k(x, z) = ϕ(x)ᵀϕ(z)
F needs to be a vector space with a dot product defined on it, also called a Hilbert space
Mercer’s condition
Not every function can be used as a kernel function; it must satisfy Mercer's condition
For k to be a kernel function:
There must exist a Hilbert Space F for which k defines a dot product
The above is true if and only if k is a positive semi-definite function: for any finite set of points, the Gram matrix K with K_{ij} = k(x_i, x_j) is symmetric and positive semi-definite
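An empirical sanity check of the positive semi-definite condition on a finite sample: build the Gram matrix of the degree-2 polynomial kernel and confirm its eigenvalues are non-negative (a necessary check on one sample, not a proof of Mercer's condition); gram_matrix and the random points are illustrative:

```python
import numpy as np

def gram_matrix(kernel, X):
    """Gram matrix K[i, j] = k(x_i, x_j) for a set of points."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

poly2 = lambda x, z: np.dot(x, z) ** 2          # degree-2 polynomial kernel

X = np.random.default_rng(0).normal(size=(20, 2))
K = gram_matrix(poly2, X)
eigvals = np.linalg.eigvalsh(K)                  # K is symmetric
print(eigvals.min() >= -1e-10)                   # True: positive semi-definite (up to rounding)
```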