Definition:
- Features: W_i is the word at position i
- Predict label conditioned on feature variables
- Assume features are conditionally independent given label
Model:
- P(Y, W_1, ..., W_n) = P(Y) ∏_i P(W_i | Y)
Prediction:
- Ŷ = argmax_y P(y, W_1, ..., W_n)
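
A minimal sketch of this prediction rule, using made-up toy probability tables (the words and numbers are illustrative, not from the notes):

```python
# Minimal sketch of the Naive Bayes prediction rule with toy (made-up) numbers.
# P(Y) and P(W_i | Y) would normally be estimated from training data.
prior = {"spam": 0.4, "ham": 0.6}
cond = {
    "spam": {"free": 0.30, "hello": 0.05, "meeting": 0.02},
    "ham":  {"free": 0.02, "hello": 0.20, "meeting": 0.15},
}

def predict(words):
    # Y_hat = argmax_y P(y) * prod_i P(w_i | y)
    best_label, best_score = None, 0.0
    for y in prior:
        score = prior[y]
        for w in words:
            score *= cond[y].get(w, 1e-6)  # tiny default for unseen words
        if score > best_score:
            best_label, best_score = y, score
    return best_label

print(predict(["free", "hello"]))     # -> "spam" with these toy numbers
print(predict(["hello", "meeting"]))  # -> "ham"
```
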
Parameters:
- for each word, there is a probability of that word given the class
- ex: Spam Email Filter: P(Free∣spam)>P(hello∣spam)
- MLE for Naive Bayes Spam Classifier:
- For each word i, estimate a single parameter θ for P(F_i | Y = ham)
- L(θ) = ∏_{j=1}^{N_h} P(F_i = f_i^(j) | Y = ham) = ∏_{j=1}^{N_h} θ^{f_i^(j)} (1−θ)^{1−f_i^(j)}
- P(F_i = f_i^(j) | Y = ham) = θ if f_i^(j) = 1, and 1−θ if f_i^(j) = 0
- MLE for P(F_i | Y = ham): θ = (1/N_h) ∑_{j=1}^{N_h} f_i^(j)
- Beware of overfitting: problems with relative-frequency parameters (see the sketch after this list)
- Unlikely to see occurrences of every word in the training data.
- Likely to see occurrences of a word for only one class in the training data.
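
A small sketch of the relative-frequency (MLE) estimate and the zero-count problem above, assuming each ham email is represented as a set of words (the toy emails are made up):

```python
# MLE estimate of theta = P(F_i = 1 | Y = ham): the fraction of ham emails
# containing word i (toy, made-up training emails).
ham_emails = [
    {"hello", "meeting", "tomorrow"},
    {"hello", "lunch"},
    {"report", "meeting"},
]

def mle_theta(word, emails):
    # theta = (1 / N_h) * sum_j f_i^(j), where f_i^(j) = 1 if word i appears in email j
    return sum(word in e for e in emails) / len(emails)

print(mle_theta("hello", ham_emails))  # 2/3
print(mle_theta("free", ham_emails))   # 0.0 -> an unseen word gets probability 0,
                                       # which zeroes out the whole product at test time
```
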
Parameter estimation:
- P_θ(x) means the probability of event x occurring, given the parameter θ.
with maximum likelihood:
- Estimating the distribution of a random variable
- Empirically: use training data (learning!)
- E.g.: red and blue
- For a simple example of guessing whether a bean is red or blue, the parameter θ is the probability of the bean being red
- for each outcome x, look at the empirical rate of that value: P(r) = count(r) / (number of samples)
- P_θ(x=r) = θ, P_θ(x=b) = 1−θ
- Maximum Likelihood Estimation: L(x, θ) = ∏_i P_θ(x_i) = θ · θ · (1−θ) (e.g., for two red samples and one blue)
- a function that assigns a value to different possible parameter values based on how well they explain the observed data.
- Higher values of L(x, θ) indicate that the parameter value θ is more likely to have resulted in the observed data x
- General case of n observations (coin flips):
- P(H) = θ, P(T) = 1−θ
- Flips are independent and identically distributed: D = {x_i | i = 1, ..., n}, P(D|θ) = ∏_i P(x_i|θ)
- For the observed sequence D, P(D|θ) = θ^{α_H} (1−θ)^{α_T}, where α_H and α_T are the numbers of heads and tails
- Hypothesis space: binomial distributions
- Learning: finding the θ that is optimal
- MLE: choose the θ that maximizes the likelihood: θ̂ = argmax_θ P(D|θ) = argmax_θ ln P(D|θ)
- ex: 2 heads and 1 tail → argmax_θ ln(θ^2 (1−θ))
- for n observations: d/dθ ln P(D|θ) = d/dθ ln[θ^{α_H} (1−θ)^{α_T}] = 0 → θ̂_MLE = α_H / (α_H + α_T)
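
A quick numeric check of the closed-form MLE, using a made-up sequence of flips and a brute-force search over the log-likelihood:

```python
import math

flips = ["H", "H", "T"]            # made-up data: alpha_H = 2, alpha_T = 1
alpha_H = flips.count("H")
alpha_T = flips.count("T")

# Closed-form MLE: theta_hat = alpha_H / (alpha_H + alpha_T)
theta_hat = alpha_H / (alpha_H + alpha_T)
print(theta_hat)                   # 0.666...

# Brute-force check: maximize ln P(D | theta) = alpha_H*ln(theta) + alpha_T*ln(1-theta)
def log_lik(theta):
    return alpha_H * math.log(theta) + alpha_T * math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]
best = max(grid, key=log_lik)
print(best)                        # ~0.667, matching the closed form
```
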
Smoothing:
Laplace Smoothing:
- Laplace’s estimate:
- Pretend you saw every outcome once more than you actually did
- P_LAP(x) = (c(x)+1) / ∑_x' [c(x')+1] = (c(x)+1) / (N + |X|), with Dirichlet priors
- Laplace’s extended estimate:
- Pretend you saw every outcome k times more than you actually did
- P_LAP,k(x) = (c(x)+k) / (N + k|X|), where k is the strength of the prior
- Laplace for conditionals:
- Smooth each conditional distribution independently: P_LAP,k(x|y) = (c(x,y)+k) / (c(y) + k|X|)
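
A minimal sketch of the Laplace (add-k) estimate, assuming raw counts are stored in a dict (the counts are made up); larger k pulls the estimate toward uniform:

```python
# Laplace (add-k) smoothing: P_LAP,k(x) = (c(x) + k) / (N + k * |X|)
def laplace(counts, k=1):
    N = sum(counts.values())
    V = len(counts)                 # |X|: number of possible outcomes (keys of the dict)
    return {x: (c + k) / (N + k * V) for x, c in counts.items()}

counts = {"red": 2, "blue": 1}      # made-up counts
print(laplace(counts, k=0.001))     # ~ relative frequencies (weak prior)
print(laplace(counts, k=1))         # {'red': 0.6, 'blue': 0.4}
print(laplace(counts, k=100))       # ~ uniform (strong prior pulls toward 1/|X|)
```
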
Naive Bayes classifier:
- ŷ = argmax_y P(Y=y | X=x) = argmax_y P(X=x | Y=y) P(Y=y) / P(X=x) = argmax_y P(Y=y) ∏_{i=1}^{d} P(X_i = x_i | Y=y)
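
Putting the pieces together, a sketch of the full classifier on a hypothetical toy dataset: it estimates P(Y) from label counts, estimates the word conditionals with Laplace smoothing, and predicts with the log of the product above (all data and names below are illustrative):

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training set: (label, words) pairs.
train = [
    ("spam", ["free", "money", "free"]),
    ("spam", ["free", "offer"]),
    ("ham",  ["hello", "meeting", "tomorrow"]),
    ("ham",  ["lunch", "tomorrow"]),
]

k = 1                                        # Laplace smoothing strength
labels = {y for y, _ in train}
vocab = {w for _, ws in train for w in ws}
prior = Counter(y for y, _ in train)         # counts of each label
word_counts = defaultdict(Counter)           # word_counts[y][w] = c(w, y)
for y, ws in train:
    word_counts[y].update(ws)

def predict(words):
    # y_hat = argmax_y  log P(y) + sum_i log P_LAP,k(w_i | y)
    def score(y):
        total = sum(word_counts[y].values())
        s = math.log(prior[y] / len(train))
        for w in words:
            s += math.log((word_counts[y][w] + k) / (total + k * len(vocab)))
        return s
    return max(labels, key=score)

print(predict(["free", "tomorrow"]))         # -> "spam" with this toy data
```

Working in log space avoids underflow from multiplying many small probabilities, and the smoothed conditionals keep unseen words from zeroing out a class.
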