Definition:

  • Properties:
    • Online:
      • Look at one example at a time, and update the model as soon as we make an error
      • Compared to batch algorithms that update parameters after seeing the entire training dataset
    • Error-driven: We only update parameters if we make an error.
  • Practical considerations:
    • The order of training examples matters; presenting them in random order works better
    • Early stopping: to avoid overfitting
    • Simple modifications dramatically improve performance: voting or averaging
  • The hypothesis is now a hyperplane that separates a linearly separable sample
    • The bias scalar b can be folded into the weight vector by adding an extra always-1 feature to every example (see the sketch after this list)
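As a quick illustration of the bias-folding trick, here is a minimal numpy sketch (the weight/example values and the names w_aug, x_aug are made up for illustration): appending a constant-1 feature to x and appending b to w gives exactly the same activation as w·x + b.

```python
import numpy as np

w = np.array([0.5, -1.2, 0.3])   # weights (illustrative values)
b = 0.7                          # bias
x = np.array([1.0, 2.0, -1.0])   # one example

# Fold the bias into the weights by adding a constant-1 feature to x.
w_aug = np.append(w, b)          # [w_1, ..., w_D, b]
x_aug = np.append(x, 1.0)        # [x_1, ..., x_D, 1]

assert np.isclose(w @ x + b, w_aug @ x_aug)   # same activation either way
```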

Perceptron prediction algorithm:

    • a ← Σ_{d=1…D} w_d x_d + b // compute activation for the test example
    • return sign(a)
  • Standard Perceptron training algorithm (a Python sketch of prediction and training appears after this list):
    • w_d ← 0, for all d = 1…D // initialize weights
    • b ← 0 // initialize bias
    • for iter = 1…MaxIter do
      • for all (x,y) ∈ D do
        • a ← Σ_{d=1…D} w_d x_d + b // compute activation for this example
        • if y·a ≤ 0 then // if activation and label have opposite sign
          • w_d ← w_d + y·x_d, for all d = 1…D // update weights
          • b ← b + y // update bias
      • end for
    • end for
    • return w_1, …, w_D, b
  • Other variations that predict based on the final + intermediate parameters:
    • Require keeping track of the “survival time” c^(k) of each intermediate weight vector (how many examples it classified correctly before the next update)
    • The voted perceptron: ŷ = sign( Σ_k c^(k) · sign(w^(k)·x + b^(k)) )
    • The averaged perceptron: ŷ = sign( Σ_k c^(k) · (w^(k)·x + b^(k)) )
    • Averaged Perceptron with efficient decision (see the sketch after this list):
      • the decision is a combination of a weighted sum of the weight vectors and a weighted sum of the biases, so only running sums need to be stored
  • If the data is not linearly separable, the perceptron may never converge
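A minimal Python/numpy sketch of the prediction and standard training algorithms above. The function names (predict, train_perceptron) and the max_iter default are my own choices, not from the notes; it assumes labels in {-1, +1}.

```python
import numpy as np

def predict(w, b, x):
    """Perceptron prediction: the sign of the activation a = w·x + b."""
    a = np.dot(w, x) + b
    return 1 if a >= 0 else -1

def train_perceptron(D, max_iter=100):
    """Standard perceptron training; D is a list of (x, y) pairs with y in {-1, +1}."""
    dim = len(D[0][0])
    w = np.zeros(dim)                 # initialize weights
    b = 0.0                           # initialize bias
    for _ in range(max_iter):
        # In practice, shuffle D each pass (random order works better).
        for x, y in D:
            a = np.dot(w, x) + b            # activation for this example
            if y * a <= 0:                  # error: activation has the wrong sign
                w = w + y * np.asarray(x)   # update weights
                b = b + y                   # update bias
    return w, b
```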
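And a sketch of the averaged perceptron with the efficient decision, using the standard running-sum ("cached weights") trick so the intermediate weight vectors never need to be stored explicitly; the names u, beta, c are illustrative.

```python
import numpy as np

def train_averaged_perceptron(D, max_iter=100):
    """Averaged perceptron with the efficient decision: keep running,
    counter-weighted sums instead of every intermediate weight vector."""
    dim = len(D[0][0])
    w, b = np.zeros(dim), 0.0         # current weights and bias
    u, beta = np.zeros(dim), 0.0      # cached, counter-weighted sums
    c = 1.0                           # example counter
    for _ in range(max_iter):
        for x, y in D:
            x = np.asarray(x, dtype=float)
            if y * (np.dot(w, x) + b) <= 0:   # error: update both sets of parameters
                w = w + y * x
                b = b + y
                u = u + y * c * x
                beta = beta + y * c
            c += 1
    # Averaged parameters (equivalent to averaging all intermediate weight vectors).
    return w - u / c, b - beta / c
```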

Convergence of perceptron:

  • Theorem (Block and Novikoff):
  • If a training dataset D is linearly separable with margin γ by a unit-norm hyperplane w* (‖w*‖ = 1) with bias b*,
  • then the perceptron training algorithm converges after at most 1/γ² errors during training (assuming ‖x‖ ≤ 1 for all x ∈ D)
  • Margin of a dataset:
    • Distance between the separating hyperplane and the nearest point in the dataset
    • margin(D, w, b) = min_{(x,y)∈D} y(w·x + b) if (w, b) separates D, and −∞ otherwise
    • We want the hyperplane with the largest attainable margin on D: margin(D) = sup_{w,b} margin(D, w, b)
      • a larger margin leaves more room for error (see the numeric example at the end of this section)
  • Perceptron converges quickly when margin is large, slowly when margin is small
  • Bound does not depend on number of training examples, nor on number of features
  • The proof guarantees that the perceptron converges, but not necessarily to the max-margin separator (there are many possible separating hyperplanes)
  • Practical Implications
    • Sensitive to noise: No convergence or accuracy guarantee if the data is not linearly separable due to noise
    • Linear separability in practice
      • Data may be linearly separable in practice
      • Especially when the number of features >> number of examples
    • Overfitting:
      • Mitigated by early stopping and averaging
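A small numeric illustration of the margin and the mistake bound (the toy dataset and the hand-picked hyperplane are made up): with a unit-norm w and ‖x‖ ≤ 1, the margin is the minimum of y(w·x + b) over D, and the theorem bounds the number of training errors by 1/γ².

```python
import numpy as np

# Toy linearly separable dataset with ‖x‖ ≤ 1 (made-up values).
D = [(np.array([ 0.6,  0.2]), +1),
     (np.array([ 0.3,  0.7]), +1),
     (np.array([-0.5, -0.1]), -1),
     (np.array([-0.2, -0.6]), -1)]

# A unit-norm separating hyperplane, chosen by hand for this toy data.
w = np.array([1.0, 1.0]) / np.sqrt(2.0)
b = 0.0

# margin(D, w, b) = min over (x, y) of y·(w·x + b)
gamma = min(y * (np.dot(w, x) + b) for x, y in D)
print("margin gamma:", gamma)                       # ≈ 0.42 for this data
print("mistake bound 1/gamma^2:", 1.0 / gamma**2)   # ≈ 5.6 errors at most
```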