Description:

  • Made up of nodes or units, connected by links
  • Each link has an associated weight; each node has an activation level
  • Each node has an input function (typically a weighted sum of its inputs), an activation function, and an output
  • The mapping from input X to output Y can be non-linear, and training can still use stochastic gradient descent (see the sketch below)
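
A minimal sketch of a single unit, assuming a weighted-sum input function and a sigmoid activation (the function name and the values are illustrative, not from the notes):

    import math

    def unit_output(inputs, weights, bias):
        # Input function: weighted sum of the inputs plus a bias term.
        net = sum(w * x for w, x in zip(weights, inputs)) + bias
        # Activation function: a sigmoid squashing the net input into (0, 1).
        return 1.0 / (1.0 + math.exp(-net))

    # One unit with two inputs.
    print(unit_output(inputs=[0.5, -1.0], weights=[0.8, 0.2], bias=0.1))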

Activation functions:

  • e.g., the binary step activation function
  • There are many others, such as sigmoid, tanh, and ReLU (sketched below)
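
Hedged sketches of the step function and a few other common activations (the extra examples are standard choices, not taken from the notes):

    import numpy as np

    def binary_step(z):
        # 1 once the net input crosses the threshold 0, else 0.
        return np.where(z >= 0, 1.0, 0.0)

    def sigmoid(z):
        # Smooth step with range (0, 1); differentiable, unlike binary_step.
        return 1.0 / (1.0 + np.exp(-z))

    def tanh(z):
        # Zero-centered smooth step with range (-1, 1).
        return np.tanh(z)

    def relu(z):
        # Rectified linear unit, a common default in deep networks.
        return np.maximum(0.0, z)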

Multi-layer neural network:

Theorem 9 in CIML:

  • Two-Layer Networks are Universal Function Approximators
    • A two-layer neural network can approximate any continuous function, given enough neurons in the hidden layer
    • The approximation error can be made arbitrarily small (below any ε > 0), though not exactly 0 with finitely many hidden units
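
A paraphrase of the statement in LaTeX (see CIML Theorem 9 for the exact wording and conditions; the notation here is mine):

    % For any continuous function F on a compact domain D and any eps > 0,
    % there is a two-layer network \hat{F} with finitely many hidden units
    % whose approximation error stays below eps everywhere on D:
    \forall x \in D : \; \bigl| F(x) - \hat{F}(x) \bigr| < \varepsilon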

Expressiveness of NN:

  • Deeper layers of a trained network learn increasingly complex functions

Compositionality via mathematics:

  • Given a library of simple functions (for example sin, cos, exp, log)
  • If each node computes one of these functions, a node in the next layer can compute:
    • A linear combination: g(x) = α₁f₁(x) + α₂f₂(x) + …
    • A composition: g(x) = f₁(f₂(… fₙ(x) …))
      • Deep learning uses Hierarchical Compositionality (a toy sketch follows this list):
        • vision: pixels → edge → texton → motif → part → object
        • speech: sample → spectral band → formant → motif → phone → word
        • NLP: character → word → NP/VP/… → clause → sentence → story
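
A toy sketch of both constructions; the library below (sin, cos, exp) is illustrative:

    import math

    # An assumed library of simple functions.
    library = [math.sin, math.cos, math.exp]

    def linear_combination(fns, alphas):
        # Next-layer node: weighted sum a1*f1(x) + a2*f2(x) + ...
        return lambda x: sum(a * f(x) for a, f in zip(alphas, fns))

    def composition(fns):
        # Next-layer node: nested application f1(f2(...fn(x)...)).
        def g(x):
            for f in reversed(fns):
                x = f(x)
            return x
        return g

    node_a = linear_combination(library, [0.5, -1.0, 0.1])
    node_b = composition([math.sin, math.exp])   # computes sin(exp(x))
    print(node_a(0.3), node_b(0.3))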

SGD:

  • If the optimization objective is convex, SGD will reach the global minimum
  • If not, SGD still tends to perform well in practice, though it may settle in a local minimum
  • Either way, we need a differentiable loss function so that gradients can be computed (a minimal loop is sketched below)
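
A minimal SGD loop, assuming a differentiable loss; the 1-D linear model, squared loss, and learning rate are illustrative:

    import random

    def sgd(grad, w, data, lr=0.01, epochs=100):
        # Step against the gradient of the loss on one example at a time.
        for _ in range(epochs):
            random.shuffle(data)
            for x, y in data:
                w = w - lr * grad(w, x, y)
        return w

    # Fit y = w*x with squared loss L = (w*x - y)^2, so dL/dw = 2*(w*x - y)*x.
    data = [(x, 3.0 * x) for x in [-2.0, -1.0, 0.5, 1.0, 2.0]]
    grad = lambda w, x, y: 2.0 * (w * x - y) * x
    print(sgd(grad, w=0.0, data=data))   # approaches the true weight 3.0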

Training a neural network:

  • The backpropagation algorithm = gradient descent + the chain rule
  • It searches for a set of weight values that minimizes the total error of the network over the set of training examples
  • Training repeats the following two passes (see the sketch after this list):
    • forward pass: compute the outputs of all units in the network, and the error of the output layer
    • backward pass: the network error is used to update the weights
      • Starting at the output layer, the error is propagated backwards through the network, layer by layer.
      • This is done by recursively computing the local gradient of each neuron.
  • Gradient of the objective w.r.t. the output-layer weights: apply the chain rule directly, since these weights affect the output in one step
  • Gradient of the objective w.r.t. the hidden-unit weights: apply the chain rule through the output layer, reusing the local gradients already computed there
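
A sketch of both passes for a two-layer network, assuming sigmoid hidden units, a linear output, squared-error loss, and a single training example (all of these choices are mine, not from the notes):

    import numpy as np

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=3), 1.5            # one training example
    W = rng.normal(size=(4, 3))               # hidden-layer weights
    v = rng.normal(size=4)                    # output-layer weights

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    for step in range(200):
        # Forward pass: outputs of all units, then the output-layer error.
        h = sigmoid(W @ x)                    # hidden activations
        y_hat = v @ h                         # linear output unit
        err = y_hat - y                       # dL/dy_hat for L = 0.5*(y_hat - y)^2

        # Backward pass: propagate the error back, layer by layer (chain rule).
        grad_v = err * h                      # gradient w.r.t. output-layer weights
        delta_h = err * v * h * (1.0 - h)     # local gradient of each hidden unit
        grad_W = np.outer(delta_h, x)         # gradient w.r.t. hidden-unit weights

        # Gradient-descent update.
        v -= 0.1 * grad_v
        W -= 0.1 * grad_W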