# Data Analysis

## Neural Networks 11

1. Some materials are taken from the machine learning course of Victor Kitov

# Let's recall previous lecture¶

## Boosting, Ensembles¶

• Construction of multiple models to increase model quality
• In parallel (Bagging, Blending, Stacking, Random Forest)
• Sequentially (Boosting)
• Works great!
• Hard to interpret

# Neural Networks¶

## History¶

• Neural networks originally appeared as an attempt to model the human brain
• The human brain consists of many interconnected neuron cells

• the cerebral cortex (its largest part) is estimated to contain 15-33 billion neurons
• communication is performed by sending electrical and electro-chemical signals
• signals are transmitted through axons, long thin projections of neurons

## History¶

• 1943 – The first mathematical model of a neural network (Walter Pitts and Warren McCulloch)
• 1957 – Setting the foundation for deep neural networks (Frank Rosenblatt)
• 1965 – The first working deep learning networks
• 1979-80 – An ANN learns how to recognize visual patterns
• 1982 – The creation of the Hopfield Networks
• 1989 – Machines read handwritten digits (Yann LeCun)
• 1997 – Long short-term memory was proposed (Jürgen Schmidhuber and Sepp Hochreiter)
• 1998 – Gradient-based learning (Yann LeCun)
• 2012 – Creation of AlexNet
• 2014 – Generative Adversarial Networks (GAN)

## Simple model of a neuron¶

• A neuron gets activated in the half-space defined by $b+w_{1}x^{1}+w_{2}x^{2}+\dots+w_{D}x^{D}\ge0$.
• Each node is called a neuron
• Each edge is associated with a weight
• The constant term $b$ is the bias (sometimes referred to as $w_0$)
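
A minimal sketch of this threshold neuron in NumPy (the weights, bias, and input below are made-up illustration values):

```python
import numpy as np

w = np.array([0.5, -1.0])   # weights w_1, w_2 (illustrative)
b = 0.2                     # bias
x = np.array([1.0, 0.3])    # a single input vector

# the neuron is activated iff the point lies in the half-space b + w·x >= 0
activated = (b + w @ x) >= 0
print(activated)            # True for this particular example
```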

## Multilayer perceptron architecture¶

• Hierarchically nested set of neurons.
• Each node has its own weights.

This is the structure of a multilayer perceptron: a directed acyclic graph.

## Layers¶

• Structure of a neural network:
• 1 - input layer
• 2 - hidden layers
• 3 - output layer

## Continuous activations¶

• Pitfall of $\mathbb{I}[\cdot]$: it produces piecewise constant outputs, so gradient-based weight optimization methods become inapplicable.
• We can replace $\mathbb{I}[w^{T}x+w_{0}\ge0]$ with a smooth activation $f(w^{T}x+w_{0})$

## Typical activation functions¶

• sigmoidal: $\sigma(x)=\frac{1}{1+e^{-x}}$
• a 1-layer neural network with sigmoidal activation is equivalent to logistic regression
• hyperbolic tangent: $\tanh(x)=\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$

## Typical activation functions¶

• ReLU: $f(x)=[x]_{+}$.
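
For reference, these activations are one-liners in NumPy (a sketch; all three act elementwise):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # maps to (0, 1)

def tanh(x):
    return np.tanh(x)                  # (e^x - e^-x) / (e^x + e^-x), maps to (-1, 1)

def relu(x):
    return np.maximum(x, 0.0)          # [x]_+
```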

# Output Generation¶

## Output generation¶

• Forward propagation is the process of successively computing neuron outputs for given input features.

## Definition details¶

• Label each neuron with integer $j$.
• Denote: $I_{j}$ - input to neuron $j$, $O_{j}$ - output of neuron $j$
• Input to neuron $j$: $I_{j}=\sum_{k\in inc(j)}w_{kj}O_{k}+w_{0j}$,
• Output of neuron $j$: $O_{j}=f(I_{j})$.

• $w_{0j}$ is the bias term
• $f(x)$ is the activation function
• $inc(j)$ is the set of neurons whose outgoing edges enter neuron $j$
• in what follows we assume that each layer has a vertex with constant output $O_{const}\equiv1$, so the notation simplifies to
$$I_{j}=\sum_{k\in inc(j)}w_{kj}O_{k}$$
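
For a fully connected layer these per-neuron formulas collapse into one matrix product. A minimal sketch, assuming rows of `W` index the incoming neurons (including the constant-1 neuron) and columns index the neurons of the current layer:

```python
import numpy as np

def layer_forward(O_prev, W, f):
    """One step of forward propagation.

    O_prev : outputs O_k of the previous layer (with the constant-1 neuron), shape (K,)
    W      : weights w_{kj}, shape (K, J)
    f      : activation function
    """
    I = O_prev @ W        # I_j = sum_k w_{kj} O_k
    return f(I)           # O_j = f(I_j)
```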

## Activations at output layer¶

• Regression: $f(I)=I$ (linear activation)
• Classification:
• binary: $y\in\{+1,-1\}$ $$f(I)=p(y=+1|x)=\frac{1}{1+e^{-I}}$$
• multiclass: $y\in\{1,2,\dots,C\}$ $$f(I_{1},\dots,I_{C})=p(y=j|x)=\frac{e^{I_{j}}}{\sum_{k=1}^{C}e^{I_{k}}},\quad j=1,2,\dots,C$$ where $I_{1},\dots,I_{C}$ are the inputs of the output layer.
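
As a sketch, the multiclass output activation (softmax) can be implemented as follows; subtracting the maximum input is a standard numerical-stability trick and does not change the result:

```python
import numpy as np

def softmax(I):
    """p(y = j | x) = exp(I_j) / sum_k exp(I_k), where I are the output-layer inputs."""
    e = np.exp(I - np.max(I))   # shifting by a constant leaves the ratio unchanged
    return e / e.sum()
```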

## Generalizations¶

• each neuron $j$ may have custom non-linear transformation $f_{j}$
• weights may be constrained:
• non-negative
• equal weights
• etc.
• layer skips are possible
• Not considered here: RBF-networks, recurrent networks.

## Number of layers selection¶

• The number of layers usually counts all layers except the input layer (hidden layers + output layer)

• Classification:

• a single-layer network selects an arbitrary half-space
• a 2-layer network selects an arbitrary convex polyhedron (by intersecting 1-layer outputs)
• therefore it can approximate arbitrary convex sets
• a 3-layer network selects (by taking unions of 2-layer outputs) arbitrary finite sets of polyhedra
• therefore it can approximate almost any set with a well-defined volume

## Number of layers selection¶

• Regression:
• a single layer can approximate an arbitrary linear function
• a 2-layer network can model the indicator function of an arbitrary convex polyhedron
• a 3-layer network can uniformly approximate an arbitrary continuous function (as a weighted sum of indicators of convex polyhedra)
• Sufficient number of layers:
any continuous function on a compact set can be uniformly approximated by a 2-layer neural network with linear output, for a wide range of activation functions (excluding polynomials).
• In practice it is often more convenient to use more layers with a smaller total number of neurons
• the model becomes more interpretable and easier to fit.

# Neural network optimization¶

## Network optimization: regression¶

• Single output:
$$\frac{1}{N}\sum_{n=1}^{N}(\widehat{y}_{n}(x_{n})-y_{n})^{2}\to\min_{w}$$
• $K$ outputs:
$$\frac{1}{NK}\sum_{n=1}^{N}\sum_{k=1}^{K}(\widehat{y}_{nk}(x_{n})-y_{nk})^{2}\to\min_{w}$$
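
Both objectives reduce to the same NumPy one-liner (a minimal sketch; `y_hat` and `y` are assumed to be arrays of predictions and targets of shape (N,) or (N, K)):

```python
import numpy as np

def mse_loss(y_hat, y):
    """Mean squared error, averaged over objects and outputs."""
    return np.mean((y_hat - y) ** 2)
```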

## Network optimization: classification¶

• Two classes ($y\in\{0,1\}$):
$$\prod_{n=1}^{N}p(y_{n}=1|x_{n})^{y_{n}}(1-p(y_{n}=1|x_{n}))^{1-y_{n}}\to\max_{w}$$
• $C$ classes ($y_{nc}=\mathbb{I}\{y_{n}=c\}$):
$$\prod_{n=1}^{N}\prod_{c=1}^{C}p(y_{n}=c|x_{n})^{y_{nc}}\to\max_{w}$$
• In practice the log-likelihood is maximized, i.e. the cross-entropy is minimized
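
In code the maximized log-likelihood is usually written as a cross-entropy loss to be minimized; a sketch assuming `p` holds the predicted class probabilities and `y` the one-hot labels $y_{nc}$:

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    """Negative log-likelihood for C classes.

    p : predicted probabilities p(y_n = c | x_n), shape (N, C)
    y : one-hot labels y_{nc},                    shape (N, C)
    """
    return -np.mean(np.sum(y * np.log(p + eps), axis=1))
```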

## Neural network optimization¶

• Let $L(\widehat{y},y)$ denote the loss function of the network output
• We may optimize neural network using gradient descent:

    k = 0
    initialize w_0 randomly      # small values for sigmoid and tanh
    while stop criterion not met:
        w_{k+1} := w_k - alpha * grad(L(w_k))
        k := k + 1
• Standardization of features makes gradient descent converge faster

• But how exactly do we efficiently calculate $\nabla L(w_k)$?

# Backpropagation algorithm¶

## Definitions¶

• Denote by $w_{ij}$ the weight of the edge connecting the $i$-th and $j$-th neurons
• Define $\delta_j = \frac{\partial L}{\partial I_j} = \frac{\partial L}{\partial O_j}\frac{\partial O_j}{\partial I_j}$
• Since $L$ depends on $w_{ij}$ through the following functional relationship $L(w_{ij}) = L\left(O_j\left(I_j(w_{ij})\right)\right)$, using the chain rule we get: $$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial I_j}\frac{\partial I_j}{\partial w_{ij}} = \delta_j O_i$$ because $\frac{\partial I_j}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \left(\sum\limits_{k\in inc(j)} w_{kj} O_k\right) = O_i$, where $inc(j)$ is a set of neurons with outgoing edges to neuron $j$

## Output layer¶

• If neuron $j$ belongs to the output layer, then the error $\frac{\partial L}{\partial O_j}$ is calculated easily
• For output layer $\delta_j$ are calculated directly: $$\delta_j= \frac{\partial L}{\partial O_j}\frac{\partial O_j}{\partial I_j} = \frac{\partial L}{\partial O_j} f'(I_j) \qquad (1)$$
• Example (single point $x$ and true vector of outputs $(y_1,\dots,y_{|OL|})$):
• For $L = \frac{1}{2}\sum\limits_{j\in OL}(O_j - y_j)^2$ and sigmoid activation $$\frac{\partial L}{\partial O_j} = O_j - y_j$$
• Sigmoid activation function $O_j = \sigma(I_j)$: $$f'(I_j) = \sigma(I_j)(1-\sigma(I_j)) = O_j(1-O_j)$$
• finally $$\delta_j = (O_j - y_j)O_j(1-O_j)$$

## Inner layer¶

• If neuron $j$ belongs to some hidden layer, denote by $out(j) = \{k_1, k_2, \dots, k_m\}$ the set of all neurons receiving the output of neuron $j$ as their input
• The effect of $O_j$ on $L$ is fully absorbed by $I_{k_1},I_{k_2},\dots,I_{k_m}$, so $$\frac{\partial L(O_j)}{\partial O_j} = \frac{\partial L(I_{k_1},I_{k_2},\dots,I_{k_m})}{\partial O_j} = \sum\limits_{k\in out(j)} \left( \frac{\partial L}{\partial I_k} \frac{\partial I_k}{\partial O_j} \right) = \sum\limits_{k\in out(j)} \left(\delta_k w_{jk}\right)$$
• So for layers other than output layer we have: $$\delta_j = \frac{\partial L}{\partial I_j} = \frac{\partial L}{\partial O_j}\frac{\partial O_j}{\partial I_j} = \sum\limits_{k\in out(j)} \left(\delta_k w_{jk}\right) f'(I_j) \qquad (2)$$
• Weight derivatives are calculated using errors and outputs: $$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial I_j}\frac{\partial I_j}{\partial w_{ij}} = \delta_jO_i \qquad (3)$$

## Backprop¶

1. Forward propagate $x_n$ to the neural network, store all inputs $I_j$ and outputs $O_j$ for each neuron
2. Calculate $\delta_i$ for all $i \in$ output layer using $(1)$ $$\delta_j = \frac{\partial L}{\partial O_j} f'(I_j)$$
3. Propagate $\delta_i$ from final layer back layer by layer $(2)$ $$\delta_j = \sum\limits_{k\in out(j)} \left(\delta_k w_{jk}\right) f'(I_j)$$
4. Using calculated deltas and outputs calculate $\frac{\partial L}{\partial w_{ij}}$ with $(3)$ $$\frac{\partial L}{\partial w_{ij}} = \delta_jO_i$$ And update weights
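
A compact NumPy sketch of steps 1-4 for a network with one hidden layer, sigmoid activations and squared loss; the layer sizes, learning rate and sample point below are made-up illustration values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
D, H, C = 3, 5, 2                          # input, hidden, output sizes (illustrative)
W1, b1 = rng.normal(0, 0.1, (D, H)), np.zeros(H)
W2, b2 = rng.normal(0, 0.1, (H, C)), np.zeros(C)
alpha = 0.1                                # learning rate

x = rng.normal(size=D)                     # a single object x_n
y = np.array([1.0, 0.0])                   # its target vector

# 1. forward pass: store the inputs I_j and outputs O_j of every layer
I1 = x @ W1 + b1;  O1 = sigmoid(I1)
I2 = O1 @ W2 + b2; O2 = sigmoid(I2)

# 2. output layer: delta_j = dL/dO_j * f'(I_j), here L = 0.5 * sum_j (O_j - y_j)^2
delta2 = (O2 - y) * O2 * (1 - O2)

# 3. propagate deltas back: delta_j = (sum_k delta_k w_{jk}) * f'(I_j)
delta1 = (delta2 @ W2.T) * O1 * (1 - O1)

# 4. weight gradients dL/dw_{ij} = delta_j * O_i, then one gradient descent step
W2 -= alpha * np.outer(O1, delta2);  b2 -= alpha * delta2
W1 -= alpha * np.outer(x, delta1);   b1 -= alpha * delta1
```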

### Multiple local optima problem¶

• Optimization problem for neural nets is non-convex.
• Different optima will correspond to:

• different starting parameter values
• different training samples
• So we may solve the task many times under different conditions and then

• select the best model
• alternatively: average the obtained models to get an ensemble
• and/or use more sophisticated optimization methods

In [4]:
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# derivative of the sigmoid is at most 0.25 and vanishes away from 0
x = np.linspace(-10, 10, 1000)
gr_sigm = sigmoid(x) * (1 - sigmoid(x))

plt.plot(x, gr_sigm)

• The sigmoid gradient is at most $0.25$ and vanishes for large $|x|$, so gradients shrink as they are propagated back through many layers (the vanishing gradient problem). Common remedies:
• Feature scaling
• Careful weight initialization
• Using the ReLU activation function

## Model complexity and overfitting¶

• Constrain the model directly:
• constrain the number of neurons
• constrain the number of layers
• impose constraints on weights
• Or take a flexible model and regularize it:
• early stopping (with validation set)
• L2 regularization $$L(w) + \lambda\sum_i w_i^2$$
• Augmentation (more used in convnets)
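
The L2-regularized objective written out as code (a sketch; `base_loss`, `weights` and `lam` are illustrative names, and biases are usually left out of the penalty):

```python
import numpy as np

def l2_regularized_loss(base_loss, weights, lam=1e-3):
    """L(w) + lambda * sum_i w_i^2 over all weight matrices of the network."""
    return base_loss + lam * sum(np.sum(W ** 2) for W in weights)
```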

# Other things to know¶

## Weight Initialization¶

• Poor weight initialization can lead to the vanishing or exploding gradient problem
• The variance of one layer's outputs affects the variance of the next layer's
• Initialize weights so as to keep the variance constant across layers (Xavier initialization)
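
A minimal sketch of Xavier (Glorot) initialization, uniform variant, for one weight matrix; the variance $2/(\mathrm{fan\_in}+\mathrm{fan\_out})$ keeps the activation variance roughly constant from layer to layer:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    """Xavier/Glorot initialization: Var(w) = 2 / (fan_in + fan_out)."""
    if rng is None:
        rng = np.random.default_rng()
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))
```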

## Dropout technique¶

• We already have L1 and L2 regularization for the weights
• We can complement them with dropout
• Training: dropout can be interpreted as sampling a sub-network within the full neural network
• Testing: dropout is not applied
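
A sketch of one common formulation ("inverted" dropout, an assumption here since the slide does not fix a variant): during training each neuron output is kept with probability `p`, and dividing by `p` keeps the expected activation unchanged, so nothing needs rescaling at test time:

```python
import numpy as np

def dropout(O, p=0.5, train=True):
    """Inverted dropout applied to a layer's outputs O."""
    if not train:
        return O                        # testing: dropout is not applied
    mask = np.random.rand(*O.shape) < p # keep each neuron with probability p
    return O * mask / p                 # rescale so the expected output stays the same
```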

## Batch Normalization¶

• Normalize the data right before the non-linearities (or, debatably, right after them)
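
A sketch of the training-time forward pass of batch normalization (the learnable scale `gamma` and shift `beta` belong to the layer; the running statistics used at test time are omitted here):

```python
import numpy as np

def batch_norm(X, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then rescale and shift.

    X     : mini-batch of pre-activation values, shape (batch, features)
    gamma : learnable scale, shape (features,)
    beta  : learnable shift, shape (features,)
    """
    mu = X.mean(axis=0)
    var = X.var(axis=0)
    X_hat = (X - mu) / np.sqrt(var + eps)
    return gamma * X_hat + beta
```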