When I was learning how to use CNNs to tackle the Carvana competition on Kaggle, I came across ENet, which is used for semantic segmentation. I skimmed the paper, looked at some sample code online, and started writing my own implementation of ENet. If \(x\) is my input, then \(z=\mathrm{ENet}(x)\) is the set of logits I use to compute the sigmoid cross entropy. I was not sure whether I had made any mistakes in the code, but with Xavier initialization of the weights, the output \(z\) was really large: about \(10^{40}\) or more. Since ENet is composed of many "bottleneck" layers, I found that after each pass through a bottleneck layer, the scale of the output grew by roughly a factor of 10. So I tried "manually" dividing the output by 10, and ENet finally kind of worked (far from perfectly, but at least the loss was no longer NaN and it decreased during training).
Because of this, I would like to investigate how the scale of the logits affects training and performance, using the famous MNIST data. (I cannot find anything online about this topic, but I think it is a very common question.)
The logistic function maps an input \(z\in\mathbb{R}\) (the logit) to \(\sigma(z)\in (0,1)\). We can interpret \(\sigma(z)\) as the probability of the unit being turned on, since \(\sigma(z)\) is between 0 and 1. Note that \(\sigma(0)=0.5\), that \(\sigma(z)>0.5\) when \(z>0\), that \(\sigma(z)<0.5\) when \(z<0\), and that \(\sigma\) is increasing in \(z\).
If we scale \(z\) by a factor of \(\alpha\), the above properties do not change, but the probability becomes more (or less) sensitive to changes in \(z\).
For example, let \(z=1\). Then \(\sigma(0.1z)\approx 0.52,\,\sigma(z)\approx 0.73\) and \( \sigma(10z)\approx 0.99995\). A positive \(z\) as above means the unit is more likely to be turned on; the scale controls how much more likely. For \(\sigma(0.1z)\) it is only very slightly more likely, but for \(\sigma(10z)\) it is overwhelmingly likely that the unit is turned on. Refer to the graphs below.
Thus scaling the logits \(z\) changes the sensitivity of the output.
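As a quick illustration (this snippet is mine, not part of the original experiment code), we can evaluate the logistic function at a few scaled logits:

```python
import numpy as np

def sigmoid(z):
    # the logistic function
    return 1.0 / (1.0 + np.exp(-z))

z = 1.0
for alpha in [0.1, 1.0, 10.0]:
    print('sigma({:>4}*z) = {:.5f}'.format(alpha, sigmoid(alpha * z)))
# sigma( 0.1*z) = 0.52498
# sigma( 1.0*z) = 0.73106
# sigma(10.0*z) = 0.99995
```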
The softmax function \(y = softmax(z)\) is a generalization of the logistic function. Instead of thinking of the unit as being turned on or off, we say that the unit can be in one of two states, "on" or "off". The sum of the probability of the unit being "on" and the probability of it being "off" should be 1. Let \(y_{on} = \sigma(z)\) be the logistic output of the logit \(z\). Then we should have \(y_{off} = 1 - \sigma(z)\). Note that $$y_{on}=\sigma(z) = \frac{1}{1+e^{-z}} = \frac{e^z}{e^z + 1} = \frac{e^z}{e^z + e^0}$$ and $$y_{off}= 1 - \sigma(z) = 1 - \frac{1}{1+e^{-z}} = \frac{e^{-z}}{1 + e^{-z}} = \frac{1}{e^z + 1} = \frac{e^0}{e^z + e^0}$$
Thus we can think of the input \(z\) as corresponding to \(z_{on} = z\) and \(z_{off} = 0\), with $$ y_{on} = \frac{e^{z_{on}}}{e^{z_{on}} + e^{z_{off}}}\mbox{ and } y_{off} = \frac{e^{z_{off}}}{e^{z_{on}} + e^{z_{off}}}$$
The advantage of this point of view is that if the output has several states, instead of just two (on or off), we can use the same formula to convert numbers into probabilities, so that a larger input for a state corresponds to a larger probability of that state.
Let \(z_i\) be the input of the \(i\)th state and \(y_i\) the corresponding probability. Then we have $$ y_i = \frac{e^{z_i}}{\sum_j e^{z_j}} = \frac{e^{z_i}}{e^{z_1} + \ldots + e^{z_n}} \quad\quad\mbox{(number of states}=n)$$
Let's do an example. Say there are 4 states with \((z_1, z_2, z_3, z_4) = (2,2,4,-1)\). We can see that state 1 and state 2 should be equally likely, and state 3 is the most likely. We can use the above formula to calculate the probabilities \(y_i\) as below: $$e^{z_1} + e^{z_2} + e^{z_3} + e^{z_4} = e^2 + e^2 + e^4 +e^{-1}\approx 69.74$$ Thus $$y_1 \approx \frac{e^2}{69.74}\approx 0.106, \quad\quad y_2 \approx \frac{e^2}{69.74} \approx 0.106,\quad\quad y_3 \approx \frac{e^4}{69.74} \approx 0.783,\quad\quad y_4 \approx \frac{e^{-1}}{69.74}\approx 0.005$$
We can see that this agrees with our predictions. To simplify notation, we write \(y = (y_1, \ldots, y_n)\), \(z=(z_1,\ldots, z_n)\) and \(y = softmax(z)\). Note that both \(y\) and \(z\) are vectors here. We will still refer to \(z\) as the logits.
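As a sanity check (my own snippet, not from the original code), we can reproduce this example with NumPy:

```python
import numpy as np

def softmax(z):
    # subtract the max before exponentiating for numerical stability
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 2.0, 4.0, -1.0])
print(np.round(softmax(z), 3))   # [0.106 0.106 0.783 0.005]
print(softmax(z).sum())          # 1.0 (up to floating-point rounding)
```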
A nice property of softmax is that it is translation invariant, i.e. \(softmax(z + C) = softmax(z)\). It can be easily shown as $$softmax(z+C) = \left(\frac{e^{z_i+C}}{\sum_j e^{z_j+C}}\right) = \left(\frac{e^{z_i}e^C}{\sum_j e^{z_j}e^C}\right) = \left(\frac{e^{z_i}}{\sum_j e^{z_j}}\right)=softmax(z)$$
Of course \(C\) has to be the same across all states, i.e. \(C\) is broadcasted to \(z_j\).
Although the softmax function is translation invariant, it is not scale invariant. Just as with the logistic function, the scale controls the sensitivity of the probability output \(y_i\). Using the example above with \(z = (2,2,4,-1)\): scaling \(z\) by 0.1 flattens the probabilities to roughly \((0.25, 0.25, 0.31, 0.19)\), while scaling \(z\) by 10 pushes essentially all of the probability mass onto state 3.
Scaling by a large number gives an output very close to a hardmax (hence the name softmax), while scaling by a very small number brings the output close to the point where all states are equally likely.
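The following sketch (again my own illustration, not the original notebook code) checks the translation invariance and shows how the scale pushes softmax toward hardmax or toward the uniform distribution:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 2.0, 4.0, -1.0])

# translation invariance: softmax(z + C) == softmax(z)
print(np.allclose(softmax(z + 100.0), softmax(z)))   # True

# scaling is not invariant: it controls how "peaked" the output is
for alpha in [0.1, 1.0, 10.0]:
    print(alpha, np.round(softmax(alpha * z), 3))
# 0.1  [0.252 0.252 0.308 0.187]   close to uniform
# 1.0  [0.106 0.106 0.783 0.005]
# 10.0 [0.    0.    1.    0.   ]   close to hardmax
```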
After this short introduction, we can start with some theoretical analysis for the MNIST data. We let \(x\) be the input image, \(y\) the one-hot label, \(z = f(x;\theta)\) the logits produced by a network \(f\) with parameters \(\theta\), and \(\hat{y} = softmax(z)\) the predicted probabilities.
Then the loss function is given by $$L(x;\theta)=-\sum_i y_i \log \hat{y}_i$$
To reduce the loss, i.e. to train the model, we look at the derivatives of \(L\) with respect to \(\theta\). We first have $$\dfrac{\partial L}{\partial z} =\hat{y}- y$$ (Here both \(y\) and \(z\) are vectors.) Thus by the chain rule, we have $$\dfrac{\partial L}{\partial \theta} =(\hat{y} - y) \dfrac{\partial }{\partial \theta}\,f(x;\theta)$$ (Note that all of these are really vectors and matrices.)
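To convince ourselves of this formula, here is a small finite-difference check (my own NumPy sketch, independent of the TensorFlow code at the end):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    # cross entropy L = -sum_i y_i log softmax(z)_i
    return -np.sum(y * np.log(softmax(z)))

z = np.array([2.0, 2.0, 4.0, -1.0])
y = np.array([0.0, 0.0, 1.0, 0.0])          # one-hot label

analytic = softmax(z) - y                   # dL/dz = y_hat - y

# central finite differences
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(len(z)):
    dz = np.zeros_like(z)
    dz[i] = eps
    numeric[i] = (loss(z + dz, y) - loss(z - dz, y)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-6))    # True
```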
Now we scale the logits, i.e. we replace \(z\) by \(\alpha z\) and let $$\hat{\tilde{y}} = softmax(\alpha z), \quad\quad \tilde{L}(x;\theta) = -\sum_i y_i \log \hat{\tilde{y}}_i$$
We then have $$\dfrac{\partial \tilde{L}}{\partial z} =\alpha(\hat{\tilde{y}}- y)$$
This means $$\dfrac{\partial \tilde{L}}{\partial \theta} =\alpha(\hat{\tilde{y}} - y) \dfrac{\partial }{\partial \theta}\,f(x;\theta)$$
This means that if we scale our logits by a factor of \(\alpha\), the gradient is also (apparently) scaled by a factor of \(\alpha\). Of course it is not that simple: once we scale the logits, the probability vector \(\hat{\tilde{y}}\) changes, and thus the loss changes as well. It is very difficult to analyze how training would turn out, but if we ignore the change in the loss and just think of the gradient as being scaled by \(\alpha\), then, since we use the gradient for learning, scaling the logits by \(\alpha\) is similar to scaling the learning rate by \(\alpha\); to compensate, we would divide the learning rate by \(\alpha\). Again, this ignores the (possibly huge) effect of the change in the loss function.
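The following sketch (my own, with a hypothetical one-hot label chosen so that the prediction is wrong) shows both effects: the gradient grows roughly linearly with \(\alpha\), and the probability vector \(\hat{\tilde{y}}\) itself changes with \(\alpha\):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([2.0, 2.0, 4.0, -1.0])
y = np.array([1.0, 0.0, 0.0, 0.0])    # pretend state 1 is the true class, so the prediction is wrong

for alpha in [0.1, 1.0, 10.0]:
    grad = alpha * (softmax(alpha * z) - y)    # dL~/dz from the formula above
    print(alpha, np.round(grad, 3))
# 0.1  [-0.075  0.025  0.031  0.019]
# 1.0  [-0.894  0.106  0.783  0.005]
# 10.0 [-10.     0.    10.     0.   ]
```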
With the most popular initializations of the weights \(\theta\), the scaled logits \(\alpha z\) (for large \(\alpha\)) may be very large, and thus the loss \(L\) is also very large. This is because $$\log \hat{y}_i =\log \frac{e^{z_i}}{\sum_j e^{z_j}} =\log\frac{1}{1+\sum_{j\neq i} e^{z_j - z_i}} =-\log\left(1+\sum_{j\neq i} e^{z_j - z_i}\right)$$
When \(z\) is replaced by \(\alpha z\), we have $$\log \hat{\tilde{y}}_i = -\log\left(1+\sum_{j\neq i} e^{\alpha(z_j - z_i)}\right)$$
Assume that \(\alpha(z_j - z_i)\) is positive and huge for some \(j\), say \(j=J\). Then $$-\log\left(1+\sum_{j\neq i} e^{\alpha(z_j - z_i)}\right)\approx -\log e^{\alpha(z_J - z_i)} = -\alpha (z_J - z_i).$$
This means that even if we adjusted our learning rate by the scale \(\alpha\) (i.e. divided it by \(\alpha\)), the decrease in \(L\) per step would be small compared with how large \(L\) starts out. Thus it may not be wise to adjust our learning rate when \(\alpha\) is large.
If \(\alpha\) is small, then we have $$\log \hat{\tilde{y}}_i = -\log\left(1+\sum_{j\neq i} e^{\alpha(z_j - z_i)}\right) \approx -\log\left(1+\sum_{j\neq i} 1\right) = -\log 10$$ (for MNIST's 10 classes), which is close to a constant. Thus \(L\) is very insensitive to changes, and the gradients are tiny. This seems to suggest that adjusting the learning rate (dividing it by \(\alpha\), which for small \(\alpha\) means increasing it) is necessary.
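A quick numeric illustration of the two regimes (my own sketch; the logits here are just random numbers, with \(n = 10\) classes as in MNIST):

```python
import numpy as np

def neg_log_prob(z, i):
    # -log softmax(z)_i = logsumexp(z) - z_i, computed stably
    m = np.max(z)
    return (m + np.log(np.sum(np.exp(z - m)))) - z[i]

rng = np.random.RandomState(0)
z = rng.randn(10)        # unscaled logits on the order of 1
i = 0                    # pretend class 0 is the true class

for alpha in [1e-3, 1.0, 1e3]:
    print(alpha, neg_log_prob(alpha * z, i))
# alpha = 0.001 : ~ log(10) ~ 2.30  (the loss is nearly constant, so gradients are tiny)
# alpha = 1     : ~ 1.70
# alpha = 1000  : ~ 477, i.e. roughly alpha * (z_J - z_i) for the largest logit z_J
```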
We anticipate that training would (hopefully) eventually bring the scaled logits into a comfortable range, no matter what \(\alpha\) is. This is because if our initial predictions are incorrect (if they were correct, why would we train?), then some logits have to flip sign (cross 0) during training. Thus if \(\alpha\) is huge, after training the network for a while we would expect our weights \(\theta\) to become very small, so that the unscaled logits \(z\) are small, \(\alpha z\) falls into the desired range, and the loss \(L\) falls into a reasonable range as well. This also means \(\frac{\partial f}{\partial \theta}\) should stay in the same order of magnitude. If we then adjust the learning rate as above, \(L\) should decrease accordingly. This seems to suggest that for a large scale \(\alpha\), we should not adjust the learning rate at the beginning, but we can adjust it once the scaled logits fall into the desired range.
After the above preliminary analysis, we are going to do some tests to see how the scaling of the logits would affect our training.
We do the following: for each architecture we multiply the logits by a fixed scale, train on MNIST, and record the test accuracy and the standard deviation of the scaled logits (see the code at the end of this post).
We investigate several architectures, including a fully connected net and three CNNs. We will be focusing on the fully connected net.
Fully Connected: 784 → 196 → 49 → 10, with ReLU activations on the hidden layers (function `fc` in the code below).
Convolutional Neural Network 1: two 5×5 conv layers (8 and 32 filters, tanh), each followed by 2×2 max pooling, then a 1×1 conv down to one channel and a fully connected layer to 10 logits (function `cnn1`).
Convolutional Neural Network 2: six 5×5 "valid" conv layers with 4 filters each (tanh), then a 1×1 conv down to one channel and a fully connected layer to 10 logits (function `cnn2`).
Convolutional Neural Network 3: a 3×3 conv layer (16 filters, tanh) and a 4×4 conv layer (64 filters, tanh), each followed by 2×2 max pooling, then a 1×1 conv down to one channel and a fully connected layer to 10 logits (function `cnn3`).
We will be testing different cases based on the following factors: the logit scale (scale sets A to D in the code), the base learning rate, whether the learning rate is adjusted by dividing it by the scale, the optimizer (gradient descent vs. Adam), and the random seed.
(Plots missing: four figure groups, each comparing "Learning Rate Adjusted" vs. "Learning Rate Unadjusted" at learning rates 0.001 and 0.0001.)

(Plots missing: learning rate 0.0001 vs. 0.000001.)

(Plots missing: fixed random state vs. random state.)

(Plots missing: Gradient Descent Optimizer vs. Adam Optimizer at learning rates 0.001 and 0.0001.)

(Plots missing: two figure groups comparing learning rate 0.001 vs. 0.0001.)

(Plots missing: learning rate 0.0001 vs. 0.000001.)

(Plots missing: four figure groups, each showing accuracy and the standard deviation of the scaled logits.)

(Plots missing: Scale Set A and Scale Set B, Gradient Descent Optimizer vs. Adam Optimizer.)

(Plots missing: training error and test error at learning rates 0.001 and 0.00001.)
Since training takes much longer for our CNN models, we only run 10 epochs, and we have just a few graphs shown below.
(Plots missing: accuracy and scaled-logits standard deviation for "Original", "Learning Rate = 0.1", "Learning Rate = 0.01", "Learning Rate = 0.0001", "Learning Rate Unadjusted", and "Scale Set B".)

(Plots missing: accuracy and scaled-logits standard deviation for "Original", "Learning Rate = 0.001", "Learning Rate Unadjusted", "Scale Set B", and "Random State".)

(Plots missing: accuracy and scaled-logits standard deviation for "Original", "Learning Rate = 0.001", "Learning Rate Unadjusted", "Scale Set B", and "Random State".)
We can close with some (vague) observations.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import tensorflow as tf
import time
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
df = pd.read_csv('../input/train.csv')
Xtrain, Xtest, ytrain, ytest = train_test_split(df.iloc[:,1:],
                                                df.iloc[:,0],
                                                train_size=0.98,
                                                test_size=0.02,
                                                random_state=0)
Xtrain=np.array(Xtrain).reshape(-1,28,28,1)
Xtest=np.array(Xtest).reshape(-1,28,28,1)
enc = OneHotEncoder()
ytrain= enc.fit_transform(np.array(ytrain).reshape(-1,1)).toarray()
ytest= enc.transform(np.array(ytest).reshape(-1,1)).toarray()
#Several network architectures: each takes the input placeholder x and returns a logit tensor.
def fc(x):
    #Fully connected: 784 -> 196 -> 49 -> 10
    n_train = tf.shape(x)[0]
    net = tf.reshape(x, [n_train, 784])
    net = tf.contrib.layers.fully_connected(net, 196, activation_fn=tf.nn.relu)
    net = tf.contrib.layers.fully_connected(net, 49, activation_fn=tf.nn.relu)
    return tf.contrib.layers.fully_connected(net, 10, activation_fn=None)

def cnn1(x):
    #Two conv+pool blocks, a 1x1 conv down to one channel, then a dense layer to 10 logits
    net = tf.layers.conv2d(x, 8, [5, 5], strides=(1,1), padding="valid", activation=tf.tanh)
    net = tf.layers.max_pooling2d(inputs=net, pool_size=[2, 2], strides=2)
    net = tf.layers.conv2d(net, 32, [5,5], strides=(1,1), padding="valid", activation=tf.tanh)
    net = tf.layers.max_pooling2d(inputs=net, pool_size=[2, 2], strides=2)
    net = tf.layers.conv2d(net, 1, [1,1], strides=(1,1), padding="same", activation=None)
    n_train = tf.shape(net)[0]
    net = tf.reshape(net, [n_train, 16])
    return tf.contrib.layers.fully_connected(net, 10, activation_fn=None)

def cnn2(x):
    #Six 5x5 "valid" conv layers, a 1x1 conv down to one channel, then a dense layer to 10 logits
    net = tf.layers.conv2d(x, 4, [5,5], strides=(1,1), padding="valid", activation=tf.tanh)
    net = tf.layers.conv2d(net, 4, [5,5], strides=(1,1), padding="valid", activation=tf.tanh)
    net = tf.layers.conv2d(net, 4, [5,5], strides=(1,1), padding="valid", activation=tf.tanh)
    net = tf.layers.conv2d(net, 4, [5,5], strides=(1,1), padding="valid", activation=tf.tanh)
    net = tf.layers.conv2d(net, 4, [5,5], strides=(1,1), padding="valid", activation=tf.tanh)
    net = tf.layers.conv2d(net, 4, [5,5], strides=(1,1), padding="valid", activation=tf.tanh)
    net = tf.layers.conv2d(net, 1, [1,1], strides=(1,1), padding="valid", activation=tf.tanh)
    n_train = tf.shape(net)[0]
    net = tf.reshape(net, [n_train, 16])
    return tf.contrib.layers.fully_connected(net, 10, activation_fn=None)

def cnn3(x):
    #Same structure as cnn1 with different filter counts and kernel sizes
    net = tf.layers.conv2d(x, 16, [3, 3], strides=(1,1), padding="valid", activation=tf.tanh)
    net = tf.layers.max_pooling2d(inputs=net, pool_size=[2, 2], strides=2)
    net = tf.layers.conv2d(net, 64, [4,4], strides=(1,1), padding="valid", activation=tf.tanh)
    net = tf.layers.max_pooling2d(inputs=net, pool_size=[2, 2], strides=2)
    net = tf.layers.conv2d(net, 1, [1,1], strides=(1,1), padding="same", activation=None)
    n_train = tf.shape(net)[0]
    net = tf.reshape(net, [n_train, 25])
    return tf.contrib.layers.fully_connected(net, 10, activation_fn=None)
def runnet(Xtrain, Xtest, ytrain, ytest,
           func,
           epochs = 5,
           scale = 1,
           base_lr = 0.001,
           learning_rate_adjusted = False,
           op = 'GD',
           seed=None):
    #Check Validity of the above defined functions
    if func not in [cnn1,cnn2,cnn3,fc]:
        print('Input Function Incorrect!')
        return
    if op not in ['GD','AD']:
        print('Optimizers are GradientDescentOptimizer ("GD") or AdamOptimizer ("AD")')
        return
    #Enable Reverse One-Hot By Multiplying numlist
    numlist=np.array([0,1,2,3,4,5,6,7,8,9])
    tf.reset_default_graph()
    tf.set_random_seed(seed)
    x = tf.placeholder(tf.float32, shape=[None, 28,28,1])
    y = tf.placeholder(tf.float32, shape = [None, 10])
    lr = tf.placeholder(tf.float32, shape = [])
    out = func(x) * scale #Scaled Logits
    loss=tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=out))
    if op == 'GD':
        train_step = tf.train.GradientDescentOptimizer(lr).minimize(loss)
    elif op == 'AD':
        train_step = tf.train.AdamOptimizer(lr).minimize(loss)
    bs = 1024 #batch_size
    #For plotting purposes
    acclist=[]
    stdlist=[]
    #trainlist=[]
    #testlist=[]
    if learning_rate_adjusted:
        rate = base_lr / scale
    else:
        rate = base_lr
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        #initial error
        er=sess.run(loss, feed_dict={x:Xtrain[:1], y:ytrain[:1], lr: 0})
        print('Running: Scale={}, BaseLR={}, LR_Adjusted={}, Seed={}, Optimizer={}'.format(
            scale, base_lr, learning_rate_adjusted, seed, op))
        print('Initial Logits StDev: ',np.std(sess.run(out, feed_dict={x:Xtest})))
        for i in range(epochs):
            tic = time.time()
            #ignore last incomplete batch
            for j in range(len(Xtrain)//bs):
                _, er=sess.run([train_step, loss], feed_dict={
                    x:Xtrain[j*bs:(j+1)*bs],
                    y:ytrain[j*bs:(j+1)*bs],
                    lr: rate})
                ertest, pred = sess.run([loss,out], feed_dict={x:Xtest, y:ytest})
                stdev = np.std(pred)
                #Prediction Before Reverse One-Hot
                pred=np.argmax(pred, axis=1)
                #Accuracy of Prediction After Reverse One-Hot
                acc=(pred==np.dot(ytest, numlist)).sum()/len(ytest)*100
                if j%3==0:
                    #print('Training Error: {:8.4f}\tTest Error: {:8.4f}\tAccuracy: {:5.2f}%'.format(er, ertest, acc))
                    acclist.append(acc)
                    stdlist.append(stdev)
                    #trainlist.append(er)
                    #testlist.append(ertest)
            toc = time.time()
            print('Epoch {}\tTraining Error: {:8.4f}\tTest Error: {:8.4f}\tAccuracy: {:5.2f}%\tTime: {:6.2f}s'.format(
                i+1,er, ertest, acc, toc- tic))
            tic = toc
    return acclist, stdlist#, trainlist, testlist
lrlist = [1e-2,1e-3,1e-4,1e-6]
bstr = ["1e-2","1e-3","1e-4","1e-6"]
scalelistlist = [[0.1,0.3,1,3,10],[0.01,0.1,1,10,100],[1e-8,1e-6,1e-4,1e-2,1],[1,100,1e4,1e6,1e8]]
sstr = ["A","B","C","D"]
#[0.1,0.3,1,3,10] #A
#[0.01,0.1,1,10,100] #B
#[1e-8,1e-6,1e-4,1e-2,1] #C
#[1,100,1e4,1e6,1e8] #D
op = "GD"
ep=10
adj=True
sd=0
for j in range(4):
    for k in range(4):
        ACC = []
        STD = []
        base_lr = lrlist[j]
        scalelist = scalelistlist[k]
        for scale in scalelist:
            acclist, stdlist = runnet(Xtrain, Xtest, ytrain, ytest,
                                      fc,
                                      base_lr=base_lr,
                                      epochs = ep,
                                      scale=scale,
                                      learning_rate_adjusted = adj,
                                      op='GD',
                                      seed=sd)
            ACC.append(acclist)
            STD.append(stdlist)
        plt.title('Fully Connected - Accuracy\nBaseLR={}, LR Adjusted={}, Epochs={}, Seed={}'.format(
            base_lr,adj,ep,sd))
        for i in range(len(ACC)):
            if len(ACC[i])>0:
                plt.plot(ACC[i], label='Scale ='+str(scalelist[i]))
        plt.legend()
        plt.show()
        plt.title('Fully Connected - Logits StDev\nBaseLR={}, LR Adjusted={}, Epochs={}, Seed={}'.format(
            base_lr,adj,ep,sd))
        for i in range(len(STD)):
            if len(STD[i][20:])>0:
                plt.plot(STD[i][20:], label='Scale ='+str(scalelist[i]))
        plt.legend()
        plt.show()