softmax output - Deep Learning in Python Prerequisites

What happens when our output has more than 2 classes? For example, MNIST or character recognition. With binary classification, we only needed 1 output node, because P(y=0 | x) = 1 - P(y=1 | x).

Recall that a0 and a1 can be either negative or positive. Since probabilities must be 0 or positive, we can enforce positivity by exponentiating a.

So we have 2 outputs, exp(a1) and exp(a0). How do we ensure these sum to 1?

Simply divide by exp(a1) + exp(a0).

Collecting the weight vectors for the K classes into a weight matrix, W = [w1 w2 … wK], we can simply write:

A = XW

Y = softmax(A)
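The two lines above can be sketched in NumPy as follows. This is a minimal sketch, assuming X is an N x D data matrix and W is a D x K weight matrix. The function names, the toy sizes, and the step that subtracts the row-wise maximum (a common way to avoid overflow in exp that does not change the result) are assumptions for illustration, not something taken from the text.

```python
import numpy as np

def softmax(A):
    # Subtracting the row-wise max leaves the result unchanged
    # but keeps exp() from overflowing for large activations.
    expA = np.exp(A - A.max(axis=1, keepdims=True))
    return expA / expA.sum(axis=1, keepdims=True)

def predict(X, W):
    # A = XW, Y = softmax(A); each row of Y sums to 1
    return softmax(X.dot(W))

# Toy sizes (assumed): N=5 samples, D=3 inputs, K=4 classes
rng = np.random.RandomState(0)
X = rng.randn(5, 3)
W = rng.randn(3, 4)
Y = predict(X, W)
```

Each row of Y can be read as a probability distribution over the K classes for one sample.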

Neurons

Sometimes, logistic regression is referred to as the “logistic unit” or neuron. Why? It has a few properties in common with the biological neuron. Appropriately, when you hook up a bunch of neurons / logistic units together, you get a neural network.

I discuss the similarity between digital neurons and biological neurons more in depth in my next book, Deep Learning in Python: Master Data Science and Machine Learning with Modern Neural Networks written in Python, Theano, and TensorFlow.

Exercise

The targets are an N x 1 array where the elements are values from 0..9. You will need to turn it into an indicator matrix of size N x K where the values are 0 or 1.

Write the code to initialize the weights W to come from a Gaussian-distributed array of samples, and write a function that takes in X and W and outputs a prediction Y.
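One possible starting point for this exercise is sketched below. The helper name to_indicator, the stand-in labels, and the 1/sqrt(D) scaling of the weights are assumptions for illustration; D = 784 corresponds to flattened 28 x 28 MNIST images.

```python
import numpy as np

def to_indicator(labels, K):
    # labels: length-N integer array with values in 0..K-1
    labels = np.asarray(labels).astype(int).ravel()
    N = len(labels)
    T = np.zeros((N, K))
    T[np.arange(N), labels] = 1   # one 1 per row, in the column of the label
    return T

labels = np.array([3, 0, 9, 3])   # stand-in for real MNIST labels
T = to_indicator(labels, K=10)

# Gaussian-distributed initial weights
D, K = 784, 10
W = np.random.randn(D, K) / np.sqrt(D)   # the 1/sqrt(D) scaling is an assumption
```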

Chapter 6: Maximum Likelihood Estimation

L = product_from_i=1..N { (1 / sqrt(2pi)) * exp( -(1/2) (xi - mu)² ) }

We would like to maximize L with respect to mu, i.e. maximize the likelihood over the entire training set. In practice, we take the log of the likelihood (usually just called the log-likelihood) and maximize that. These functions are usually easier to optimize after taking the log.
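As a quick numerical check of this idea, the sketch below evaluates the log-likelihood of the unit-variance Gaussian above on a grid of candidate values for mu and picks the best one. The simulated data, the grid, and the variable names are assumptions for illustration. The maximizing value should land next to the sample mean, which is what you get if you set the derivative of the log-likelihood to zero by hand.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(1000) + 3.0          # simulated samples with true mean 3, variance 1

def log_likelihood(mu, x):
    # log of prod_i (1/sqrt(2*pi)) * exp(-0.5 * (x_i - mu)^2)
    return -0.5 * len(x) * np.log(2 * np.pi) - 0.5 * ((x - mu) ** 2).sum()

candidates = np.linspace(0, 6, 601)            # grid of candidate mu values, step 0.01
ll = np.array([log_likelihood(mu, x) for mu in candidates])
best_mu = candidates[ll.argmax()]              # grid point with the highest log-likelihood
```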

Now it is easy to see why we use objective functions like the squared error.

With linear regression, we assumed y ~ N(w^Tx, some_variance). (In English, this means y is normally distributed with mean equal to w^Tx and variance equal to some_variance).

Another way of writing that is y = w^Tx + noise, where noise ~ N(0, some_variance).

What happens when you set up the likelihood and take the log? You simply get the squared error!

(Technically, you get the negative of the squared error)

Maximizing the likelihood is thus the same as minimizing the squared error objective.
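To spell out that step, here is the derivation written in LaTeX, using sigma^2 as shorthand for some_variance (the symbol is an assumption for this sketch):

```latex
\begin{aligned}
L &= \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\left( -\frac{(y_i - w^T x_i)^2}{2\sigma^2} \right) \\
\log L &= -\frac{N}{2}\log(2\pi\sigma^2)
     - \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - w^T x_i)^2
\end{aligned}
```

The first term does not depend on w, and the second is the squared error multiplied by a negative constant, so the w that maximizes log L is the w that minimizes the squared error.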

Sigmoid

We do something a little different for the sigmoid / logistic regression. The error isn’t really normally distributed, since the output can only be between 0 and 1. The likelihood here is more like a coin toss.

L = product_from_i=1..N { p^t(i) (1-p)^(1 - t(i)) }

For the output of a logistic:

L = product_from_i=1..N { y(i)^t(i) (1-y(i))^(1 - t(i)) }

If you took the negative-log of this you would get the cross-entropy error:

J = -sum_from_i=1..N { ti log(yi) + (1 - ti) log(1 - yi) }

Here I’m using t(i) and ti interchangeably due to limitations in the output format of this book.
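You can confirm numerically that the cross-entropy is the negative log of the likelihood. The targets and outputs below are made-up numbers for illustration only.

```python
import numpy as np

t = np.array([1, 0, 1, 1, 0])             # made-up binary targets
y = np.array([0.9, 0.2, 0.7, 0.6, 0.1])   # made-up sigmoid outputs

# Likelihood: product of y^t * (1 - y)^(1 - t) over all samples
likelihood = np.prod(y ** t * (1 - y) ** (1 - t))

# Cross-entropy error: J = -sum { t log(y) + (1 - t) log(1 - y) }
J = -(t * np.log(y) + (1 - t) * np.log(1 - y)).sum()
```

Comparing J with -np.log(likelihood) should give the same number up to floating-point error.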

Softmax

Whereas the sigmoid is like a coin toss, softmax is like rolling a die.

Since there are K possibilities (K output classes) in softmax, we need to consider all of them. Usually the target variables are represented by an indicator matrix, so t[i, k] = 1 if the ith sample belongs to the kth class, and 0 otherwise.

Note that this means both the targets and the model output, considering all data points simultaneously, would be matrices of size N x K.

Performing the same process as in the previous 2 sections, we would arrive at the objective function:

J = -sum_from_i=1..N { sum_from_k=1..K { t[i,k] * log(y[i,k]) } }

Exercise

Write a function to calculate the cost function for our MNIST example. It should look like this:

import numpy as np

def cost(T, Y):
    # T: N x K indicator matrix of targets, Y: N x K matrix of softmax outputs
    return -(T * np.log(Y)).sum()

Chapter 7: Gradient Descent

Now that we have our likelihood functions, what comes next? Recall that for linear regression, we were able to solve for w directly in terms of X and Y.

Because of the “nonlinearity” (the sigmoid) we used in logistic regression, this is no longer possible. Instead, we use a more general numerical optimization technique called “gradient descent”.

I am simply going to provide you with the solution, but I would highly recommend teaching yourself how to derive it.

dJ / dw = X^T(Y - T)

Here X, Y, and T are the full data matrices, and the same formula works for both sigmoid and softmax. Convince yourself that the right-hand side is a vector of size D x 1 when the output is a sigmoid, and a matrix of size D x K when the output is softmax.
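If you want to check the shapes on a computer as well as on paper, a small sketch like the following works. The sizes are arbitrary, and the Y matrices here are random placeholders rather than real model outputs, since only the shapes matter for this check.

```python
import numpy as np

N, D, K = 100, 5, 3          # arbitrary sizes chosen for illustration
rng = np.random.RandomState(0)
X = rng.randn(N, D)

# Sigmoid case: Y and T are both N x 1
Y1 = rng.rand(N, 1)                               # placeholder outputs
T1 = rng.randint(0, 2, size=(N, 1))               # placeholder binary targets
grad_sigmoid = X.T.dot(Y1 - T1)

# Softmax case: Y and T are both N x K
YK = rng.rand(N, K)                               # placeholder outputs
TK = np.eye(K)[rng.randint(0, K, size=N)]         # placeholder indicator matrix
grad_softmax = X.T.dot(YK - TK)
```

Here grad_sigmoid.shape comes out as (D, 1) and grad_softmax.shape as (D, K), matching the shape of the weights in each case.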

Exercise

Write a complete logistic regression classifier that can do both learning and prediction, and use it on a dataset like MNIST to see what accuracy you can get.

It should look something like this:

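def grad(Y, T, X):
    # gradient of the cross-entropy cost with respect to W
    return X.T.dot(Y - T)

# epochs, learning_rate, predict, X, T and W are defined elsewhere in your classifier
for i in range(epochs):
    Y = predict(X, W)
    W -= learning_rate * grad(Y, T, X)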

Also keep track of the cost at each iteration and plot it using matplotlib at the end. You should notice a steep decrease at the beginning but it should level out quickly.

Write a function to do linear regression with gradient descent. You should be able to find the derivative of the squared error quite easily since it’s just a quadratic.
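One way this second exercise could look is sketched below, assuming the squared error J = sum (Xw - y)², whose derivative is dJ/dw = 2 X^T (Xw - Y). The simulated data, the weights to recover, the learning rate, and the number of iterations are all assumptions for illustration.

```python
import numpy as np

rng = np.random.RandomState(0)
N, D = 200, 3
X = rng.randn(N, D)
true_w = np.array([1.0, -2.0, 0.5])            # made-up weights to recover
Y = X.dot(true_w) + 0.1 * rng.randn(N)         # y = w^Tx + noise

w = rng.randn(D)                               # Gaussian-distributed starting point
learning_rate = 0.001                          # assumed value, tune for your data
costs = []
for i in range(1000):
    delta = X.dot(w) - Y
    costs.append(delta.dot(delta))             # squared error at this iteration
    w -= learning_rate * 2 * X.T.dot(delta)    # step against the gradient
```

After the loop, w should be close to true_w, and plotting costs should show the steep initial decrease followed by a plateau.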

Chapter 8: The XOR and Donut Problems

Logistic regression is really great for problems that have a linear boundary, since the weights define a line or a plane. There are some classical problems that linear classifiers can’t solve in their basic form, but I will show you how to modify logistic regression in order to do so.

First, let’s look at the XOR problem:

XOR is a logic gate like AND and OR. The outputs are defined as follows:

0 XOR 0 = 0
1 XOR 0 = 1
0 XOR 1 = 1
1 XOR 1 = 0

As you can see, there is no line that separates the two classes.

Now let us look at the donut problem:

As you can see, this problem also contains no linear boundary between the two classes.

So what do we do in these two cases?

For the XOR problem, create a third dimension that is derived from the first two, x3 = x1x2. Try to draw a 3-D plot to see how a plane could separate the two classes now.
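The effect of the extra dimension can also be checked in code. The sketch below trains a sigmoid logistic regression with gradient descent on the four XOR points, once with the raw inputs and once with x3 = x1x2 added. The bias column, the zero initial weights, the learning rate, the number of epochs, and the function names are assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1 / (1 + np.exp(-a))

def train(X, T, learning_rate=0.5, epochs=5000):
    # Gradient descent on the cross-entropy cost, starting from zero weights
    w = np.zeros(X.shape[1])
    for i in range(epochs):
        Y = sigmoid(X.dot(w))
        w -= learning_rate * X.T.dot(Y - T)
    return w

def accuracy(X, T, w):
    # Fraction of points where the thresholded output matches the target
    return np.mean((sigmoid(X.dot(w)) > 0.5) == T)

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
T = np.array([0, 1, 1, 0])
ones = np.ones((4, 1))                               # bias column

X_plain = np.hstack([ones, X])                       # [1, x1, x2]
X_feat = np.hstack([ones, X, X[:, :1] * X[:, 1:]])   # [1, x1, x2, x1*x2]

acc_plain = accuracy(X_plain, T, train(X_plain, T))
acc_feat = accuracy(X_feat, T, train(X_feat, T))
```

With only x1 and x2 the classifier cannot get all four points right, while with the product feature it can separate them.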

For the donut problem, we see that the radius is a discriminating feature. So if we created a third dimension x3 = sqrt(x1² + x2²), we would be able to draw a plane between the two classes.
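A sketch of how donut data might be generated, and how the radius feature separates it, is shown below. The two radii, the noise level, the sample count, and the threshold halfway between the radii are assumptions for illustration; no classifier is trained here, the code only checks that a plane at constant x3 splits the classes.

```python
import numpy as np

rng = np.random.RandomState(0)
N = 1000
R_inner, R_outer = 5.0, 10.0          # assumed radii for the two classes

# Radii are the class radius plus Gaussian noise; angles are uniform
r = np.concatenate([R_inner + rng.randn(N // 2), R_outer + rng.randn(N // 2)])
theta = 2 * np.pi * rng.rand(N)
X = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
T = np.array([0] * (N // 2) + [1] * (N // 2))

# Derived feature x3 = sqrt(x1^2 + x2^2)
x3 = np.sqrt((X ** 2).sum(axis=1))

# A plane at constant x3, halfway between the two radii, splits the classes
predictions = (x3 > (R_inner + R_outer) / 2).astype(int)
acc = np.mean(predictions == T)
```

Because the noise makes the two rings overlap slightly, the accuracy is typically close to, but not exactly, 1.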

What is the disadvantage of this?

We have to manually come up with features!

In practice, there are just way too many to consider.

Think about the street view house numbers dataset, where D = 3072. We would consider x1x2, then x1x3, then x1x4, etc…

This can also lead to overfitting.

The great advantage of deep learning and neural networks is that they automatically find features for us.

Exercise

Write code to generate the data for the XOR problem and the donut problem.

Use the logistic regression classifier with no hand-crafted features to prove to yourself that this yields a low classification rate.

Next, add the features I’ve described in this chapter and show that you can achieve almost-perfect (or perfect in the case of XOR) discrimination.

Conclusion

Do you want to learn more about deep learning? Perhaps online courses are more your style. I happen to have a few of them on Udemy.

this book, I take you through the basics of Theano and TensorFlow - creating functions, variables, and expressions, and build up neural networks from scratch. I teach you about ways to accelerate the learning process, including batch gradient descent, momentum, and adaptive learning rates. I also show you live how to create a GPU instance on Amazon AWS EC2, and prove to you that training a neural network with GPU optimization can be

In this course I teach the theory of logistic regression (our computational model of the

https://www.udemy.com/sql-for-marketers-data-analytics-data-science-big-data

Finally, I am always giving out coupons and letting you know when you can get my stuff for free. But you can only do this if you are a current student of mine! Here are some ways I notify my students about coupons and free giveaways:

My newsletter, which you can sign up for at http://lazyprogrammer.me (it comes with a free 6-week intro to machine learning course)

My Twitter, https://twitter.com/lazy_scientist

My Facebook page, https://facebook.com/lazyprogrammer.me (don’t forget to hit “like”!)
