Autoencoders and unsupervised learning - Large Scale Machine Learning with Python

Up until now, we discussed neural networks with multiple layers and a wide variety of parameters to optimize. The current generation of neural networks that we often refer to as deep learning is capable of more; it is capable of learning new features automatically so that very little feature engineering and domain expertise is required.

These features are learned by unsupervised methods on unlabeled data and are then fed into a subsequent layer in a neural network. This method is referred to as (unsupervised) pretraining. This approach has proven highly successful in image recognition, language learning, and even vanilla machine learning projects.

The most important and dominant techniques in recent years are denoising autoencoders and algorithms based on Boltzmann machines. Boltzmann machines, which were the building blocks for Deep Belief Networks (DBN), have lately fallen out of favor in the deep learning community because they turned out to be hard to train and optimize. For this reason, we will focus only on autoencoders. Let's cover this important topic in small manageable steps.

Autoencoders

We try to find a function F whose output reproduces its input with the least possible error: F(x) = x' ≈ x. This function is commonly referred to as an approximation of the identity function, which we optimize so that x' is as close as possible to x. The difference between x and x' is referred to as the reconstruction error.

Neural Networks and Deep Learning

Let's look at a simple single-layer architecture to get an intuition of what's going on.

We will see that these architectures are very flexible and need careful tuning:

[Figure: Single-layer autoencoder architecture]

It is important to understand that when we have fewer units in the hidden layer than the input space, we force the weights to compress the input data.

In this case, we have a dataset containing five features. In the middle is a hidden layer containing three units (Wij). These units have the same property as the weight vectors that we have seen in neural networks; namely, they are made up of weights that can be trained with backpropagation. From the output of the hidden layer, we get the feature representations by the same feedforward vector operations.

Chapter 4

The process of calculating the vector x' is quite similar to what we have seen with forward propagation, where we calculate the dot products with the weight vectors of each layer:

W=the weights
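As a rough illustration of the forward pass described above, the following NumPy sketch encodes a five-feature input into three hidden units and decodes it again. The names `W1`, `b1`, `W2`, `b2`, `encode`, and `decode`, the sigmoid activation, and the random untrained weights are assumptions made for this illustration; they are not taken from the book's code.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_hidden = 5, 3        # sizes from the example in the text
x = rng.random(n_features)         # one made-up input vector

# Hypothetical, randomly initialised parameters (not trained).
W1 = rng.normal(scale=0.1, size=(n_hidden, n_features))   # encoder weights
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_features, n_hidden))   # decoder weights
b2 = np.zeros(n_features)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def encode(v):
    """Hidden representation: dot product with the encoder weights, then activation."""
    return sigmoid(W1 @ v + b1)


def decode(h):
    """Reconstruction: dot product with the decoder weights, then activation."""
    return sigmoid(W2 @ h + b2)


h = encode(x)          # compressed representation with 3 values
x_rec = decode(h)      # reconstruction with 5 values
print(h.shape, x_rec.shape)
```

Training would adjust the weights with backpropagation so that the reconstruction moves closer to the input; the sketch only shows the feedforward step.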

The reconstruction error can be measured with the squared error or in cross-entropy form, which we have seen in so many other methods. In this case, ŷ represents the reconstructed output and y the true input:
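For reference, the two error measures are commonly written as follows, using the symbols from the text (ŷ for the reconstructed output, y for the true input). The index k over the n input features and the factor 1/2 are notational assumptions, and the cross-entropy form assumes that all values lie in [0, 1].

```latex
E_{\text{squared}} = \frac{1}{2} \sum_{k=1}^{n} \left( \hat{y}_k - y_k \right)^2
\qquad
E_{\text{cross-entropy}} = -\sum_{k=1}^{n} \left[ y_k \log \hat{y}_k + (1 - y_k) \log\left( 1 - \hat{y}_k \right) \right]
```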

An important notion is that, with only one hidden layer and linear activations, the dimensions in the data captured by the autoencoder model approximate the results of Principal Component Analysis (PCA). However, an autoencoder behaves quite differently if there is non-linearity involved. The autoencoder can then detect latent factors that PCA, being a linear method, cannot detect. Now that we know more about the architecture of the autoencoder and how we can calculate the error from its identity approximation, let's look at the sparsity parameters with which we compress the input.

You might ask: why do we even need this sparsity parameter? Can't we just run the algorithm to find the identity function and move on?

Unfortunately, it is not quite that simple. There are cases where the learned function reproduces the input almost perfectly but still fails to extract the latent dimensions of the input features. In that case, the function simply memorizes the input data instead of extracting meaningful features. We can do two things. First, we can deliberately add noise to the signal (denoising autoencoders) and second, we can introduce a sparsity parameter, forcing the deactivation of weakly-activated units. Let's first look at how sparsity works.


We discussed the activation threshold of a biological neuron; we can think of a neuron as being active if its output value is close to 1 and inactive if its output value is close to 0. We can constrain the neurons to be inactive most of the time by increasing the activation threshold. We do this by decreasing the average activation probability of each neuron/unit. The following formula shows how the average activation of each hidden unit is computed:

$$\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^{m} a_j^{(2)}\left(x^{(i)}\right)$$

ρ̂_j: The average activation of hidden unit j over the training set.

ρ: The desired average activation of the network (the sparsity target), which we specify upfront. In most cases, this value is set at .05.

a_j^(2)(x^(i)): The activation of hidden unit j (in layer 2, the hidden layer) for the i-th training example.

m: The number of training examples.

Here, we see an opportunity for optimization by penalizing a training round on the difference between ρ̂_j and ρ.
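To make the idea of penalizing the gap between the average activation of each hidden unit and the target ρ concrete, here is a minimal NumPy sketch. It assumes the Kullback-Leibler divergence between two Bernoulli distributions as the penalty, which is a common choice for sparse autoencoders but is not confirmed by the text; the helper names and the activation values are made up for illustration.

```python
import numpy as np


def average_activation(hidden):
    """Mean activation of each hidden unit over the m examples (rows)."""
    return hidden.mean(axis=0)


def kl_sparsity_penalty(rho, rho_hat, eps=1e-8):
    """Sum over hidden units of KL(rho || rho_hat_j) for Bernoulli variables.

    Assumed penalty form; the text does not specify which one is used.
    """
    rho_hat = np.clip(rho_hat, eps, 1.0 - eps)   # avoid log(0)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))


# Made-up activations in [0, 1]: m = 4 examples (rows), 3 hidden units (columns).
hidden = np.array([[0.05, 0.90, 0.10],
                   [0.05, 0.80, 0.00],
                   [0.05, 0.95, 0.10],
                   [0.05, 0.75, 0.00]])

rho = 0.05                                   # desired average activation
rho_hat = average_activation(hidden)         # one value per hidden unit
penalty = kl_sparsity_penalty(rho, rho_hat)  # grows as units drift from rho
print(rho_hat)
print(penalty)
```

In this toy data only the second unit, which is active most of the time, contributes noticeably to the penalty; the other two units already match the target on average.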

In this chapter, we will not worry too much about the technical details of this optimization objective. In most packages, we can use a very simple instruction to do this (as we will see in the next example). The most important thing to understand is that with autoencoders, we have two main learning objectives: minimizing the error between the input vector x and the output vector x' by optimizing the identity function, and minimizing the difference between the desired activation level and the average activation of each neuron in the network.

The second way in which we can force the autoencoder to detect latent features is by introducing noise in the model; this is where the name denoising autoencoders comes from. The idea is that by corrupting the input, we force the autoencoder to learn a more robust representation of the data. In the upcoming example, we will simply introduce Gaussian noise to the auto-encoder model.
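The corruption step can be sketched in a few lines of NumPy. The function name `corrupt_with_gaussian_noise` and the toy data are hypothetical, and interpreting the noise level as a standard deviation is an assumption made for this illustration.

```python
import numpy as np


def corrupt_with_gaussian_noise(X, noise_std=0.1, seed=0):
    """Return a copy of X with zero-mean Gaussian noise added to every entry."""
    rng = np.random.default_rng(seed)
    return X + rng.normal(loc=0.0, scale=noise_std, size=X.shape)


X_clean = np.linspace(0.0, 1.0, 12).reshape(3, 4)   # made-up inputs
X_noisy = corrupt_with_gaussian_noise(X_clean, noise_std=0.1)

# A denoising autoencoder receives X_noisy as input but is scored against
# X_clean, so it cannot succeed by simply copying its input.
print(np.abs(X_noisy - X_clean).mean())
```

Because the target stays uncorrupted, the network has to learn structure that survives the noise, which is the robustness the text refers to.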

Really deep learning with stacked denoising autoencoders – pretraining for classification

With this exercise, you will set yourself apart from the many people who talk about deep learning and join the few who actually do it! Now we will apply an autoencoder to the mini version of the famous MNIST dataset, which can conveniently be loaded from within Scikit-learn. The dataset consists of pixel intensities of 8 x 8 images of handwritten digits. It has 1,797 items with 64 features each, and every record has a label containing the target digit from 0 to 9. So we have 64 features with a target variable consisting of 10 classes (digits from 0-9) to predict.
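A minimal sketch of loading the data with Scikit-learn follows. The 80/20 split and `random_state=0` are assumptions chosen so that the training set has the 1,437 rows that appear later in this section; the book's own loading and preprocessing code is not shown here and may differ. `train_test_split` is imported from `sklearn.model_selection`, its location in current Scikit-learn releases.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X, y = digits.data, digits.target
print(X.shape)               # (1797, 64)
print(digits.images.shape)   # (1797, 8, 8)

# Assumed 80/20 split; the exact split used in the book is not shown.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
print(X_train.shape, X_test.shape)   # (1437, 64) (360, 64)
```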

First, we train the stacked denoising autoencoder model with a sparsity of .9 and inspect the reconstruction error. We will use the results from deep learning research papers as a guideline for the settings. You can read this paper for more information (http://arxiv.org/pdf/1312.5663.pdf). However, we have some limitations because of the enormous computational load for these types of models. So, for this autoencoder, we use five layers with ReLU activation and compress the data from 64 features to 45 features:

I 2016-04-20 05:13:37 downhill.base:232 RMSProp 2639 loss=0.660185 err=0.645118

I 2016-04-20 05:13:37 downhill.base:232 RMSProp 2640 loss=0.660031 err=0.644968

I 2016-04-20 05:13:37 downhill.base:232 validation 264 loss=0.660188 err=0.645123

I 2016-04-20 05:13:37 downhill.base:414 patience elapsed!

I 2016-04-20 05:13:37 theanets.graph:447 building computation graph

I 2016-04-20 05:13:37 theanets.losses:67 using loss: 1.0 * MeanSquaredError (output out:out)

I 2016-04-20 05:13:37 theanets.graph:551 compiling feed_forward function

Now we have a new set of compressed features that we created with our autoencoder. Let's look closer at this new dataset:

X_dAE.shape

Output: (1437, 45)


Here, we can actually see that we have compressed the data from 64 to 45 features.

The new dataset is less sparse (meaning fewer zeroes) and numerically more continuous. Now that we have our pretrained data from the autoencoder, we can apply a deep neural network to it for supervised learning:

# By default, hidden layers use the relu transfer function, so we don't need to
# specify it. ReLU is a common choice for autoencoders.
# The theanets classifier also uses softmax by default, so we don't need to specify it.
import theanets

net = theanets.Classifier(layers=(45, 45, 45, 10))
autoe = net.train([X_dAE, y_train], algo='rmsprop', learning_rate=.0001,
                  batch_size=110, min_improvement=.0001, momentum=.9,
                  nesterov=True, num_updates=1000)

# Enjoy the rare pleasure of 100% accuracy on the training set.

OUTPUT:

I 2016-04-19 10:33:07 downhill.base:232 RMSProp 14074 loss=0.000000 err=0.000000 acc=1.000000

I 2016-04-19 10:33:07 downhill.base:232 RMSProp 14075 loss=0.000000 err=0.000000 acc=1.000000

I 2016-04-19 10:33:07 downhill.base:232 RMSProp 14076 loss=0.000000 err=0.000000 acc=1.000000

Before we use this neural network to make predictions on the test set, it is important that we apply the autoencoder model that we have trained to the test set:

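import numpy as np

# Note: this call continues training the autoencoder on the (unlabeled) test inputs
# before encoding them.
dAE_model = model.train([X_test], algo='rmsprop', input_noise=0.1,
                        hidden_l1=.001, sparsity=0.9, num_updates=100)
X_dAE2 = model.encode(X_test)
X_dAE2 = np.asarray(X_dAE2, 'float32')

Now let's check the performance on the test set: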

from sklearn.metrics import accuracy_score

final = net.predict(X_dAE2)
# accuracy_score expects the true labels first, then the predictions.
print(accuracy_score(y_test, final))

OUTPUT: 0.972222222222

We can see that the final accuracy of the model with auto-encoded features (.9722) outperforms the model without it (.9611).


Summary

In this chapter, we looked at the most important concepts behind deep learning together with scalable solutions.

We took away some of the black-boxiness by learning how to construct the right architecture for any given task and worked through the mechanics of forward propagation and backpropagation. Updating the weights of a neural network is a hard task; regular stochastic gradient descent can get stuck in local minima or overshoot. More sophisticated algorithms like momentum, ADAGRAD, RPROP and RMSProp can provide solutions. Even though neural networks are harder to train than other machine learning methods, they have the power of transforming feature representations and can approximate any continuous function (universal approximation theorem). We also dived into large scale deep learning with H2O, and even utilized the very hot topic of parameter optimization for deep learning.

Unsupervised pretraining with autoencoders can increase the accuracy of a deep network, and we walked through a practical example within the theanets framework to get there.

In this chapter, we primarily worked with packages built on top of the Theano framework. In the next chapter, we will cover deep learning techniques with packages built on top of the new open source framework TensorFlow.


In document: Large Scale Machine Learning with Python (pages 190-198)