Python Deep Learning (ISBN 9781786464453)


vision, AI, speech and data analysis
Valentino Zocca
About the Authors
and its applications. He currently holds the position of Head of Data Science at
www.PacktPub.com
Did you know that Packt offers eBook versions of every book published, with PDF
editorial process. To help us improve, please leave us an honest review on this book's
If you'd like to join our team of regular reviewers, you can e-mail us at
eBooks and videos in exchange for their valuable feedback. Help us be relentless in
Table of Contents
Chapter 1: Machine Learning – An Introduction
Medical
Autonomous car driving
Business
Pattern recognition
Speech production
A convolutional layer example with Keras to recognize digits
A convolutional layer example with Keras for cifar10
Early game playing AI
Implementing a Python Tic-Tac-Toe game
Training AI to master Go
Deep learning in Monte Carlo Tree Search
Policy gradients in AlphaGo
A supervised learning approach to games
great deal of public attention. Every day, deep-learning algorithms are used broadly
Starting with a quick recap of important machine learning concepts, the book will
delve straight into deep learning principles using scikit-learn. Moving ahead, you
, covers two of the most powerful and often-used architectures for unsupervised feature learning: autoencoders and restricted Boltzmann machines.
Chapter 5
cortex works and introduces convolutional layers, followed up with a descriptive
Chapter 6
Matplotlib
H2O .
scikit-learn
foundational understanding of machine learning concepts and some programming
pathnames, dummy URLs, user input, and Twitter handles are shown as follows:
The code above for drawing should be immediately clear; we just notice that the line
Choose from the drop-down menu where you purchased this book from.
Click on
Code Files
book's webpage at the Packt Publishing website. This page can be accessed by
entering the book's name in the
Search
to your Packt account.
using the latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at
PacktPublishing/Python-Deep-Learning
from our rich catalog of books and videos available at
PacktPublishing/
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required
Machine Learning – An Introduction
Different machine learning approaches
Steps involved in machine learning systems
Brief description of popular techniques/algorithms
Applications in real-life
A popular open source package
important business e-mails and some of which are unsolicited junk e-mails, or spam.
function

learning's recent successes are related to it being so effective in unsupervised
New data is created every day, very quickly, and labeling all the new data is
quite a laborious and time-consuming activity. One advantage of unsupervised
A simple example of reinforcement learning can be used to play the classic game of tic-tac-toe. In this case, each position on the board has an associated probability (a value), which is the probability of winning the game from that state, based on previous
Another example may be given by attempting to use machine learning to predict the
trajectory of a ball thrown from the ground up in the air (not perpendicularly) until
it reaches the ground again. Physics teaches us that the trajectory is shaped like a
parabola, and we expect that a good machine learning algorithm observing thousands
of such throws would come up with a parabola as a solution. However, if we were to
observe the throws more closely, we might notice that, due to turbulence in the air, the ball does not hold a steady trajectory but may be subject to
small perturbations. This is what we call "noise". A machine learning algorithm that
tried to model these small perturbations would fail to see the big picture, and produce
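One way to make this discussion of noise concrete is to fit polynomials of two different degrees to simulated throws. The sketch below is illustrative only (synthetic data and arbitrary degree choices, not code from this book); note that the higher-degree fit always achieves a lower error on the training points, precisely because it also models the noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "throws": a parabola y = -4.9 t^2 + 20 t plus measurement noise
t = np.linspace(0, 4, 50)
y_true = -4.9 * t**2 + 20 * t
y = y_true + rng.normal(0.0, 1.0, t.size)

# A degree-2 fit captures the big picture; a degree-15 fit starts
# modeling the small perturbations (the noise) as well
p2 = np.polyfit(t, y, 2)
p15 = np.polyfit(t, y, 15)

err2 = np.mean((np.polyval(p2, t) - y) ** 2)
err15 = np.mean((np.polyval(p15, t) - y) ** 2)
print(p2)           # leading coefficients close to -4.9 and 20
print(err2, err15)  # the higher-degree fit has the lower *training* error
```

The degree-2 fit recovers the parabola; the degree-15 fit "wins" on the training data only by chasing the perturbations, which is exactly what we do not want it to learn.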
Brief description of popular techniques/algorithms
three classes discussed at the beginning of the book, supervised learning,
The following is by no means meant to be an exhaustive list or a thorough
input data to predict a value, for example the cost of a house
In the bottom graph, the function has two local minima; therefore, depending on the initialization, the process may find the first local minimum, which is not the global minimum.
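The dependence on initialization can be seen in a few lines of code. This is a minimal sketch (the function, learning rate, and starting points are our own choices, not from the book): gradient descent on f(x) = x⁴ − 3x² + x, which has two minima, ends up in a different one depending on where it starts.

```python
def f_prime(x):
    # derivative of f(x) = x**4 - 3*x**2 + x
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, learning_rate=0.01, steps=2000):
    # repeatedly step downhill along the gradient
    for _ in range(steps):
        x -= learning_rate * f_prime(x)
    return x

# Two different initializations converge to two different minima
left = gradient_descent(-2.0)   # global minimum, near x = -1.30
right = gradient_descent(2.0)   # local minimum, near x = 1.13
print(round(left, 3), round(right, 3))
```

Starting on the right, the procedure settles into the local minimum and never sees the deeper one on the left.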
Decision trees

Another widely used supervised algorithm is the decision tree algorithm. A decision tree algorithm creates a classifier in the form of a "tree". A decision tree is comprised of decision nodes, where tests on specific attributes are performed, and leaf nodes, which indicate the value of the target attribute. Decision trees are a type of classifier that works by starting at the root node and moving down through the decision nodes
how very different the decision tree algorithm is from linear
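The root-to-leaf traversal described above fits in a few lines. The tree below is a hypothetical spam filter of our own invention (not an example from the book): internal nodes test an attribute, and leaves hold the target value.

```python
# A decision node tests an attribute; a leaf holds the target value.
tree = {
    "attribute": "contains_link",
    "branches": {
        True: {
            "attribute": "known_sender",
            "branches": {True: "not spam", False: "spam"},
        },
        False: "not spam",
    },
}

def classify(node, example):
    # Start at the root and move down through decision nodes
    # until a leaf node is reached
    while isinstance(node, dict):
        value = example[node["attribute"]]
        node = node["branches"][value]
    return node

print(classify(tree, {"contains_link": True, "known_sender": False}))
```

Learning such a tree from data is the hard part (choosing which attribute to test at each node); traversing it, as shown here, is trivial.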

indicate where the new franchises should be located and their corresponding delivery areas
Probabilistically, what most machine learning techniques try to evaluate is the probability p of a certain event.
For example, given the picture representing a digit (that is, a picture with a certain
into account the fact that the cancer is quite rare under 50, therefore the positivity of
The probability p(cancer), or, more in general, the probability p for the outcome we are trying to estimate, is called the prior probability, because it represents the probability of the event without any additional information.
At this point, we may wonder what would happen if we had more information, for
example if we performed a different test with different reliability, or knew some
information about the person such as recurrence of cancer in the family. In the
preceding equation we used, as one of the factors in the computation, the probability
, and, if we performed a second test and it came out positive, we would also have p(test2=positive | cancer). The naïve Bayes technique makes
the assumption that each piece of information is independent of each other (this
p(test1=pos and test2=pos | cancer) = p(test1=pos | cancer) * p(test2=pos | cancer)
This equation is also called the likelihood L(test1=pos and test2=pos), that is, the probability that both tests come out positive given the fact that the person does have cancer.
We can then rewrite the equation as:
p(cancer | both tests=pos)
  = p(both tests=pos | cancer) * p(cancer) / p(both tests=pos)
  = p(test1=pos | cancer) * p(test2=pos | cancer) * p(cancer) / p(both tests=pos)
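Putting numbers in makes the effect of the prior concrete. The figures below are made up for illustration (a rare condition and two imperfect tests; they are not rates quoted in the book):

```python
# Hypothetical rates: a rare condition and two tests that are assumed
# conditionally independent (the naive Bayes assumption)
p_cancer = 0.01
p_t1_pos_given_cancer = 0.9
p_t2_pos_given_cancer = 0.8
p_t1_pos_given_healthy = 0.1
p_t2_pos_given_healthy = 0.1

# Likelihoods under the independence assumption
likelihood_cancer = p_t1_pos_given_cancer * p_t2_pos_given_cancer
likelihood_healthy = p_t1_pos_given_healthy * p_t2_pos_given_healthy

# p(both tests = pos), by the law of total probability
p_both_pos = (likelihood_cancer * p_cancer
              + likelihood_healthy * (1 - p_cancer))

# Bayes' theorem: p(cancer | both tests = pos)
posterior = likelihood_cancer * p_cancer / p_both_pos
print(posterior)
```

Even with two positive tests, the posterior stays well below certainty (around 0.42 with these numbers), because the prior probability of the disease is so small.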

machines over other machine learning algorithms is that they not only separate the data into classes, but do so with a margin that improves the predictive ability of the algorithm. As we have discussed above, in practice it is
The kernel trick instead involves mapping the space of features into another space
the vector x of inputs (this holds quite in general). Hence, the set of vectors x satisfying

    a(x) = Σ_i w_i x_i = 0

defines a separating hyperplane (the bias is included in this formula), and we have said that the neuron activates if the activation is greater than 0:

    a(x) = Σ_i w_i x_i > 0

This is called the threshold activity, because the neuron activates when the activation is greater than the threshold 0.
news in 2016 by beating world Go champion Lee Sedol. In January 2016, AlphaGo
The strength of AlphaGo was that it had not been programmed to play, but it had learned by playing thousands of games against itself using reinforcement
learning and deep learning techniques. The ability to learn is what renders machine
algorithm chosen, such as
naive_bayes
pred = mlp.predict(data_test)
pictures was adapted from similar code by Sebastian Raschka in his
Python Machine
Learning
plt.ylim(y.min(), y.max())
# plot the data
the step function, which jumps from 0 to 1 at a point x_0:

    f(x) = 1 if x ≥ x_0, f(x) = 0 otherwise

and its smooth counterpart, the logistic sigmoid

    f(x) = 1 / (1 + e^(-x))

or, more generally, f(x) = 1 / (1 + e^(-p(x - x_0))), where p controls how steep the transition is. If we wanted to use the
logistic activation function, the code would modify to be:
activation = "logistic")
supervised learning, unsupervised learning, and reinforcement learning. Classical
and W. Pitts, A Logical Calculus of the Ideas Immanent in Nervous Activity.

the activation

    a = Σ_i x_i w_i

where x_i is the value of each input neuron, and w_i the weight of the connection. Including the bias b, the activation becomes

    a = Σ_i x_i w_i + b
absence of intra-layer connections, while neurons connect to each and every neuron
different types of

different types of activity function. The simplest is the threshold activity function,

    f(a) = 1 if a > 0
    f(a) = 0 if a ≤ 0

so called because the neuron activates only when its activation is greater than a certain value, the threshold. A smooth and widely used alternative is the logistic sigmoid

    f(a) = 1 / (1 + e^(-a))

or, more generally, f(a) = 1 / (1 + e^(-pa)), where the parameter p controls the steepness of the curve. A closely related choice is the hyperbolic tangent,

    f(a) = (e^a - e^(-a)) / (e^a + e^(-a))

which has the same shape but ranges between -1 and 1. Finally, the rectified linear unit has become popular more recently; like its smooth counterpart, it is a mix of the identity and the threshold function, and it is defined by

    f(a) = a if a > 0
    f(a) = 0 if a ≤ 0
sigmoid function. This can be quickly calculated by simply noticing that the derivative with respect to a of

    f(a) = 1 / (1 + e^(-a))

is

    f'(a) = e^(-a) / (1 + e^(-a))^2
          = (1 / (1 + e^(-a))) * (e^(-a) / (1 + e^(-a)))
          = f(a) * (1 - f(a))
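The identity f'(a) = f(a)(1 − f(a)) is easy to check numerically; a quick sketch comparing it with a finite-difference estimate of the derivative:

```python
import math

def f(a):
    # logistic sigmoid
    return 1.0 / (1.0 + math.exp(-a))

# Compare the analytic derivative f(a)*(1 - f(a)) with a
# central finite difference at a few sample points
eps = 1e-6
for a in (-2.0, 0.0, 0.5, 3.0):
    analytic = f(a) * (1.0 - f(a))
    numeric = (f(a + eps) - f(a - eps)) / (2 * eps)
    assert abs(analytic - numeric) < 1e-8
print("f'(a) = f(a)(1 - f(a)) checks out")
```

At a = 0 this gives the maximum slope f'(0) = 0.5 * 0.5 = 0.25, which is why the sigmoid is steepest at its center.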

the activation a = Σ_i w_i x_i + b. For a single input, the output of the neuron is 1 / (1 + e^(-(wx + b))); on the right is a sigmoid with weight 10 and bias 50. The larger the absolute value of the weight w, the closer the sigmoid gets to a step function. The center of the transition will be equal to the negative of the ratio b/w; that is, the sigmoid crosses 1/2 at x = -b/w, where it best approximates the step function.
mapped to the output neuron with weights w. We apply the logistic sigmoid activity function to each hidden neuron, and the identity function to the output neuron:
import numpy
import matplotlib.pyplot as plt

weightValue = 1000    # example values (assumed); modify weightValue
biasValue1 = 5000     # as well as biasValue1 and biasValue2 to observe
biasValue2 = -5000    # how the shape of the output changes

x = numpy.arange(-10, 10, 0.1)
y1 = 1.0/(1.0 + numpy.exp(-weightValue*x - biasValue1))
y2 = 1.0/(1.0 + numpy.exp(-weightValue*x - biasValue2))

print ("The step function starts at {0} and ends at {1}"
       .format(-biasValue1/weightValue,
               -biasValue2/weightValue))
plt.plot(x, y1 - y2)
plt.show()
the mean squared error cost

    J(w) = (1/2) Σ_i (y_i - t_i)^2

where y_i is the output of the network on the i-th example, t_i is the corresponding target, and the dimension of w matches the dimension of the input x. Since y_i = Σ_j w_j x_{i,j}, the derivative with respect to each weight w_j is

    ∂J/∂w_j = Σ_i (y_i - t_i) ∂y_i/∂w_j = Σ_i (y_i - t_i) x_{i,j}

where the subscript i indicates the i-th example and the subscript j indicates the j-th component. Each weight is then moved in the direction that decreases the cost,

    w_j → w_j - λ Σ_i (y_i - t_i) x_{i,j}

where λ is the learning rate; this is what is often called gradient descent.
Introducing the nabla operator

    ∇ = (∂/∂w_1, ..., ∂/∂w_n)

gives a compact way of writing the update rule for w, where, instead of writing the partial derivative for each of its components, in these cases the weights can be updated as

    w → w - λ ∇J(w)
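The update rule w ← w − λ∇J(w) can be sketched end to end on a tiny least-squares problem. Everything below is our own illustration (synthetic data, an arbitrary learning rate), not the book's code:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic targets generated by a known weight vector
X = rng.normal(size=(100, 3))
true_w = np.array([2.0, -1.0, 0.5])
t = X @ true_w

w = np.zeros(3)
learning_rate = 0.01
for _ in range(2000):
    y = X @ w                   # predictions
    grad = X.T @ (y - t)        # nabla J(w) = sum_i (y_i - t_i) x_i
    w -= learning_rate * grad   # w <- w - lambda * nabla J(w)

print(w)  # recovers something close to true_w
```

Because this cost surface is a single quadratic bowl, gradient descent here converges to the weights that generated the data; with the multi-minima surfaces discussed earlier, the same loop can instead get trapped.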
activation function is not going to be the identity function like before; rather, we are going to use the logistic sigmoid function. Therefore, for each example x, the probability that the
target t is 1 is given by the logistic sigmoid of the activation:

    P(t=1 | x, w) = σ(a),  with a = Σ_j w_j x_j

Since t is either 0 or 1, the probability of either outcome can be written in a single formula:

    P(t | x, w) = σ(a)^t (1 - σ(a))^(1-t)

Taking logarithms,

    log P(t | x, w) = t log σ(a) + (1 - t) log(1 - σ(a))

and maximizing the log-likelihood over all the examples is the same as minimizing the cross-entropy cost

    J(w) = -Σ_i [ t_i log σ(a_i) + (1 - t_i) log(1 - σ(a_i)) ]

Using the identity σ'(a) = σ(a)(1 - σ(a)), the derivative with respect to each weight takes a particularly simple form:

    ∂J/∂w_j = Σ_i (σ(a_i) - t_i) x_{i,j}

so the update rule is again

    w_j → w_j - λ Σ_i (σ(a_i) - t_i) x_{i,j}

We can generalize this equation using

    J(w) = -log P(y | x, w)

as the cost for any model that outputs a probability.
More generally, for a weight w_{i,j} connecting neuron i to neuron j, the derivative of the cost with respect to that weight has the form

    ∂J/∂w_{i,j} = δ_j x_i

with the subscript i belonging to the layer preceding the layer containing the element with subscript j.
In this example, layer 1 represents the input and layer 2 the output, and w_{i,j} is the weight connecting the i-th neuron of one layer to the j-th neuron of the next. By the chain rule,

    ∂J/∂w_{i,j} = (∂J/∂y_j) (∂y_j/∂a_j) (∂a_j/∂w_{i,j}) = (∂J/∂y_j) (∂y_j/∂a_j) y_i

since ∂a_j/∂w_{i,j} = y_i, the output of neuron i. The output of each neuron is a function of its activation value, and, of course, the cost function is a function of the activity function we have chosen, so both remaining derivatives are quantities we can calculate. If we define

    δ_j = (∂J/∂y_j) (∂y_j/∂a_j)

then ∂J/∂w_{i,j} = δ_j y_i. For the output layer, ∂J/∂y_j can be calculated directly from the cost function. For a hidden neuron y_i, we instead move backward and calculate

    ∂J/∂y_i = Σ_j (∂J/∂y_j) (∂y_j/∂a_j) w_{i,j} = Σ_j δ_j w_{i,j}

where the equation should read as the sum over all the outgoing connections from y_i to each neuron y_j in the successive layer. The deltas therefore propagate backward through the network,

    δ_i = (∂y_i/∂a_i) Σ_j δ_j w_{i,j}

and every weight is updated with the same rule:

    w_{i,j} → w_{i,j} - λ δ_j y_i

Let us now see how to apply these concepts and formulas.
neurons, and for the output layer only one neuron:
#Finally, we adjust the weights,
#using the backpropagation rules
for i in range(len(self.weights)):
    layer = y[i].reshape(1, nn.arch[i]+1)
    delta = delta_vec[i].reshape(1, nn.arch[i+1])
    self.weights[i] += learning_rate*layer.T.dot(delta)

This concludes our back-propagation algorithm; all that is left to do is to write a predict function to check the results:

def predict(self, x):
    val = numpy.concatenate((numpy.ones(1).T, numpy.array(x)))
    for i in range(len(self.weights)):
        val = self.activity(numpy.dot(val, self.weights[i]))
    return val
Notice the use of numpy.random.seed(0). This is simply to ensure that the weight
separating the regions will be quite different depending on the architecture chosen.
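The whole procedure fits in one short, self-contained script. The following is a condensed sketch in the spirit of the chapter's network (a small hidden layer with a bias unit, one output neuron), not the book's exact code; the learning rate and number of iterations are arbitrary choices:

```python
import numpy as np

np.random.seed(0)  # reproducible weight initialization

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# XOR inputs with a bias column fixed to 1, and their targets
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
t = np.array([[0.0], [1.0], [1.0], [0.0]])

W1 = np.random.randn(3, 4) * 0.5   # input (+ bias) -> 4 hidden neurons
W2 = np.random.randn(5, 1) * 0.5   # hidden (+ bias) -> 1 output neuron

def forward(W1, W2):
    h = sigmoid(X @ W1)
    h_b = np.hstack([h, np.ones((4, 1))])   # append bias to hidden layer
    return h, h_b, sigmoid(h_b @ W2)

losses = []
for _ in range(5000):
    h, h_b, y = forward(W1, W2)
    losses.append(float(np.mean((y - t) ** 2)))
    delta_out = y - t                               # output delta
    delta_hid = (delta_out @ W2[:4].T) * h * (1 - h)  # backpropagated delta
    W2 -= 0.5 * h_b.T @ delta_out                   # gradient descent updates
    W1 -= 0.5 * X.T @ delta_hid

_, _, y = forward(W1, W2)
print(np.round(y.ravel(), 2))
```

After training, the network's outputs approach the XOR targets (0, 1, 1, 0); with a different seed or architecture, the learned separating regions can look quite different, as noted above.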

, we introduced machine learning
and techniques that can be used to implement machine learning. In
Fundamental concepts of deep learning
Applications of deep learning
GPU versus CPU
Popular open source libraries
in Proceedings of
Neural Information Processing Systems (NIPS) (2012) and, at the end of their paper,
of the Ising model to deep learning. In

In the first figure, the first two neurons are on and their connections with the third neuron are positive, so the third neuron will also be on. In the second figure, the first two neurons are off and their connections with the third neuron are positive, so the third neuron will also be off. In the third figure, the first two neurons are on but their connections with the third neuron are negative, so the third neuron will be off. In the last figure, the first two neurons are off and their connections with the third neuron are large and negative, so the third neuron will likely be on.
The point of introducing this adaptation of the Ising model is to understand how
and large but negative connections with the others.
how we can select weights for our connections, to have the hidden
neurons start recognizing features of our input.
it can recognize features. Another, even more important, is that it will recognize
are the pixels that have an
x
neuron representing the mouth in the next layer will turn on anyway.
of abstraction of basic elements in the image and how they are structured. This
represent lines that intersect, and more complex representations by associating neurons that represent lines at
specific angles.
Autoencoders: a class of unsupervised learning algorithms in which the
Task                     Total hours   DNN           GMM-HMM with       GMM-HMM with
                         of data       (error rate)  same data          longer data
                                                     (error rate)       (error rate)
Switchboard (test 1)     309           18.5          27.4               18.6 (2,000 hrs)
Switchboard (test 2)     309           16.1          23.6               17.1 (2,000 hrs)
English Broadcast News   50            17.5          18.8               -
Bing Voice Search        24            30.4          36.2               -
Google Voice             5,870         12.3          -                  16.0 (>>5,870 hrs)
YouTube                  1,400         47.6          52.3               -
transformations of speech spectrograms. A spectrogram is a visual representation of
edges,
layer 5
shows entire objects.
Inference is the post-training phase where we deploy our trained
DNN. In a whitepaper published by GPU vendor Nvidia titled
GPU-Based Deep
Learning Inference: A Performance and Power Analysis
, available online at
why most popular open source libraries like Theano or TensorFlow allow you to
We refer to the Theano documentation for learning the first steps on how to use Theano, and we will implement some test code examples using Theano for deep learning in this book.

TensorFlow

a model, and we will use the
will use regular
, not sparse layers) specifying the number of input and output
neurons. For each layer, we specify the activity function of its neurons:
model.add(Dense(hidden_neurons, input_dim=input_size))
optimization (training rate, momentum, and so on). We are not going to modify the
default values, so we can simply pass:
neurons have learned:
import matplotlib.cm as cm
w = weights[0].T
for neuron in range(hidden_neurons):
cmap = cm.Greys_r)
plt.show()
The code for drawing above should be immediately clear; we just notice the line importing the matplotlib color maps. To add more hidden layers, we simply extend what we had before:
model.add(Dense(3000, input_dim=input_size))
model.add(Activation('sigmoid'))
model.add(Dense(2000, input_dim=3000))
introduce the readers to many of the concepts we have touched on in this one, like

Feature Learning


the code c(X) in the middle and the reconstruction X̃ at the output. The narrow middle layer forces the network to find the minimum information needed to reconstruct X. The network is trained by minimizing a reconstruction error L(X, X̃); for binary inputs, a common choice is the cross-entropy

    L(x, x̃) = -Σ_i [ x_i log(x̃_i) + (1 - x_i) log(1 - x̃_i) ]

The sample is centered at (1,3), with a standard deviation of 3 in the (0.866, 0.5) direction and of 1 in the orthogonal direction. The directions represent the principal components (PC) associated with the sample. By Nicoguaro (own work), CC BY 4.0 (http://creativecommons.org/licenses/by/4.0), via Wikimedia Commons.
not always enough.
Autoencoders have the advantage of being able to represent even non-linear
representations using a non-linear activation function.
One famous example of an autoencoder was given by Tom M. Mitchell in his book Machine Learning, where a network with eight inputs, three hidden neurons, and eight outputs learns a hidden encoding that corresponds exactly with the binary representation with three bits.
There are situations though where just the single hidden layer is not enough to
represent the whole complexity and variance of the data. Deeper architecture can
Autoencoder#/media/File:Autoencoder_structure.png)
In the case of binary inputs, we want to use sigmoid as the output activation function and cross-entropy, or, more precisely, the sum of Bernoulli cross-entropies, as the loss. For real values, we can use the mean squared error.
For different types of input data (x binary or real-valued), we can follow a probabilistic approach, which consists of the following steps:
1. Finding the probability distribution of observing x, given u.
2.
x̃, the reconstructed output.
with a given probability
v
for numerical inputs.
noise level.
We will use the noisy variant,
x
input dimensionality is
will write the loss function
We will write the loss function as the sum of the per-example reconstruction losses:

    J = Σ_i L(x^(i), x̃^(i))
, the following quantity corresponds to the squared L2 norm (the square root of the sum of all the squares) of the Jacobian of the hidden representation:

    ||J_f(x)||^2 = Σ_{i=1..d_x} Σ_{j=1..d_h} ( ∂h_j(x) / ∂x_i )^2

and the contractive autoencoder adds it as a penalty to the reconstruction cost:

    J_CAE = J + λ ||J_f(x)||^2
the average activation of each hidden unit over the training set,

    ρ̂_j = (1/m) Σ_{i=1..m} a_j(x^(i))

where a_j(x^(i)) is the activation of hidden unit j on example x^(i).
distribution with mean ρ, that is, a Bernoulli distribution B(1, ρ), while the observed average activation follows B(1, ρ̂_j); here ρ is the sparsity level we would like to achieve. The discrepancy between the two distributions can be measured with the Kullback-Leibler divergence,

    D_KL(P || Q) = Σ_x P(x) log( P(x) / Q(x) )

Whenever Q(x) tends to 0 while P(x) > 0, the contribution of that term will be unbounded, so the divergence heavily penalizes putting no mass where P has mass. In our case, the divergence between the two Bernoulli distributions is

    KL(ρ || ρ̂_j) = ρ log(ρ / ρ̂_j) + (1 - ρ) log( (1 - ρ) / (1 - ρ̂_j) )

and the sparse autoencoder cost becomes

    J_sparse = J + β Σ_j KL(ρ || ρ̂_j)
backpropagation on each example.
the hidden layers, if any.
    p(v) = e^(-E(v)) / Z

Here, v is our state, E is our energy function, and Z is the partition function:

    Z = Σ_v e^(-E(v))

Before introducing hidden units, consider binary units v_i taking values in {1, 0}, connected by symmetric weights w_{i,j}. The energy of a configuration is

    E(v) = -Σ_{i,j} v_i w_{i,j} v_j

In matrix notation, it is as follows:

    E(v) = -(1/2) v^T W v

The 1/2 in the equation is because we are going through every pair of i and j twice.
A model distribution parametrized by W is then

    p(x; W) = e^(-E(x; W)) / Z(W),  with Z(W) = Σ_{x'} e^(-E(x'; W))

where the sum in the partition function runs over every possible state x'.
We now have W as part of our model, so the distribution will change to p(x; W), and training means maximizing the log-likelihood of the data. Taking the derivative with respect to a weight w_{i,j} gives

    ∂ log p(x; W) / ∂w_{i,j} = x_i x_j - Σ_{x'} p(x'; W) x'_i x'_j

The first term depends only on the data (the positive phase); the second is an expectation over the model distribution (the negative phase). Written compactly,

    ∂ log p(x; W) / ∂w_{i,j} = <x_i x_j>_data - <x_i x_j>_model
    E(v, h) = -a^T v - b^T h - v^T W h

where a^T v = Σ_i a_i v_i and b^T h = Σ_j b_j h_j. Now we need to take the gradients of our biases and weights with respect to this new energy function. Because there are no connections within a layer, the conditional distributions factorize into per-unit logistic sigmoids:

    p(h_j = 1 | v, W, b) = σ( b_j + Σ_i v_i w_{i,j} )
    p(v_i = 1 | h, W, a) = σ( a_i + Σ_j h_j w_{i,j} )
Here, i is going through each visible node and j through each hidden node. If






we take the gradient of the log-likelihood of the visible data, we get a positive and a negative phase, now involving both visible and hidden units:

    ∂ log p(v) / ∂w_{i,j} = <v_i h_j>^0 - <v_i h_j>^∞

The first expectation, <v_i h_j>^0, is taken with h sampled from p(h | v) at the data and is easy to compute; the second is taken under the full model distribution p(v, h) and would require running a Gibbs chain to equilibrium. Contrastive divergence replaces it with the expectation after a single reconstruction step:

    ∂ log p(v) / ∂w_{i,j} ≈ <v_i h_j>^0 - <v_i h_j>^1

where <v_i h_j>^1 is approximated by taking the Monte Carlo samples from one step of Gibbs sampling started at the data.
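Before looking at the TensorFlow version, the CD-1 update can be sketched in plain numpy. Everything here (layer sizes, learning rate, the tiny binary "dataset") is made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden = 6, 3
W = rng.normal(0.0, 0.1, size=(n_visible, n_hidden))
a = np.zeros(n_visible)   # visible bias
b = np.zeros(n_hidden)    # hidden bias

# Toy binary data: the first three units tend to fire together,
# and so do the last three
data = np.array([[1, 1, 1, 0, 0, 0],
                 [1, 1, 0, 0, 0, 0],
                 [0, 0, 0, 1, 1, 1],
                 [0, 0, 0, 0, 1, 1]], dtype=float)

def reconstruct(v):
    # one up-down pass: p(h|v) then p(v|h)
    return sigmoid(sigmoid(v @ W + b) @ W.T + a)

mse_before = float(np.mean((reconstruct(data) - data) ** 2))

lr = 0.1
for _ in range(1000):
    v0 = data
    ph0 = sigmoid(v0 @ W + b)                           # p(h^0 | v^0)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)    # sample h^0
    pv1 = sigmoid(h0 @ W.T + a)                         # p(v^1 | h^0)
    v1 = (rng.random(pv1.shape) < pv1).astype(float)    # sample v^1
    ph1 = sigmoid(v1 @ W + b)                           # p(h^1 | v^1)
    # CD-1: <v h>^0 - <v h>^1, averaged over the batch
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / len(data)
    a += lr * (v0 - v1).mean(axis=0)
    b += lr * (ph0 - ph1).mean(axis=0)

mse_after = float(np.mean((reconstruct(data) - data) ** 2))
print(mse_before, mse_after)
```

After training, the reconstruction error drops well below its initial value, showing that even this tiny RBM has captured the correlation structure of the data.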
Implementation in TensorFlow
computational graph during usage. In this case, the
hold the
value we will be passing in, and the
want 784 values, one for each pixel, and the
a None dimension means that it can be of any size; so, this will allow us to send
our preceding equations. The argument passed to it is how the variable values
visible_bias = tf.Variable(tf.zeros([784]))
The computation mirrors the preceding equations: from the input v^0 we compute the hidden activation p(h^0 | v^0); from a sample h^0 we compute the reconstruction p(v^1 | h^0); and from the reconstruction we compute the final hidden activation p(h^1 | v^1). The positive phase is built from v^0 and p(h^0 | v^0), and the negative phase from v^1 and p(h^1 | v^1):
weight_update =
(positive_phase - negative_phase))
adds the given quantity to the variable. Here, 0.01 is our learning rate, and we scale
hidden_bias_update = hidden_bias.assign_add(LEARNING_RATE *
tf.reduce_mean(hidden_activation - final_hidden_activation, 0))
Note that this is purely used for information; there is no backpropagation run against
epochs 9 loss 410.387
epochs 10 loss 313.583
command on the
placeholder:[mnist.train.images[0]]})
look like:

Reconstructions with different numbers of hidden nodes. With enough hidden nodes, the RBM can do an almost perfect reconstruction of the image, with only a little blurring around the edges. But as
order to preserve only the most informative part of it and be able to deterministically
recognize our food, to run away from danger, to recognize our friends and family,
biological models
, the paper
published in 2012 by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton titled:
. Though the genesis

because of noise in the data, they may not be precisely plotted onto a parabola. In the first figure we have matched our prediction exactly to the data points; in the second we have matched our prediction to the data in such a way that our prediction approximates the shape of the line connecting the data points, but does not match exactly the data points. The second curve, though less precise on the current input, is more likely to predict future data points than the curve in the first figure.
plt.show()
values that are all the same. However, we can instead use different values, for
neighboring values of the input, add them up,
following images:
weights to a neuron: its output is equal to the regular activation value, calculated by summing the contributions of all input neurons, mediated by the corresponding connection weights.
picture. In that case, we would assume that the missing neurons have value 0 (zero-padding). In general, for an input of size (I_w, I_h), a filter of size (F_w, F_h), padding P, and stride S, the output size will be

    ( (I_w - F_w + 2P) / S + 1 ) × ( (I_h - F_h + 2P) / S + 1 )

The number of weights in each filter, on the other hand, depends only on the filter size and the depth D of the input: each filter has F_w * F_h * D weights (plus one bias), regardless of the size of the image.

convolutional layer. As we discussed, one of the advantages of convolutional layers is that the number of parameters does not grow with the input size. Without padding, the output size above reduces to

    ( (I_w - F_w) / S + 1 ) × ( (I_h - F_h) / S + 1 )
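These size formulas are easy to wrap in a small helper; the function below is our own sketch (the name is not from any library):

```python
def conv_output_size(i, f, padding=0, stride=1):
    """Output size along one dimension for input size i, filter size f,
    zero-padding p, and stride s: (i - f + 2p) / s + 1."""
    return (i - f + 2 * padding) // stride + 1

# A 28x28 MNIST image with a 5x5 filter, no padding, stride 1
print(conv_output_size(28, 5))             # 24
# Padding 2 with a 5x5 filter keeps the size ("same" convolution)
print(conv_output_size(28, 5, padding=2))  # 28
# Stride 2 roughly halves the output
print(conv_output_size(28, 4, stride=2))   # 13
```

Checking a planned architecture with a helper like this, layer by layer, is an easy way to catch size mismatches before training.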
measure, which is the square root of the sum of all the squares. In practice,
also generally be applied to a fully connected layer, is to "drop" some neurons with a fixed probability p. During each training period, each neuron has probability p
implement a simple example of a convolutional layer using Theano.
The skimage module we imported can be used to load an image; we will import
A convolutional layer example with Keras
dropped once every four times.
This will take us above 99%. If we want to improve more (accuracy may vary due
to differences in initializations), we can also add more dropout layers, for example,
np.random.seed(0) #for reproducibility
model.compile(loss='categorical_crossentropy',
Very Deep Convolutional
Left: Visual illustration of the RNN recurrence relation:
recurrence relation. An important implication of this is that RNNs have memory
one to one
one to many
many to one
many to many
RNN — how to implement and train

In the previous section, we briefly discussed what RNNs are and what problems they
activities we built during the forward step. This backward pass pops activities off the
stack. At each step, the gradient of the cost with respect to the state flows backward through the factor

    ∂S_t / ∂S_{t-1}

so the gradients with respect to the shared parameters accumulate over all the time steps:

    ∂J/∂U = Σ_t (∂J/∂S_t) (∂S_t/∂U)
    ∂J/∂W = Σ_t (∂J/∂S_t) (∂S_t/∂W)
172 ]Vanishing and exploding gradients
The root of the problem is the term

    ∂S_t / ∂S_{t-m} = Π_{k=t-m+1..t} ∂S_k / ∂S_{k-1}

When the norms of these factors are smaller than 1, the product shrinks exponentially with m and the gradient vanishes; when they are larger than 1, the product grows exponentially and the gradient explodes. What
happens is that the cost surface we are training on is highly unstable. Using small
Gradient clipping, where we threshold the maximum value a gradient can
practice, due to

The forget gate decides how much of the previous cell state to keep. It takes the previous output h_{t-1} and the current input x_t as input and outputs a value between 0 and 1 through the logistic function, available for each cell block's memory:

    f_t = σ( W_f x_t + U_f h_{t-1} + b_f )

The input gate decides how much of the candidate value will enter the cell state:

    i_t = σ( W_i x_t + U_i h_{t-1} + b_i )

The candidate value a_t, in turn, is derived from the previous output (h_{t-1}) and the current input (x_t):

    a_t = tanh( W_a x_t + U_a h_{t-1} + b_a )

The new cell state mixes the old state and the candidate through the forget and input gates:

    c_t = f_t ∘ c_{t-1} + i_t ∘ a_t

Finally, the output gate decides how much of the cell state will be exposed as the block's output:

    o_t = σ( W_o x_t + U_o h_{t-1} + b_o )
    h_t = o_t ∘ tanh(c_t)
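The gate equations map almost line for line onto code. The following is a sketch of a single LSTM step in numpy with made-up sizes and random weights (for illustration only; any real implementation would of course learn the weights):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_in, n_hid = 4, 3

# One (W, U, b) triple per gate: f = forget, i = input,
# a = candidate, o = output
params = {g: (rng.normal(size=(n_hid, n_in)),
              rng.normal(size=(n_hid, n_hid)),
              np.zeros(n_hid)) for g in ("f", "i", "a", "o")}

def lstm_step(x, h_prev, c_prev):
    W, U, b = params["f"]; f = sigmoid(W @ x + U @ h_prev + b)  # forget gate
    W, U, b = params["i"]; i = sigmoid(W @ x + U @ h_prev + b)  # input gate
    W, U, b = params["a"]; a = np.tanh(W @ x + U @ h_prev + b)  # candidate
    c = f * c_prev + i * a                                      # new cell state
    W, U, b = params["o"]; o = sigmoid(W @ x + U @ h_prev + b)  # output gate
    h = o * np.tanh(c)                                          # new output
    return h, c

h, c = np.zeros(n_hid), np.zeros(n_hid)
for t in range(5):  # run the cell over a short random input sequence
    h, c = lstm_step(rng.normal(size=n_in), h, c)
print(h.shape, c.shape)
```

Note how c is changed only by elementwise gating and addition; this additive path is what lets gradients flow over many steps without vanishing.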

long sequence, say w_1, ..., w_m, computing the joint probability directly is infeasible. Calculating the joint probability of

    P(w_1, ..., w_m)

is done by applying the following chain rule:

    P(w_1, ..., w_m) = P(w_1) P(w_2 | w_1) P(w_3 | w_1, w_2) ⋯ P(w_m | w_1, ..., w_{m-1})
approximated by an independence assumption that the
words. For example, in the phrase
the quick brown fox
we would estimate the probability of the word fox given the words the quick brown. In an n-gram model, each word is conditioned only on the previous n-1 words:

    P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-n+1}, ..., w_{i-1})

and the conditional probabilities are estimated from counts:

    P(w_i | w_{i-n+1}, ..., w_{i-1}) = count(w_{i-n+1}, ..., w_i) / count(w_{i-n+1}, ..., w_{i-1})
The independence assumption that the i-th word is only dependent on the previous n-1 words





means that, for a bigram model (n = 2), the chain rule product simplifies to

    P(w_1, ..., w_m) = P(w_1) P(w_2 | w_1) P(w_3 | w_2) ⋯ P(w_m | w_{m-1})
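A bigram model of this kind can be estimated by simple counting. The toy corpus and the smoothing-free maximum-likelihood estimates below are our own illustration:

```python
from collections import Counter

corpus = ("the quick brown fox jumps over the lazy dog . "
          "the quick fox sleeps .").split()

# Contexts: every token that has a successor
unigrams = Counter(corpus[:-1])
# Adjacent pairs of tokens
bigrams = Counter(zip(corpus, corpus[1:]))

def p(word, prev):
    # Maximum-likelihood bigram estimate: count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

print(p("quick", "the"))  # 2/3: two of the three "the" are followed by "quick"
print(p("lazy", "the"))   # 1/3
```

Real systems add smoothing for unseen bigrams, but even this bare version shows why the counts become hopelessly sparse as n grows, which is the motivation for the neural language models that follow.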


(each word is encoded uniquely). The embedding function then transforms this V-dimensional

space into a distributed representation of size D (here D=4).
the words. It associates each word in the vocabulary with a continuous-valued
The embedding layer takes the one-hot representation of the word w_i and converts it into its embedding by multiplying it with the embedding matrix C. This computation can be efficiently implemented by a table lookup. The embedding matrix C is shared over all the words, so all words use the same embedding function. C is represented by a V * D matrix, where V is the size of the vocabulary and D the size of the embedding.

The resulting embeddings are concatenated into a hidden layer; after this, a bias b and a nonlinear function, such as tanh, can be applied. The output of the hidden layer is thus represented by the function z = tanh(concat(w_{t-n+1}, ..., w_{t-1}) + b). From the hidden layer, we can now output the probability distribution of the next word w_t by multiplying the hidden layer with U. This maps the hidden layer to the
and a model of the probability function for sequences of words. It is able to
learn about a model based
state_variables = tf.python.util.nest.pack_sequence_as(
self.initial_state,
for var in tf.python.util.nest.flatten(initial_state)))
probabilities_flat = tf.nn.softmax(logits_flat)
# Reshape to original batch and sequence length
Training
applications, ranging from speech recognition to creating intelligent chat bots that
words. For example, the word "bat" is composed of three phonemes
recognition system. A typical speech recognition pipeline takes in an audio signal

This means that roughly 50,000 amplitude samples were taken for this 1.2-second
For only a small example, these are a lot of points over the time dimension. To
reduce the size of the input data, these audio signals are typically preprocessed

which is a representation of how the frequencies in the signal change over time,
windows and taking the Fourier transform of each of these windows. The Fourier
Say the previous "hello world" recording is divided into overlapping windows
of 25 ms with a stride of 10 ms. The resulting windows are then transformed into
a frequency space with the help of a windowed Fourier transform. This means
that the amplitude information for each time step is transformed into amplitude
spoken as text. This can be
done by learning a time-dependent model that takes in a sequence of audio features,
as described in the previous section, and outputs a sequential distribution of possible
The acoustic model tries to model the likelihood that a sequence of audio features
was generated by a sequence of words or phonemes:
P(audio features | words) = P(audio features | phonemes) * P(phonemes | words).
A typical speech recognition acoustic model, before deep learning became popular, combined a Gaussian mixture model (GMM) with a hidden Markov model (HMM): the GMM models the local structure of the signal over a window of acoustic features, while the HMM is used to model the sequential structure of the data.
The HMM
assumes that successive frames are independent
given the hidden state of
the HMM. Because of this strong conditional independence assumption, the acoustic
features are typically decorrelated.
Doing this reduction means that multiple output sequences can be reduced to the same output labeling. To find the most likely output labeling, we have to add all the paths that correspond to that labeling. The task of searching for this most probable output labeling is known as decoding (see the Decoding section). An example of such a labeling in speech recognition could be outputting a sequence of phonemes, given a sequence of acoustic features. The CTC objective function, built on top of an LSTM, has been used to give state-of-the-art results on acoustic modeling, removing the need of using HMMs to model temporal variability [40], [41].
The most likely word sequence, given a sequence of audio features, is found by searching through all the possible word sequences [33]. A popular search algorithm is the Viterbi algorithm [45], a dynamic programming search algorithm that finds the most likely sequence of hidden states in an HMM.
We conclude this chapter by mentioning end-to-end techniques. Deep learning
Summary

In the beginning of this chapter, we learned what RNNs are, how to train them, what problems might occur during training, and how to solve these problems. In the second part, we described the problem of language modeling and how RNNs help us
[8] Paul J. Werbos (1990). "Backpropagation Through Time: What It Does
and How to Do It" Proceedings of the IEEE. URL:
edu/~martinez/classes/678/Papers/Werbos_BPTT.pdf
[9] Razvan Pascanu and Tomas Mikolov and Yoshua Bengio. (2012).
proceedings.mlr.press/v28/pascanu13.pdf
[33] Mark Gales and Steve Young. (2007). "The Application of Hidden
Markov Models in Speech Recognition". URL:
ac.uk/~mjfg/mjfg_NOW.pdf
[34] L.R. Rabiner. (1989). "A tutorial on hidden Markov models and
selected applications in speech recognition". URL:
ca/~murphyk/Bayes/rabiner.pdf
[45] G.D. Forney. (1973). "The viterbi algorithm". URL:
systems.caltech.edu/EE/Courses/EE127/EE127A/handout/
ForneyViterbi.pdf
[46] Xavier L. Aubert (2002). "An overview of decoding techniques for large
vocabulary continuous speech recognition". URL:
afs/cs/user/tbergkir/www/11711fa16/aubert_asr_decoding.pdf
[47] Alex Graves and Navdeep Jaitly. (2014). "Towards End-To-End Speech

Board Games
what life in the 21st century would look like. They imagined a world of people
as checkers and chess. Eventually, we'll build up enough knowledge to be able to
Agent
is the player that will try to find its way through the maze.
Environment
Reward
They are both games of perfect information. The entire state of the game is
always known to both players, unlike a game such as poker, where the
value of the best possible action that the player could make. This gives us the value
if has_3_in_a_line([board_state[2 - i][i] for i in range(3)]):
min-max for the whole game from the board's starting position until we have gone
The first two arguments to the method, which we are already familiar with, are board_state and side; however, max_depth is new. Min-max is a recursive algorithm, and max_depth will be the maximum number of recursive calls we will
starts. If you have reached
player. If you haven't reached
if best_score is None or score > best_score:
    best_score = score
    best_score_move = move
score of +5. The max node will never choose a worse outcome than this. But now that
position and are known to lead to favorable or unfavorable positions. For example,
Training AI to master Go
those that involve min-max. If we go back to chess as an example, in the position
winning move; however, if the samples randomly move, black has an opportunity to win
trees
only gives an average value, though it allows us to work with much larger state spaces that cannot be evaluated with min-max. Is there
If we play machine

that the true mean is greater than the sample mean—
Hoeffding's inequality: Hoeffding, Wassily (1963), "Probability Inequalities for Sums of Bounded Random Variables".
    P{ x̄ ≥ E[x] + u } ≤ e^(-2nu²)

If we set the right-hand side equal to a small probability p,

    p = e^(-2nu²)

and solve for u, we get

    u = sqrt( -ln p / (2n) )

Choosing p = n^(-4) gives

    u = sqrt( 2 ln n / n )

This bound is the basis of the upper confidence bounds (UCB1) algorithm. We can substitute our values with the values in the multiarmed bandit problem we saw earlier, where x̄_i is the mean reward received from the machine i, n_i is the number of plays of machine i, and n is the sum of plays across all machines:

    UCB1_i = x̄_i + sqrt( 2 ln n / n_i )
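The UCB1 rule itself is a one-liner. The sketch below uses the same signature as the upper_confidence_bounds call in the code that follows (total reward, sample count, and the log of the total play count); the three machines and their payouts are made up:

```python
import math

def upper_confidence_bounds(total_reward, samples, log_total_samples):
    # mean reward plus the exploration bonus sqrt(2 ln n / n_i)
    return total_reward / samples + math.sqrt(2 * log_total_samples / samples)

# Three machines: total payouts 5.0, 6.0, 0.2 after 10, 10 and 2 plays
n = 22
log_n = math.log(n)
scores = [upper_confidence_bounds(5.0, 10, log_n),
          upper_confidence_bounds(6.0, 10, log_n),
          upper_confidence_bounds(0.2, 2, log_n)]
print(scores.index(max(scores)))
```

With these numbers, the third machine wins despite its poor observed mean: it has been played so little that its confidence bonus dominates, which is exactly the exploration behavior UCB1 is designed to produce.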
discuss next. We will do successive rollouts from the current board state. At each
The preceding diagram illustrates how this happens. In position A, there
are statistics collected for all four possible moves. Because of this, the UCB1
position B
have statistics collected on them. Because of this, the move you need to
We add statistics for any position that we passed through that already has
Here is what this looks like in Python for our
total_samples

move, current_side)
for move in
available_moves(current_board_state)}
if not move_states:

If there are no moves to make, we are in a terminal state: it is a draw. So we need to
record that as
move_states):
log_total_samples = math.log(sum(state_samples[s]
                                 for s in move_states.values()))
move, state = max(move_states.items(),
                  key=lambda item: upper_confidence_bounds(
                      state_results[item[1]],
                      state_samples[item[1]],
                      log_total_samples))
else:
    move = random.choice(list(move_states.keys()))
otherwise, we make the selection randomly:
state that the move puts us in:

current_side))
if current_board_state not in state_samples:
arise is that the moves used in the evaluation are selected randomly when we know

The policy gradient rests on a simple identity. We want to increase the expected reward

    E[R | P(a|s, θ)] = Σ P(a|s, θ) r(s, a)

by adjusting the policy parameters θ. Differentiating, and using the identity

    ∇f(x) = f(x) ∇ log f(x)

we obtain

    ∇ E[R] = Σ r(s, a) ∇ P(a|s, θ)
           = Σ r(s, a) P(a|s, θ) ∇ log P(a|s, θ)
           = E[ R ∇ log P(a|s, θ) ]

so the gradient of the expected reward can be estimated by sampling actions from the policy itself and weighting the gradients of the log-probabilities by the rewards received.
2.
Run a batch of episodes with our agent running in its environment. Select
its actions

hidden_weights_3 = tf.Variable(
tf.truncated_normal((HIDDEN_NODES[1], HIDDEN_NODES[2]), stddev=1. /
np.sqrt(HIDDEN_NODES[1])))
output_weights = tf.Variable(tf.truncated_normal((HIDDEN_NODES[-1],
OUTPUT_NODES), stddev=1. / np.sqrt(OUTPUT_NODES)))
reward we receive from the environment, in this case, the result of our game of
tic-tac-toe. The other is meant for the actual action we will take at each time step.
In the make_move method, we do a few different things. First, we flatten board_state, which starts as a 2-dimensional array, into the one-dimensional array that we need to
examples go into our mini-batches for training. Many different values of this work
well; 100 is one of these.
we have done. This will track when we need to kick off a mini-batch training:
reward =
a game using our old friend, the
Policy gradients in AlphaGo



the value network learns a function that approximates the value of a position:

    V(s) ≈ f(s)
front of a large audience with a $1,000,000 prize for the winner. Lee Sedol was very
runs a lot wider. Many problems that we encounter can be formulated as games.
A supervised learning approach to games

    V = Σ_{t=0..∞} γ^t · r_t

In this equation, V is the reward for a sequence of actions taken, r_t is the reward received at time t in this sequence, and γ is a constant with 0 ≤ γ ≤ 1, which means rewards further in the future are less valuable than rewards achieved sooner; this is often referred to as the discount factor. If we go back to our maze, this function
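As a quick illustration, the discounted return can be computed backwards over a list of per-step rewards; this helper and its names are ours, not the book's:

```python
def discount_rewards(rewards, gamma=0.99):
    # V_t = r_t + gamma * V_{t+1}, computed from the last step backwards
    discounted = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        discounted[t] = running
    return discounted
```

With rewards [0, 0, 1] and γ = 0.5, each earlier step values the final reward half as much as the step after it.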




Q-learning defines the value of a state-action pair recursively, in terms of the best action available from the next state:

    Q(s_t, a_t) = r_{t+1} + γ · max_a Q(s_{t+1}, a)
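A tabular sketch of the Q-learning update (the dictionary-based table and the learning rate alpha are our assumptions; the chapter itself approximates Q with a neural network):

```python
def q_update(q, state, action, reward, next_state, actions,
             alpha=0.5, gamma=0.9):
    # Move Q(s, a) toward the bootstrapped target r + gamma * max_a' Q(s', a')
    best_next = max(q.get((next_state, a), 0.0) for a in actions)
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + alpha * (reward + gamma * best_next - old)
    return q[(state, action)]
```

Repeated over many transitions, the table converges toward the fixed point of the recursive definition.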


We create a state_batch, each item of which is each of the states in the game,

    minus_action_q_value = DISCOUNT_FACTOR * (states[minus_action_index] + …)
    if terminal[plus_action_index]:
        plus_action_q_value = DISCOUNT_FACTOR * states[plus_action_index]
    else:
        plus_action_q_value = DISCOUNT_FACTOR * (states[plus_action_index] + …)
We use tanh in the hidden layers.
hits capacity, it will start removing items from the beginning of the queue. We make the deque with a maxlen of 20,000, which means we will only store the last 20,000
decided on from the previous main loop. It will be a one-hot vector:
    env.render()
    last_action = choose_next_action(last_state)
    current_state, reward, terminal, _ = env.step(np.argmax(last_action))
to take based on the
train on:
actions = [d[1] for d in mini_batch]
rewards = [d[2] for d in mini_batch]
current_states = [d[3] for d in mini_batch]
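A sketch of how the replay memory and the mini-batch extraction fit together (the tuple layout matches the comprehensions above; the helper name is ours):

```python
import random
from collections import deque

observations = deque(maxlen=20000)  # oldest transitions fall off the front

def sample_mini_batch(memory, size):
    # Draw a random mini-batch and split the transition tuples into
    # parallel lists, as done in the training loop.
    batch = random.sample(list(memory), size)
    states = [d[0] for d in batch]
    actions = [d[1] for d in batch]
    rewards = [d[2] for d in batch]
    current_states = [d[3] for d in batch]
    return states, actions, rewards, current_states
```

Sampling uniformly from the memory breaks the correlation between consecutive frames, which stabilizes training.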
200 turns. After 400 games, we beat that comfortably, averaging well over 300 turns
cart pole task we just looked at. For cart pole, if a bad move is made that leads to the
1: Move left
2: Stay still
3: Move right
scores.append(reward_per_game)
see the path of the ball and the direction in which both paddles are moving:
that we need to map our three actions to. This sounds like a lot, but the problem
    tf.nn.conv2d(hidden_convolutional_layer_2, convolution_weights_3,
                 strides=[1, 1, 1, 1], padding="SAME")
    + convolution_bias_3)
convolutional layer,
our convolutional layer, which is of dimensions (None, 9, 11, 64), into a
    feed_forward_bias_1 = tf.Variable(
        tf.constant(0.01, shape=[FLAT_HIDDEN_NODES]))
    final_hidden_activations = tf.nn.relu(
        tf.matmul(hidden_convolutional_layer_3_flat,
                  feed_forward_weights_1) + feed_forward_bias_1)
    feed_forward_weights_2 = tf.Variable(
        tf.truncated_normal([FLAT_HIDDEN_NODES, ACTIONS_COUNT],
                            stddev=1. / np.sqrt(FLAT_HIDDEN_NODES)))
    feed_forward_bias_2 = tf.Variable(
        tf.constant(0.01, shape=[ACTIONS_COUNT]))
    output_layer = tf.matmul(final_hidden_activations,
                             feed_forward_weights_2) + feed_forward_bias_2
is the combination of multiple frames; for Breakout
use it.
Another issue we may run into is that while the cart pole is trained in as little as a
couple of minutes, for Breakout, training will be measured in days. To guard against
random. This same Q-learning algorithm has been tried on a wide range of Atari


The Q-function and the state value are linked: the utility of a state is the value of its best available action:

    U(s) = max_a Q(s, a)

While we used policy gradients for a board game and Q-learning for a computer game, neither technique is limited to that type. Originally,
Model-based learning
: In this approach, which will be discussed in more
What we will look at is the baseline actor-critic. Here, the critic tries to learn the average performance of

the policy from each state, the baseline b(s_t). The raw return in the policy gradient update is then replaced by how much better than this baseline we did:

    R_t − b(s_t)
massively reducing the variance of training. If we run the cart pole task once using
The advantage function measures how much better taking an action is than the average action from that state:

    A(s_t, a_t) = Q(s_t, a_t) − V(s_t)


We also know that Q can be written in terms of the value of the next state:

    Q(s_t, a_t) = r_{t+1} + γ · V(s_{t+1})

We can substitute that in to give us our advantage function purely in terms of V:



    A(s_t, a_t) = r_{t+1} + γ · V(s_{t+1}) − V(s_t)

gives the next state given the current state and action pair:
    s_{t+1} = f(s_t, a_t)
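Numerically, the advantage estimates for an episode are easy to compute once the critic supplies per-state values; a small sketch with assumed names:

```python
def advantages(rewards, values, gamma=0.99):
    # values[t] is the critic's V(s_t); values carries one extra entry
    # for the state after the last reward (0.0 if terminal).
    # A(s_t, a_t) = r_{t+1} + gamma * V(s_{t+1}) - V(s_t)
    return [rewards[t] + gamma * values[t + 1] - values[t]
            for t in range(len(rewards))]
```

A positive advantage means the action did better than the critic expected from that state, so the policy is pushed toward it.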
The first example will reuse the MNIST digit dataset that you have seen in Chapter 3, Deep Learning Fundamentals.
In a log of transactions, we observe that a particular customer spends an average
of $10 for their lunch every weekday. Suddenly, one day they spend $120. This is
certainly an outlier, but perhaps that day they decided to pay the whole bill with
their credit card. If a few of those transactions are orders of magnitude higher than
their expected amount, then we could identify an anomaly. An anomaly is when the pattern itself deviates from the expected behavior; for instance, transactions of $120 or higher over three consecutive orders. In this scenario, we are talking of anomalies because a pattern of repeated and linked outliers has been generated from outside the expected distribution, but still acceptable. The effect on the distributions
Even if systematically repeated, the outliers could have a reasonable explanation. The fraudulent activity would soon turn into the newer expected behavior, the normality.
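The lunch-bill example can be made concrete with a simple z-score test; this is a generic sketch, not the book's detector:

```python
import statistics

def is_outlier(history, value, z_threshold=3.0):
    # Flag values further than z_threshold standard deviations
    # from the mean of the observed history.
    mean = statistics.mean(history)
    std = statistics.pstdev(history)
    if std == 0:
        return value != mean
    return abs(value - mean) / std > z_threshold
```

A $120 bill against a history of roughly $10 lunches lands far outside three standard deviations, while an ordinary $10 bill does not.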
would do for many machine
but also the toughest. It is like a chess game. On one side, you have subject matter
unsupervised approaches. In addition, we also propose a semi-supervised schema:
Supervised
: Anomaly
models such as Multivariate Gaussian Distribution.
blob/master/hands-on_training/images/autoencoder.png)
to working with R, Spark, and Python pandas data frames.
The backend can then be switched among different engines. It can run locally in your
machine or it can be deployed in a cluster on top of Spark or Hadoop MapReduce.
H2O will automatically handle the memory occupation and will optimize the
execution plan for most of the data operations and model learning.
It provides a very fast scoring of data points against a trained model; it is advertised
to run in nanoseconds.
In addition to traditional data analysis and machine learning algorithms, it features a
few very robust implementations of deep learning models.
The general API for building models is via the
train = train_with_label[predictors]
neurons using a hyperbolic tangent as an activation function and 100 epochs (100
df = digit.as_data_frame()
the left) and the "bad" (on the right). The rightmost tail (greater than 0.05) could be
If we look closely, the bottom-left shape of the digit is so long that it almost touches the corner. The relatively high score (~1.62) is probably explained by the abnormal handwriting style.
The model might mark this image as anomalous due to a licit property, simply because the training data does not
test_ugly = sorted_test_with_error_df.tail(100)
plot_multi_digits(test_ugly, 10, 10, "ugly digits")
the easiest digit to write due to its simple structure of a straight line. Thus, digits of
The
handwriting. Thus, those are most likely to represent "anomalies". They are most
Please be aware that different runs may lead to different results, due to the randomness introduced for scalability reasons by the race conditions generated by the Hogwild! algorithm.
Hence, we
reproducible=True)
model.train(
training_frame=train_ecg
)
indistinguishable, which means they will have a higher reconstruction error.
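The per-row reconstruction-error scoring behind these plots can be sketched in plain Python (the function name and the min-max rescaling to [0, 1] are our choices):

```python
def reconstruction_scores(rows, reconstructions):
    # Per-row mean squared error between each input and its
    # reconstruction, rescaled so the most anomalous row scores 1.0.
    mse = [sum((a - b) ** 2 for a, b in zip(row, rec)) / len(row)
           for row, rec in zip(rows, reconstructions)]
    low, high = min(mse), max(mse)
    return [(m - low) / (high - low) for m in mse]
```

Rows the auto-encoder cannot reproduce well end up near the top of the ranking and are the candidates for inspection.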
layer_column = 'DF.L{}.C'.format(layer + 1)
columns = [layer_column + '1', layer_column + '2']
ax.annotate(k, v, size=20, verticalalignment='bottom',
horizontalalignment='left')
fig.canvas.draw()
representation of data points seed {}".format(seed))
What is more interesting is that the outlier points (marked with 20, 21, and 22)
We could then use the auto-encoder for reducing the dimensionality followed by
an unsupervised approach (for example, density-based clustering) to group similar
A comprehensive list of techniques for adaptive learning
How to validate both in presence and absence of ground truth
Each hidden unit j computes a weighted sum of its d inputs:

    h_j = Σ_{i=1..d} w_{ij} · x_i

If the weights and the inputs are independent and have zero mean, the mean will be:

    E[h_j] = Σ_{i=1..d} E[w_{ij}] · E[x_i] = 0

while the variance is:

    E[h_j²] = Σ_{i=1..d} E[w_{ij}²] · E[x_i²] = d · Var(w) · Var(x)

The output of the hidden unit j will be transformed by the activation function:

    y_j = a(h_j)

To keep the variance of h_j of order one, we can draw each weight uniformly from the interval [−1/√d, 1/√d], whose variance is:

    Var(w) = (2/√d)² / 12 = 1/(3d)







Consider a deep auto-encoder with layers L0, L1, L2, L3, L4, where L0 is the input data. We can pre-train it as a stack of shallow auto-encoders: one formed by (L0, L1), the next by (L1, L2), each trained on the output of the previous
auto-encoder, which will be used to initialize the weights of the deep network. The decoding layers share the same initial weights and bias of the encoding layers.
Similarly, a deeper auto-encoder with layers L0, L1, L2, L3, L4, L5 can be decomposed into the shallow auto-encoders (L0, L1), (L1, L2), and (L2, L3, L4).
In general, if the deep auto-encoder has N layers, we can treat it as a stack of shallow auto-encoders, the i-th formed by the layer L_i together with its mirrored layer L_{N−i}.


backpropagation and
While SGD is the de facto most popular training algorithm for many machine learning models, several parallel variants have been proposed in the literature, but most of them are bottlenecked by synchronization between processors, without taking
In the Hogwild! setting, the cost function is assumed to be sparsely separable:

    L(x) = Σ_{e∈E} L_e(x_e)

Here, each e is a small subset of the indices of the input vector x, and x_e is the portion of the vector indexed by e. If we sample a term e uniformly at random from E, we can compute its sparse gradient:

    G_e(x) = |E| · ∇L_e(x_e)

Each processor then repeatedly updates one component of x at a time, without any locking:

    x_v ← x_v − γ · b_v^T · G_e(x)

where v corresponds to a selected index of e and b_v is the standard basis vector for that coordinate.
Because computing the gradient is not instantaneous, any processor may have updated x in the meantime, so each update may be based on a value of x read many clock cycles earlier. The main contribution of the authors lies in providing conditions under which this asynchronous, incremental gradient algorithm converges.
If θ_t represents the quantity we are updating at iteration t, plain SGD applies the update:

    θ_{t+1} = θ_t + Δθ_t with Δθ_t = −λ · g_t

where g_t = ∇L(θ_t) is the gradient of the cost function and λ is the learning rate. With momentum, we instead accumulate a velocity that smooths successive gradients:

    v_t = μ · v_{t−1} − λ · ∇L(θ_t)
    θ_{t+1} = θ_t + v_t

momentum could lead to divergence though. Suppose we are running SGD with

a momentum μ and a roughly constant gradient g; the velocity then approaches the terminal value:

    v = −λ · g / (1 − μ)
Moreover, at the beginning of the learning, there may already be large gradients (the
effect of weights initialization). Thus, we would like to start with a small momentum
(for example, 0.5); once the large gradients have disappeared, we can increase the
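A scalar sketch of momentum SGD with such a schedule (the switch point and the constants are illustrative):

```python
def momentum_sgd(grads, lr=0.1, mu_small=0.5, mu_large=0.9, switch=10):
    # Start with a small momentum, then increase it once the large
    # initial gradients have died down.
    theta, velocity = 0.0, 0.0
    for t, g in enumerate(grads):
        mu = mu_small if t < switch else mu_large
        velocity = mu * velocity - lr * g
        theta += velocity
    return theta
```

With a constant gradient, the velocity builds up geometrically until the decay μ balances the new gradient contribution.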


With Nesterov momentum, we evaluate the gradient at the look-ahead point:

    v_t = μ · v_{t−1} − λ · ∇L(θ_t + μ · v_{t−1})
    θ_{t+1} = θ_t + v_t






The cost at the next iterate, if we omit the second derivative, can be approximated as:

    L(θ_t + v_t) ≈ L(θ_t) + v_t · ∇L(θ_t)
Second-order methods rescale the gradient by the inverse Hessian:

    θ_{t+1} = θ_t − H_t^{−1} · ∇L(θ_t)

AdaGrad approximates this per dimension by accumulating the squared gradients seen so far:

    G_{t,i} = Σ_{τ=1..t} g_{τ,i}²
    θ_{t+1,i} = θ_{t,i} − (λ / √(G_{t,i} + ε)) · g_{t,i}

This gives the method the property that progress along each dimension evens out over time, just like second-order methods.




Instead of accumulating all past squared gradients, AdaDelta keeps an exponentially decaying average:

    E[g²]_t = ρ · E[g²]_{t−1} + (1 − ρ) · g_t²

from which we define the root mean square of the gradient:

    RMS[g]_t = √(E[g²]_t + ε)


In order to make those equations correct, we shall ensure that the units are matching.
The update Δθ should have the same units as θ itself, as the Newton step H^{−1} · ∇L does. To achieve this, a second decaying average is kept over the updates themselves:

    E[Δθ²]_t = ρ · E[Δθ²]_{t−1} + (1 − ρ) · Δθ_t²

and the update becomes:

    Δθ_t = −(RMS[Δθ]_{t−1} / RMS[g]_t) · g_t
    θ_{t+1} = θ_t + Δθ_t

distribution of computation
and data processing in a map/reduce fashion.
the high-level distributed algorithm as follows:

1. Initialize
2. Shuffling (we will cover this data replication problem at the end of the paragraph)
3. Map
4. Reduce: average the partial models into the final one. This is a monoidal and commutative operation; averaging is associative and commutative.
5. Validate
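The reduce step can be sketched as an element-wise mean over the weight vectors trained on each node (a toy sketch with flat weight lists; real frameworks average tensors):

```python
def average_models(weight_sets):
    # Each element of weight_sets is one node's flat weight vector;
    # the reduce step averages them coordinate by coordinate.
    n = len(weight_sets)
    return [sum(column) / n for column in zip(*weight_sets)]
```

Because the operation is commutative, partial reductions can be combined in any order across the cluster.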
points in each node and p the number of processors (the nodes). The linear term is
In the preceding formula, we are not considering the memory occupation and the
train_samples_per_iteration | replicate_training_data | Description
0                           | False                   | Only one epoch, averaging over N models
-1                          | True                    | …
Sparkling Water is an alternative to the ML and MLlib frameworks for doing machine learning, and one of the few alternatives for deep learning on top of Spark. Spark is designed and implemented in Scala. In order to understand the
SQL type   | H2O type
NA         | …
BinaryType | Numeric
Byte       | …

Testing
system that:
Takes some of the available data
Divides it into training and validation, making sure to not introduce biases or
In our context, we want to quantify the quality of our theory and verify that it is good enough and that it has an evident advantage with respect to a much simpler theory
Fruitfulness
Is it
connections and marking them as normal or suspicious. This will allow the business
Cross-fold validation introduces a problem with class unbalancing. By
a scoring function s : ℝ^d → ℝ, where the higher the score, the higher the probability of being an anomaly. For auto-encoders, it could simply be the MSE computed on the reconstruction error, rescaled to be in the range [0, 1]. We
We can now validate using either the ROC or PR curve.
                 | Predicted anomaly   | Predicted non-anomaly
True anomaly     | True Positive (TP)  | False Negative (FN)
True non-anomaly | False Positive (FP) | True Negative (TN)

From the confusion matrix we can derive two measures of performance:

    True Positive Rate (Recall) = TP / (TP + FN)
    False positive ratio = FP / (FP + TN)
The ROC is a curve consisting of the points (False positive ratio(t), Recall(t)) obtained by sweeping the threshold t. A common single-number summary of precision and recall is the F1 score:

    F1 = 2 · Precision · Recall / (Precision + Recall)

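These quantities are straightforward to compute from the confusion counts; a small sketch:

```python
def precision_recall_f1(tp, fp, fn):
    # Precision: fraction of predicted anomalies that are real.
    # Recall: fraction of real anomalies that are caught.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

Note that true negatives never enter the formula, which is why F1 is preferred over accuracy when anomalies are rare.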

We can also use criteria that do not require labels, such as the Mass-Volume (MV) and Excess-Mass (EM) curves. Let s : ℝ^d → ℝ_+ be the scoring function: the higher the score, the more normal the observation. The scoring function will give an ordering of each observation. Let f : ℝ^d → ℝ_+ be the probability density function of the normal distribution of a random variable X. For a threshold t, we define the mass and the volume of the level set of s at t:

    α_s(t) = ℙ(s(X) ≥ t)
    λ_s(t) = Leb({x, s(x) ≥ t})

Telecom ParisTech/CNRS
The Mass-Volume curve of the scoring function s is the minimal volume needed to capture a given mass:

    MV_s(α) = inf_{t ≥ 0} λ_s(t) subject to α_s(t) ≥ α

The performance criterion for MV is the L¹ norm of this curve over an interval I of large masses:

    C^MV(s) = ‖MV_s‖_{L¹(I)}

Since MV_s(α) diverges as α → 1, one workaround is to choose the interval I = [0.9, 0.999]. The related Excess-Mass curve is defined as:

    EM_s(t) = sup_{u ≥ 0} ( α_s(u) − t · λ_s(u) )

with criterion C^EM(s) = ‖EM_s‖_{L¹(I^EM)}, where I^EM = [0, EM_s^{−1}(0.9)] and EM_s^{−1}(μ) = inf { t ≥ 0, EM_s(t) ≤ μ }.
UMR LTCI No. 5141, Telecom ParisTech/CNRS
algorithms, where the result of one trial affects the choice of next trials and/or
Relevance performance
select in the top priority queue is important but the ranking also matters.
testing. A/B testing is a statistical hypothesis testing with two variants (the control
the two groups. We expect that the performance should be very similar, if not equivalent
: The effectiveness of the advanced model alone
: The effectiveness of the advanced model in case of security
: The effectiveness of both advanced model and security
A summary of testing
that are both
Data Science Manifesto:
reading and reasoning around the listed principles.
easy.EasyPredictModelWrapper
entire H2O framework but only the
interfaces. It can be read and used from anything that runs in a JVM.
The POJO object will contain the model class name corresponding to the model id
used in H2O (
double sum = 0;
for (int i = 0; i < reconstructionError.length; i++) {
  sum = sum + reconstructionError[i];
}
p.averageReconstructionError = sum / reconstructionError.length;

rawModel = (hex.genmodel.GenModel)
    Class.forName(modelClassName).newInstance();
EasyPredictModelWrapper model = new EasyPredictModelWrapper(rawModel);
RowData row = new RowData();
row.put("Feature1", "value1");
AutoEncoderModelPrediction p = model.predictAutoEncoder(row);
System.out.println("Reconstruction error is: " +
    p.averageReconstructionError);
}
use it for scoring a mock data point. We can re-adapt this code to be integrated
Any browser using simple add-ons, such as Postman in Chrome
curl, one of the most popular tools for client-side URL transfers
servers.
