
vision, AI, speech and data analysis

Valentino Zocca


About the Authors

and its applications. He currently holds the position of Head of Data Science at

www.PacktPub.com

Did you know that Packt offers eBook versions of every book published, with PDF

editorial process. To help us improve, please leave us an honest review on this book's

If you'd like to join our team of regular reviewers, you can e-mail us at

eBooks and videos in exchange for their valuable feedback. Help us be relentless in

Table of Contents

Chapter 1: Machine Learning – An Introduction

Medical

Autonomous car driving

Business

Pattern recognition

Speech production

A convolutional layer example with Keras to recognize digits

A convolutional layer example with Keras for cifar10

Early game playing AI

Implementing a Python Tic-Tac-Toe game

Training AI to master Go

Deep learning in Monte Carlo Tree Search

Policy gradients in AlphaGo

A supervised learning approach to games

great deal of public attention. Every day, deep-learning algorithms are used broadly

Starting with a quick recap of important machine learning concepts, the book will

delve straight into deep learning principles using scikit-learn. Moving ahead, you

, covers two of the most powerful and

often-used architectures for unsupervised feature learning: auto-encoders and

restricted Boltzmann machines.

Chapter 5

cortex works and introduces convolutional layers, followed up with a descriptive

Chapter 6

Matplotlib

H2O

scikit-learn

foundational understanding of machine learning concepts and some programming

pathnames, dummy URLs, user input, and Twitter handles are shown as follows:

The code above for drawing should be immediately clear; we just note that the line

Choose from the drop-down menu where you purchased this book from.

Click on Code Files.

book's webpage at the Packt Publishing website. This page can be accessed by

entering the book's name in the

Search

to your Packt account.

using the latest version of:

WinRAR / 7-Zip for Windows

Zipeg / iZip / UnRarX for Mac

7-Zip / PeaZip for Linux

The code bundle for the book is also hosted on GitHub at

PacktPublishing/Python-Deep-Learning

from our rich catalog of books and videos available at

PacktPublishing/

To view the previously submitted errata, go to https://www.packtpub.com/books/content/support and enter the name of the book in the search field. The required

Machine Learning – An Introduction

Different machine learning approaches

Steps involved in machine learning systems

Brief description of popular techniques/algorithms

Applications in real-life

A popular open source package

important business e-mails and some of which are unsolicited junk e-mails, or spam.

function

learning's recent successes are related to it being so effective in unsupervised

New data is created every day, very quickly, and labeling all the new data is

quite a laborious and time-consuming activity. One advantage of unsupervised

A simple example of reinforcement learning can be used to play the classic game of tic-tac-toe. In this case, each position on the board has an associated probability (a value), which is the probability of winning the game from that state based on previous experience.
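The value-table idea just described can be sketched in a few lines of Python. The update rule and all names here are illustrative, not the book's code: each state's win probability is nudged toward the value of the state that follows it.

```python
# Value table: maps a board state (any hashable key) to an estimated P(win).
values = {}

def value_of(state):
    # unseen states start at a neutral 0.5 probability of winning
    return values.get(state, 0.5)

def update(state, next_state, learning_rate=0.1):
    # temporal-difference style update: move V(state) toward V(next_state)
    v = value_of(state)
    values[state] = v + learning_rate * (value_of(next_state) - v)

# after a won game, back up the reward of 1.0 through the visited states
values["final"] = 1.0
update("middle", "final")
update("start", "middle")
```

Repeated over many games, the probabilities of states that tend to lead to wins drift upward, which is exactly the experience-based value described above.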


Another example may be given by attempting to use machine learning to predict the

trajectory of a ball thrown from the ground up in the air (not perpendicularly) until

it reaches the ground again. Physics teaches us that the trajectory is shaped like a

parabola, and we expect that a good machine learning algorithm observing thousands

of such throws would come up with a parabola as a solution. However, if we were to

consider the effect of the air due to turbulence, we might notice that the ball does not hold a steady trajectory but may be subject to small perturbations. This is what we call "noise". A machine learning algorithm that tried to model these small perturbations would fail to see the big picture and produce an unsatisfactory result: it would miss the process that makes the ball thrown from the ground follow a parabola.

techniques/algorithms

three classes discussed at the beginning of the book, supervised learning,

The following is by no means meant to be an exhaustive list or a thorough

input data to predict a value, for example the cost of a house

bottom graph the function has two local minima, and therefore, depending on the initialization,

the process may find the first local minimum that is not the global minimum.
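The dependence on initialization described in the caption can be reproduced with a few lines of Python. The function, learning rate, and starting points below are illustrative choices, not taken from the text:

```python
# f(x) = x^4 - 3x^2 + x has a global minimum near x = -1.30 and a local
# minimum near x = 1.13; which one descent finds depends on where it starts.
def gradient_descent(x, learning_rate=0.01, steps=2000):
    for _ in range(steps):
        grad = 4 * x**3 - 6 * x + 1        # derivative of x^4 - 3x^2 + x
        x -= learning_rate * grad
    return x

left = gradient_descent(-2.0)    # settles in the global minimum
right = gradient_descent(2.0)    # settles in the local minimum
```

Both runs follow the same rule on the same function; only the initialization differs, and only one of them finds the global minimum.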

Decision trees

Another widely used supervised algorithm is the decision tree algorithm. A decision tree algorithm creates a classifier in the form of a "tree". A decision tree consists of decision nodes, where tests on specific attributes are performed, and leaf nodes that indicate the value of the target attribute. Decision trees are a type of classifier that works by starting at the root node and moving down through the decision nodes

how very different the decision tree algorithm is from linear

indicate where the new franchises should be located and their corresponding delivery areas
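The franchise example is a clustering task; a toy k-means run shows how the centroids pick the franchise locations and each cluster becomes a delivery area. This is not the book's code; the customer coordinates and the first-k initialization are illustrative simplifications.

```python
import numpy as np

def kmeans(points, k, steps=20):
    # naive initialization: the first k points (fine for this toy data)
    centroids = points[:k].astype(float).copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(steps):
        # assign every point to its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # move each centroid to the mean of its assigned points
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

# two obvious groups of customers; centroids suggest franchise locations
customers = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                      [5.0, 5.0], [5.1, 5.2], [4.9, 5.1]])
centroids, labels = kmeans(customers, k=2)
```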

Probabilistically, what most machine learning techniques try to evaluate is the

of a certain event

For example, given the picture representing a digit (that is, a picture with a certain

into account the fact that the cancer is quite rare under 50; therefore, the positivity of

p(cancer), or more generally the probability p for the outcome we are trying to estimate, is called the prior probability, because it represents the probability of the event without any

At this point, we may wonder what would happen if we had more information, for

example if we performed a different test with different reliability, or knew some

information about the person such as recurrence of cancer in the family. In the

preceding equation we used, as one of the factors in the computation, the probability

p(test1=positive | cancer), and if we performed a second test and it came out positive, we would also have p(test2=positive | cancer). The naïve Bayes technique makes the assumption that each piece of information is independent of the others (this independence assumption is what gives the technique its name), so that:

p(test1 and test2=pos | cancer) = p(test1=pos | cancer) * p(test2=pos | cancer)

This equation is also called the likelihood L(test1 and test2 = pos), that is, the probability that both tests come out positive given the fact that the person does have cancer.

We can then rewrite the equation as:

p(cancer | both tests=pos) =
= p(both tests=pos | cancer) * p(cancer) / p(both tests=pos) =
= p(test1=pos | cancer) * p(test2=pos | cancer) * p(cancer) / p(both tests=pos)
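Plugging made-up numbers into the equation above makes the computation concrete. Every probability here is invented for illustration; only the formula comes from the text.

```python
# illustrative numbers, not real medical data
p_cancer = 0.01                      # prior p(cancer)
p_t1_pos_given_cancer = 0.9          # p(test1=pos | cancer)
p_t2_pos_given_cancer = 0.8          # p(test2=pos | cancer)
p_t1_pos_given_healthy = 0.1
p_t2_pos_given_healthy = 0.2

# naive Bayes independence assumption:
# p(both=pos | cancer) = p(test1=pos | cancer) * p(test2=pos | cancer)
likelihood_cancer = p_t1_pos_given_cancer * p_t2_pos_given_cancer
likelihood_healthy = p_t1_pos_given_healthy * p_t2_pos_given_healthy

# p(both tests=pos), by the law of total probability
p_both_pos = (likelihood_cancer * p_cancer
              + likelihood_healthy * (1 - p_cancer))

p_cancer_given_both = likelihood_cancer * p_cancer / p_both_pos
```

Even with two positive tests, the rarity of the disease (the prior) keeps the posterior well below the tests' individual accuracies, which is the point the text is making.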

machines over other

machine learning algorithms is that not only does it separate the data into classes,

predictive ability of the algorithm. As we have discussed above, in practice it is

The kernel trick instead involves mapping the space of features into another space

the vector

x

general). Hence, any vector

a = Σ_i w_i x_i

(the bias is included in this formula), and we have said that the neuron activates if the activation is greater than 0; this is called the threshold activity, because the neuron activates when the activation is greater than 0.
news in 2016 by beating world Go champion Lee Sedol. In January 2016, AlphaGo

The strength of AlphaGo was that it had not been programmed to play, but it had learned by playing thousands of games against itself using reinforcement learning and deep learning techniques. The ability to learn is what renders machine

algorithm chosen, such as

naive_bayes

pred = mlp.predict(data_test)
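A minimal end-to-end sketch of this scikit-learn workflow might look as follows; the toy data, hidden-layer size, and iteration count are my assumptions, not the book's setup.

```python
from sklearn.neural_network import MLPClassifier

# toy, linearly separable data: class 1 when x0 + x1 > 1 (illustrative)
data_train = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 0.8],
              [0.1, 0.3], [0.8, 0.9], [0.0, 0.4], [1.0, 0.7]]
labels_train = [0, 0, 1, 1, 0, 1, 0, 1]
data_test = [[0.1, 0.1], [0.9, 0.9]]

# fit the multi-layer perceptron and predict on held-out points
mlp = MLPClassifier(hidden_layer_sizes=(10,), activation="logistic",
                    max_iter=2000, random_state=0)
mlp.fit(data_train, labels_train)
pred = mlp.predict(data_test)
```

Swapping `MLPClassifier` for another estimator (such as one from `naive_bayes`) changes only the constructor line; the `fit`/`predict` calls stay the same, which is the appeal of the scikit-learn interface.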

pictures was adapted from similar code by Sebastian Raschka in his

Python Machine

Learning

plt.ylim(y.min(), y.max())

# plot the data

f(x) = 1 / (1 + e^(-x))

logistic activation function, the code would modify to be:

activation = "logistic")

supervised learning, unsupervised learning, and reinforcement learning. Classical

and W. Pitts. A Logical

Σ_i x_i w_i

where x_i is the value of each input neuron; the neuron activates when Σ_i x_i w_i + b exceeds 0.

absence of intra-layer connections, while neurons connect to each and every neuron

in the adjacent layer. There are several different types of activity functions:

The threshold activity function: f(a) = 1 if a > 0, and f(a) = 0 otherwise. The neuron fires only once its activation exceeds a certain value, and this is why it is called the threshold activity function.

The logistic sigmoid: f(a) = 1 / (1 + e^(-a))

The hyperbolic tangent: f(a) = (1 - e^(-2a)) / (1 + e^(-2a)) = (e^a - e^(-a)) / (e^a + e^(-a))

The rectified linear unit (ReLU): f(a) = a if a > 0, and f(a) = 0 otherwise. Like its threshold counterpart, it is a mix of the identity and the threshold function.
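These activity functions can be written directly in NumPy; a quick sketch (the function names are mine):

```python
import numpy as np

def threshold(a):
    # 1 where the activation is positive, 0 otherwise
    return (a > 0).astype(float)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh_act(a):
    return (1 - np.exp(-2 * a)) / (1 + np.exp(-2 * a))

def relu(a):
    return np.maximum(a, 0)

def sigmoid_derivative(a):
    s = sigmoid(a)
    return s * (1 - s)       # f'(a) = f(a) * (1 - f(a))
```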

sigmoid function. This can be quickly calculated by simply noticing that the derivative with respect to a of

f(a) = 1 / (1 + e^(-a))

is

f'(a) = e^(-a) / (1 + e^(-a))^2 = f(a) * (1 - f(a))

the activation Σ_i x_i w_i + b. The figure on the right is a sigmoid with weight 10 and bias 50.

to a step function. The larger the weight w, the more the sigmoid resembles a step function, while the point where the transition happens will be equal to the negative of the ratio b/w; that is, the sigmoid is centered at –b/w.

mapped to the output neuron with weights

w

logistic sigmoid activity function to each hidden neuron, and the identity function to

R

#modify weightValue as well as biasValue1 and biasValue2 to observe
#how the sigmoid changes (the example values below are illustrative)
import numpy
import matplotlib.pyplot as plt

weightValue = 100
biasValue1 = 5000
biasValue2 = -5000
x = numpy.arange(-100, 100, 0.1)
y1 = 1.0/(1.0 + numpy.exp(-weightValue*x - biasValue1))
y2 = 1.0/(1.0 + numpy.exp(-weightValue*x - biasValue2))
plt.plot(x, y1, x, y2)
print ("The step function starts at {0} and ends at {1}"
       .format(-biasValue1/weightValue, -biasValue2/weightValue))
plt.show()

The cost (error) function can be defined as:

J(w) = 1/2 * Σ_i (y_i - t_i)^2

where y_i is the output of the network for example i and t_i is its target. Since y_i = Σ_j w_j x_i,j, taking the derivative of the cost with respect to each component w_j of w gives:

∂J/∂w_j = Σ_i (y_i - t_i) x_i,j

where x_i,j indicates the j-th input value of the i-th example. To reduce the cost, each weight is moved against the gradient:

w_j → w_j - λ * ∂J/∂w_j

where λ is the learning rate; this iterative process is what is often called gradient descent. In vector notation, instead of writing the partial derivative for each of the components, the weights can be updated all at once:

w → w - λ∇J(w)
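The update rule can be checked on a one-weight linear neuron; the data and learning rate below are an illustrative sketch, not the book's code.

```python
import numpy as np

def gradient_descent_step(w, x, t, learning_rate=0.1):
    y = x.dot(w)                 # predictions, one per example
    grad = x.T.dot(y - t)        # dJ/dw for J = 1/2 * sum((y - t)^2)
    return w - learning_rate * grad

# fit y = 2*x exactly from three noiseless examples
x = np.array([[1.0], [2.0], [3.0]])
t = np.array([2.0, 4.0, 6.0])
w = np.zeros(1)
for _ in range(100):
    w = gradient_descent_step(w, x, t, learning_rate=0.05)
```

After a hundred steps the weight converges on the generating value of 2, because each step shrinks the remaining error by a constant factor.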

activation function is not going to be the identity function like before; rather, we are going to use the logistic sigmoid function. The output is the logistic sigmoid applied to the activation; therefore, for each example x, the probability that the target t equals 1, given the input x and the weights w, is:

P(t=1 | x, w) = σ(x·w)

and, for a generic binary target t:

P(t | x, w) = σ(x·w)^t * (1 - σ(x·w))^(1-t)

Writing a_i = σ(x_i·w) for each example, the cost function is the cross-entropy:

J(w) = -Σ_i [t_i * log(a_i) + (1 - t_i) * log(1 - a_i)]

Taking the derivative, most terms cancel and the gradient takes the same simple form as before:

∂J/∂w_j = Σ_i (a_i - t_i) x_i,j

We can generalize this equation using:

J(w) = -Σ_i log P(y_i | x_i, w)

that is, the negative log-likelihood of the targets under the model.

To derive the backpropagation rule, let w_i,j denote the weight connecting neuron i to neuron j, where the element with subscript i belongs to the layer preceding the layer containing the element with subscript j. In this example, layer 1 represents the input, and layer 2 the output; y_i is the output of neuron i, and a_j = Σ_i y_i w_i,j is the activation of neuron j. For a weight w_i,j connecting the two layers, the chain rule gives:

∂J/∂w_i,j = (∂J/∂y_j) * (∂y_j/∂a_j) * (∂a_j/∂w_i,j) = (∂J/∂y_j) * (∂y_j/∂a_j) * y_i

since ∂a_j/∂w_i,j = y_i. The factor ∂y_j/∂a_j is the derivative of the activity function we have chosen, which is a function that we can calculate; all we need to calculate is the derivative ∂J/∂y_j.

For an output neuron j, ∂J/∂y_j follows directly from the cost function. For a hidden neuron j, we can calculate ∂J/∂y_j by moving backward; the equation should read as the sum over all the outgoing connections from neuron j to each neuron k in the successive layer:

∂J/∂y_j = Σ_k (∂J/∂y_k) * (∂y_k/∂a_k) * w_j,k

If we define δ_j = (∂J/∂y_j) * (∂y_j/∂a_j), the error of neuron j with respect to its activation value, we can then rewrite the rule compactly as:

∂J/∂w_i,j = δ_j * y_i

and update each weight by w_i,j → w_i,j - λ * δ_j * y_i. In the next section, we will see how to apply these concepts and formulas.

neurons, and for the output layer only one neuron:

#Finally, we adjust the weights,
#using the backpropagation rules
for i in range(len(self.weights)):
    layer = y[i].reshape(1, nn.arch[i]+1)
    delta = delta_vec[i].reshape(1, nn.arch[i+1])
    self.weights[i] += learning_rate*layer.T.dot(delta)

This concludes our back-propagation algorithm; all that is left to do is to write a predict function to check the results:

def predict(self, x):
    val = numpy.concatenate((numpy.ones(1).T, numpy.array(x)))
    for i in range(len(self.weights)):
        val = self.activity(numpy.dot(val, self.weights[i]))
    return val

Notice the use of numpy.random.seed(0). This is simply to ensure that the weight

separating the regions will be quite different depending on the architecture chosen.
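A useful sanity check on the backpropagation formulas of this section is to compare the analytic gradient against a finite-difference estimate. The tiny network below is illustrative, not the book's implementation: one sigmoid hidden layer, one linear output neuron.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, w1, w2):
    hidden = sigmoid(x.dot(w1))
    return hidden, hidden.dot(w2)

def loss(x, t, w1, w2):
    _, y = forward(x, w1, w2)
    return 0.5 * (y - t) ** 2

x = np.array([0.5, -0.2])
t = 1.0
w1 = np.array([[0.1, 0.4], [-0.3, 0.2]])
w2 = np.array([0.7, -0.1])

# backpropagation: output error, then the hidden-layer deltas
hidden, y = forward(x, w1, w2)
delta_hidden = (y - t) * w2 * hidden * (1 - hidden)
grad_w1 = np.outer(x, delta_hidden)        # dJ/dw_ij = y_i * delta_j

# finite-difference estimate of dJ/dw1[0, 0]
eps = 1e-6
w1_plus, w1_minus = w1.copy(), w1.copy()
w1_plus[0, 0] += eps
w1_minus[0, 0] -= eps
numeric = (loss(x, t, w1_plus, w2) - loss(x, t, w1_minus, w2)) / (2 * eps)
```

The two numbers agree to several decimal places, which is strong evidence the chain-rule derivation was applied correctly.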

, we introduced machine learning

and techniques that can be used to implement machine learning. In

Fundamental concepts of deep learning

Applications of deep learning

GPU versus CPU

Popular open source libraries

in Proceedings of

Neural Information Processing Systems (NIPS) (2012) and, at the end of their paper,

of the Ising model to deep learning. In

and positive, so the third neuron will also be active. In the second figure, the first two neurons are off,

and their connection with the third neuron is positive, so the third neuron will also be off.


negative, so the third neuron will also be off. In the second figure, the first two neurons are off, and their

connection with the third neuron is large and negative, so the third neuron will likely be on.

The point of introducing this adaptation of the Ising model is to understand how

and large but negative connections with the others.

how we can select weights for our connections, to have the hidden

neurons start recognizing features of our input.

it can recognize features. Another, even more important, is that it will recognize

are the pixels that have an

x

neuron representing the mouth in the next layer will turn on anyway.

of abstraction of basic elements in the image and how they are structured. This

represent lines that intersect, and more complex representations by associating neurons that represent lines at

specific angles.

Autoencoders

: A class of unsupervised learning algorithms in which the

Task                     Total number    DNN             GMM-HMM with     GMM-HMM with
                         of hours of     (error rate)    same data        longer training
                         training data                   (error rate)     (error rate)
Switchboard (test1)      309             18.5            27.4             18.6 (2000 hrs)
Switchboard (test2)      309             16.1            23.6             17.1 (2000 hrs)
English Broadcast News   50              17.5            18.8             -
Bing Voice Search        24              30.4            36.2             -
Google Voice             5870            12.3            -                16.0 (5870 hrs)
YouTube                  1400            47.6            52.3             -

transformations of speech spectrograms. A spectrogram is a visual representation of

edges,

layer 5

shows entire objects.

Inference is the post-training phase where we deploy our trained

DNN. In a whitepaper published by GPU vendor Nvidia titled

GPU-Based Deep

Learning Inference: A Performance and Power Analysis

, available online at

why most popular open source libraries like Theano or TensorFlow allow you to

We refer to the Theano documentation for learning the first steps on how to use Theano, and we will implement some test code examples using Theano for deep learning in this book.

TensorFlow

a model, and we will use the

will use regular

, not sparse layers) specifying the number of input and output

neurons. For each layer, we specify the activity function of its neurons:

model.add(Dense(hidden_neurons, input_dim=input_size))

optimization (training rate, momentum, and so on). We are not going to modify the

default values, so we can simply pass:

neurons have learned:

import matplotlib.cm as cm

w = weights[0].T
for neuron in range(hidden_neurons):
    #draw each hidden neuron's incoming weights as a 28x28 image
    plt.imshow(w[neuron].reshape(28, 28), cmap = cm.Greys_r)
    plt.show()

The code for drawing above should be immediately clear; we just note that the following line is importing the

what we had before:

model.add(Dense(3000, input_dim=input_size))

model.add(Activation('sigmoid'))

model.add(Dense(2000, input_dim=3000))

introduce the readers to many of the concepts we have touched on in this one, like

Feature Learning


blob/master/hands-on_training/images/autoencoder.png)

The encoder maps the input x to an internal code c, and the decoder maps c back to a reconstruction x' of the input; training minimizes a reconstruction loss L(x, x'), which forces

the network to find the

minimum information

needed to reconstruct x.

The

at (1,3) with a standard deviation of 3 in the (0.866, 0.5) direction and of 1 in the orthogonal direction. The

directions represent the principal components (PC) associated with the sample. By Nicoguaro (own work) CC

BY 4.0 (http://creativecommons.org/licenses/by/4.0), via Wikimedia Commons.

not always enough.

Autoencoders have the advantage of being able to represent even non-linear

representations using a non-linear activation function.

One famous example of an autoencoder was given by MITCHELL, T. M. in his

book

corresponds exactly with the binary representation with three bits.

There are situations though where just the single hidden layer is not enough to

represent the whole complexity and variance of the data. Deeper architecture can

Autoencoder#/media/File:Autoencoder_structure.png)

In the case of binary inputs, we want to use sigmoid as the output activation

function and cross-entropy or, more precisely, the sum of Bernoulli cross-entropies.

For real values, we can use the

mean squared error

For different types of input data(

x

approach, which consists of the following steps:

1.

Finding the probability distribution of observing x, given

u

2.

mapping the code back to x', the reconstructed output.

with a given probability

v

for numerical inputs.

noise level.

We will use the noisy variant,

x

input dimensionality is

will write the loss function

J = Σ_i L(x_i, x'_i)

, the following quantity corresponds to the L2 norm of the Jacobian of the hidden representation h with respect to the input x:

||J_h(x)||^2 = Σ_i Σ_j (∂h_j(x) / ∂x_i)^2

and the regularized cost adds this penalty to the reconstruction error:

J_new = J + λ * ||J_h(x)||^2

ρ̂_j = (1/m) * Σ_i a_j(x^(i))

where a_j(x^(i)) is the activation of hidden unit j on the i-th example and m is the number of examples. We can regard ρ̂_j as the mean of a Bernoulli distribution B(1, ρ̂_j) and compare it with a Bernoulli distribution B(1, ρ), where ρ is the sparsity level we would like to achieve.

The Kullback-Leibler divergence between two distributions P and Q is defined as:

D_KL(P || Q) = Σ_x P(x) * log(P(x) / Q(x))

It equals 0 when Q(x) = P(x) for all x, and grows as the two distributions diverge; whenever P(x) = 0, the contribution of that term will be 0. In our case, the cost function with the sparsity penalty becomes:

J_new = J + λ * Σ_j KL(ρ || ρ̂_j)

backpropagation on each example.

the hidden layers, if any.


p(v) = e^(-E(v)) / Z

Here, v is our state, E is our energy function, and Z is the partition function:

Z = Σ_v e^(-E(v))

Before

Before

()

jj

aW

10

10

()

ii

jj

ij

Ev

In matrix notation, it is as follows:

()

2

Ev

vW

The

in the equation is because we are going through every pair of i and j

A model with weights W assigns to each state x the probability:

p(x | W) = e^(x^T W x) / Z(W)

Here, Z(W) is the partition function, summing e^(x'^T W x') over all states x'. We now have W as part of our model, so the distribution will change as W changes. Given n training samples x^(1), ..., x^(n), learning maximizes the log-likelihood Σ_n log p(x^(n) | W), and its derivative with respect to a weight w_i,j is:

∂/∂w_i,j Σ_n log p(x^(n) | W) = Σ_n x_i^(n) x_j^(n) - n * Σ_x x_i x_j p(x | W)

that is, the difference between how often units i and j are on together in the data and how often they are on together under the model's own distribution.

For the restricted Boltzmann machine, which adds a layer of hidden units h, the energy function becomes:

E(v, h) = -a^T v - b^T h - v^T W h

where a and b are the biases of the visible and hidden units, and a^T v = Σ_i a_i v_i. Now we need to take the gradients of our biases and weights with respect to this new energy function.

Because there are no connections within a layer, the conditionals factorize:

p(h_j = 1 | v, W) = σ(b_j + Σ_i v_i w_i,j)

p(v_i = 1 | h, W) = σ(a_i + Σ_j h_j w_i,j)

Here, i is going through each visible node and j through each hidden node, and σ is the logistic function. If we write v^0 for a data vector, h^0 for hidden states sampled from p(h | v^0), and v^1, h^1 for the states after one further Gibbs step, the gradient of the log-likelihood with respect to w_i,j is the difference of two expectations:

<v_i h_j>_data - <v_i h_j>_model

The first term is computed directly from the training data (the positive phase). The second term, the expectation under the model's own distribution (the negative phase),

is approximated by taking the Monte Carlo samples from
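The positive-phase-minus-negative-phase rule can be sketched as a single CD-1 update in NumPy. The sizes, the random data, and the omission of the bias updates are simplifications for illustration; this is not the book's implementation.

```python
import numpy as np

rng = np.random.RandomState(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

n_visible, n_hidden = 6, 3
W = rng.normal(0, 0.1, size=(n_visible, n_hidden))
v0 = rng.randint(0, 2, size=n_visible).astype(float)   # one training vector

# positive phase: p(h=1 | v0), then sample the hidden states
h0_prob = sigmoid(v0.dot(W))
h0 = (rng.uniform(size=n_hidden) < h0_prob).astype(float)

# negative phase: one Gibbs step down to the visible layer and back up
v1_prob = sigmoid(h0.dot(W.T))
h1_prob = sigmoid(v1_prob.dot(W))

# weight update: <v h>_data - <v h>_model
learning_rate = 0.01
W += learning_rate * (np.outer(v0, h0_prob) - np.outer(v1_prob, h1_prob))
```

Running many such updates over a dataset drives the model's correlations toward the data's correlations, which is exactly what the gradient expression above asks for.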

Implementation in TensorFlow

computational graph during usage. In this case, the

hold the

value we will be passing in, and the

want 784 values, one for each pixel, and the

a None dimension means that it can be of any size; so, this will allow us to send

our preceding equations. The argument passed to it is how the variable values

visible_bias = tf.Variable(tf.zeros([784]))

In terms of our preceding equations, hidden_activation corresponds to p(h^0 | v^0), the reconstructed visible layer corresponds to p(v^1 | h^0), and final_hidden_activation corresponds to p(h^1 | v^1).

weight_update = weights.assign_add(LEARNING_RATE *
    (positive_phase - negative_phase))

adds the given quantity to the variable. Here, 0.01 is our learning rate, and we scale

hidden_bias_update = hidden_bias.assign_add(LEARNING_RATE *
    tf.reduce_mean(hidden_activation - final_hidden_activation, 0))

Note that this is purely used for information; there is no backpropagation run against

epochs 9 loss 410.387

epochs 10 loss 313.583

command on the

placeholder:[mnist.train.images[0]]})

look like:

with different numbers of hidden nodes

do an almost perfect reconstruction of the image, with only a little blurring

around

the edges. But as

order to preserve only the most informative part of it and be able to deterministically

recognize our food, to run away from danger, to recognize our friends and family,

biological models

, the paper

published in 2012 by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton titled:

. Though the genesis

because of noise in the data, they may not be precisely plotted onto a parabola

In the second, we have matched our prediction to the data in such a way that it approximates the shape of the line connecting the data points, without matching the data points exactly. The second curve, though less precise on the current input, is more likely to predict future data points than the curve in the first figure.

plt.show()

values that are all the same. However, we can instead use different values, for

neighboring values of the input, add them up,

following images:

weight to a neuron

equal to the regular activation value calculated by summing the contributions of all input neurons

mediated by the corresponding connection weights.

picture. In that case, we would assume that the missing neurons would have value

0. Without padding, sliding a (filter_w x filter_h) filter over a (width x height) image produces an output of size (width – filter_w + 1) x (height – filter_h + 1). In general, if the input has size (I_w, I_h), the filter has size (F_w, F_h), the zero padding is P, and the stride is S, the output activation map size will be:

((I_w – F_w + 2P) / S + 1) x ((I_h – F_h + 2P) / S + 1)

and its depth is equal to the number of filters D_F in the convolutional layer. As we discussed, one of the advantages of convolutional layers is weight sharing. With no padding and a stride of 1, this reduces to:

O_w = I_w – F_w + 1
O_h = I_h – F_h + 1
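The output-size formula can be wrapped in a small checked helper; the parameter names are mine.

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    # O = (I - F + 2P) / S + 1, applied per spatial dimension
    return (input_size - filter_size + 2 * padding) // stride + 1

# a 28x28 input with a 5x5 filter, no padding, stride 1 gives 24x24;
# "same" padding of 2 with a 5x5 filter preserves a 32x32 input
```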

measure, which is the square root of the sum of all the squares. In practice,

also generally be applied to a fully connected layer, is to "drop" some neurons and

their connections with a given probability p.

During each training period, each neuron has probability p

implement a simple example of a convolutional layer using Theano.

The skimage module we imported can be used to load an image; we will import

A convolutional layer example with Keras

dropped once every four times.

This will take us above 99%. If we want to improve more (accuracy may vary due

to differences in initializations), we can also add more dropout layers, for example,

np.random.seed(0) #for reproducibility

model.compile(loss='categorical_crossentropy',

Very Deep Convolutional

Left: Visual illustration of the RNN recurrence relation:

recurrence relation. An important implication of this is that RNNs have memory

one to one, one to many, many to one, and many to many

RNN — how to implement and train

In the previous section, we briefly discussed what RNNs are and what problems they

activities we built during the forward step. This backward pass pops activities off the

S_t = f(U x_t + W S_(t-1))

and, by the chain rule, the gradient with respect to W involves the terms ∂S_t / ∂S_(t-1), propagated back through every earlier state.

Vanishing and exploding gradients

∂S_t / ∂S_m = Π_(j=m+1..t) ∂S_j / ∂S_(j-1)

so gradients flowing back over t - m steps are products of t - m Jacobians, which can shrink toward zero (vanishing) or blow up (exploding).

happens is that the cost surface we are training on is highly unstable. Using small

Gradient clipping, where we threshold the maximum value a gradient can

practice, due to

f_t = σ(W_f x_t + U_f h_(t-1) + b_f)
i_t = σ(W_i x_t + U_i h_(t-1) + b_i)
a_t = tanh(W_a x_t + U_a h_(t-1) + b_a)
o_t = σ(W_o x_t + U_o h_(t-1) + b_o)
c_t = f_t ∘ c_(t-1) + i_t ∘ a_t
h_t = o_t ∘ tanh(c_t)

The forget gate f_t decides how much of the previous cell state to keep, while the input gate i_t decides how much of the candidate a_t, which is derived from the previous output h_(t-1) and the current input x_t, to write into the new cell state c_t. Each gate takes the previous output and the current input as input and outputs 0 or 1 through the logistic function available for each cell block. Finally, the output gate o_t determines how much of tanh(c_t) is exposed as the new output h_t.
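The gate equations can be transcribed directly into NumPy. The weights below are random placeholders and this single-step sketch is illustrative, not the book's implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    Wf, Uf, bf, Wi, Ui, bi, Wa, Ua, ba, Wo, Uo, bo = params
    f = sigmoid(Wf.dot(x) + Uf.dot(h_prev) + bf)      # forget gate
    i = sigmoid(Wi.dot(x) + Ui.dot(h_prev) + bi)      # input gate
    a = np.tanh(Wa.dot(x) + Ua.dot(h_prev) + ba)      # candidate cell value
    o = sigmoid(Wo.dot(x) + Uo.dot(h_prev) + bo)      # output gate
    c = f * c_prev + i * a                            # new cell state
    h = o * np.tanh(c)                                # new output
    return h, c

rng = np.random.RandomState(0)
n_in, n_hidden = 4, 3
params = [rng.normal(0, 0.1, size=s) for s in
          [(n_hidden, n_in), (n_hidden, n_hidden), (n_hidden,)] * 4]
h, c = lstm_step(rng.normal(size=n_in),
                 np.zeros(n_hidden), np.zeros(n_hidden), params)
```

Note how the cell state c is only ever scaled by the forget gate and added to, which is the mechanism that lets gradients survive over long sequences.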

long sequence, say

infeasible. Calculating the joint probability of

P(w_1, ..., w_m) can be done by applying the following chain rule:

P(w_1, ..., w_m) = P(w_1) * P(w_2 | w_1) * P(w_3 | w_1, w_2) * ... * P(w_m | w_1, ..., w_(m-1))

approximated by an independence assumption that the probability of a word depends only on the n preceding words. For example, in the phrase "the quick brown fox", a bigram model approximates the probability of the word "fox" by P(fox | brown). In general:

P(w_i | w_1, ..., w_(i-1)) ≈ P(w_i | w_(i-n+1), ..., w_(i-1))

The independence assumption that the i-th word is only dependent on the previous n-1 words lets us rewrite the joint probability as:

P(w_1, ..., w_m) ≈ Π_i P(w_i | w_(i-n+1), ..., w_(i-1))

For a bigram model (n=2), this becomes:

P(w_1, ..., w_m) ≈ P(w_1) * P(w_2 | w_1) * P(w_3 | w_2) * ... * P(w_m | w_(m-1))
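For n=2 the approximation reduces to counting bigrams; a minimal sketch (the toy corpus is illustrative):

```python
from collections import Counter

corpus = "the quick brown fox jumps over the lazy dog the quick fox".split()
unigrams = Counter(corpus)                 # counts of single words
bigrams = Counter(zip(corpus, corpus[1:])) # counts of adjacent word pairs

def p_next(word, prev):
    # P(w_i | w_{i-1}) estimated from counts
    return bigrams[(prev, word)] / unigrams[prev]
```

In this corpus "the" is followed by "quick" twice and by "lazy" once, so the model assigns them probabilities 2/3 and 1/3; a real language model does the same with vastly larger corpora and smoothing for unseen pairs.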

(each word is encoded uniquely). The embedding function then transforms this V-dimensional

space into a distributed representation of size D (here D=4).

the words. It associates each word in the vocabulary with a continuous-valued

The embedding layer takes the one-hot representation of the word w_i and converts it into its embedding by multiplying it with the embedding matrix C. This computation can be efficiently implemented by a table lookup. The embedding matrix C is shared over all the words, so all words use the same embedding function. C is represented by a V * D matrix, where V is the size of the vocabulary and D the size of the embedding.

The resulting embeddings are concatenated into a hidden layer; after this, a bias b and a nonlinear function, such as tanh, can be applied. The output of the hidden layer is thus represented by the function z = tanh(concat(w_(t-n+1), ..., w_(t-1)) + b). From the hidden layer, we can now output the probability distribution of the next word w_t by multiplying the hidden layer with U. This maps the hidden layer to the

and a model of the probability function for sequences of words. It is able to

learn about a model based

state_variables = tf.python.util.nest.pack_sequence_as(

self.initial_state,

for var in tf.python.util.nest.flatten(initial_state)))

probabilities_flat = tf.nn.softmax(logits_flat)

# Reshape to original batch and sequence length

Training

applications, ranging from speech recognition to creating intelligent chat bots that

words. For example, the word "bat" is composed of three phonemes

recognition system. A typical speech recognition pipeline takes in an audio signal

This means that roughly 50,000 amplitude samples were taken for this 1.2-second

For only a small example, these are a lot of points over the time dimension. To

reduce the size of the input data, these audio signals are typically preprocessed

which is a representation of how the frequencies in the signal change over time,

windows and taking the Fourier transform of each of these windows. The Fourier

Say the previous "hello world" recording is divided into overlapping windows

of 25 ms with a stride of 10 ms. The resulting windows are then transformed into

a frequency space with the help of a windowed Fourier transform. This means

that the amplitude information for each time step is transformed into amplitude

spoken as text. This can be

done by learning a time-dependent model that takes in a sequence of audio features,

as described in the previous section, and outputs a sequential distribution of possible

The acoustic model tries to model the likelihood that a sequence of audio features

was generated by a sequence of words or phonemes:

P(audio features | words) = P(audio features | phonemes) * P(phonemes | words)

A typical speech recognition acoustic model, before deep learning became popular, would use hidden Markov models (HMMs) combined with Gaussian mixture models (GMMs) to model each sliding

window of acoustic features. HMMs are used to model the sequential structure of

data, while GMMs model the local structure of the signal.

The HMM

assumes that successive frames are independent

given the hidden state of

the HMM. Because of this strong conditional independence assumption, the acoustic

features are typically decorrelated.

Doing this reduction means that multiple output sequences can be reduced to the same output labeling. To find the most likely output labeling, we have to add all the paths that correspond to that labeling. The task of searching for this most probable output labeling is known as decoding (see the Decoding section).

An example of such a labeling in speech recognition could be outputting a sequence

of phonemes, given a sequence of acoustic features. The CTC objective's function,

built on top of an LSTM, has been to give state-of-the-art results on acoustic modeling

need of using HMMs to model temporal variability [40], [41].

The most likely word sequence, given a sequence of audio features, is found by

searching through all the possible word sequences [33]. A popular search algorithm

is the Viterbi algorithm [45], a dynamic programming search algorithm that finds the most likely sequence of hidden states in an HMM.

this chapter by mentioning end-to-end techniques. Deep learning

Summary

In the beginning of this chapter, we learned what RNNs are, how to train them, what problems might occur during training, and how to solve these problems. In the second part, we described the problem of language modeling and how RNNs help us

[8] Paul J. Werbos (1990). "Backpropagation Through Time: What It Does

and How to Do It" Proceedings of the IEEE. URL:

edu/~martinez/classes/678/Papers/Werbos_BPTT.pdf

[9] Razvan Pascanu and Tomas Mikolov and Yoshua Bengio. (2012).

proceedings.mlr.press/v28/pascanu13.pdf

[33] Mark Gales and Steve Young. (2007). "The Application of Hidden

Markov Models in Speech Recognition". URL:

ac.uk/~mjfg/mjfg_NOW.pdf

[34] L.R. Rabiner. (1989). "A tutorial on hidden Markov models and

selected applications in speech recognition". URL:

ca/~murphyk/Bayes/rabiner.pdf

[45] G.D. Forney. (1973). "The viterbi algorithm". URL:

systems.caltech.edu/EE/Courses/EE127/EE127A/handout/

ForneyViterbi.pdf

[46] Xavier L. Aubert (2002). "An overview of decoding techniques for large

vocabulary continuous speech recognition". URL:

afs/cs/user/tbergkir/www/11711fa16/aubert_asr_decoding.pdf

[47] Alex Graves and Navdeep Jaitly. (2014). "Towards End-To-End Speech

Board Games

what life in the 21st century would look like. They imagined a world of people

as checkers and chess. Eventually, we'll build up enough knowledge to be able to

Agent

is the player that will try to find its way through the maze.

Environment

Reward

They are both games of perfect information. The entire state of the game is

always known to both the players unlike a game such as poker, where the


value of the best possible action that the player could make. This gives us the value

if has_3_in_a_line([board_state[2 - i][i] for i in range(3)]):

min-max for the whole game from the board's starting position until we have gone

The first two arguments to the method, which we are already familiar with, are board_state and side; however, max_depth is new. Min-max is a recursive algorithm, and max_depth will be the maximum number of recursive calls we will

starts. If you have reached

player. If you haven't reached

if best_score is None or score > best_score:
    best_score = score
    best_score_move = move

score of +5. The max node will never choose a worse outcome than this. But now that

position and are known to lead to favorable or unfavorable positions. For example,

Training AI to master Go

those that involve min-max. If we go back to chess as an example, in the position

winning move; however, if the samples randomly move, black has an opportunity to win

trees

only gives an average value; though it allows us to

work with much larger state spaces that cannot be evaluated with Min-Max. Is there

If we play machine

that the true mean is greater than the sample mean—

Hoeffding's inequality: Hoeffding, Wassily (1963).

bounded random variables

{}

PE

xx

ue

�+

pe

pn

=

ln

p

p

u

=

pn

ln

n

u

=

UCB1

) algorithm. We can substitute our values with the

values in the multiarmed bandit problem we saw earlier, where

reward received from the machine

i

the sum of plays across all machines:

ii

nn
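UCB1 transcribes directly into a small function, with the same shape as the upper_confidence_bounds call used in the Python code later in this chapter (the exact signature is my assumption, and the machine statistics below are made up for illustration):

```python
import math

def upper_confidence_bounds(total_reward, samples, log_total_samples):
    mean = total_reward / samples
    # exploitation term (mean) plus exploration bonus sqrt(2 ln n / n_i)
    return mean + math.sqrt(2 * log_total_samples / samples)

# machine A: mean 0.5 over 10 plays; machine B: mean 0.4 over only 2 plays
log_n = math.log(12)
ucb_a = upper_confidence_bounds(5.0, 10, log_n)
ucb_b = upper_confidence_bounds(0.8, 2, log_n)
```

Even though machine B has the lower sample mean, its larger exploration bonus gives it the higher UCB1 score, so it gets tried next; this is how the algorithm balances exploration against exploitation.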

discuss next. We will do successive rollouts from the current board state. At each

The preceding diagram illustrates how this happens. In position A, there

is statistics collected for all four possible moves. Because of this, the UCB1

position B

have statistics collected on them. Because of this, the move you need to

We add statistics for any position that we passed through that already has

Here is what this looks like in Python for our

total_samples

move, current_side)

for move in

available_moves(current_board_state)}

if not move_states:

If there are no moves to make, we are in a terminal state: it is a draw. So we need to record that as

move_states):

log_total_samples = math.log(sum(state_samples[s]

for s in move_states.values()))

move, state = max(move_states.items(),
    key=lambda item: upper_confidence_bounds(state_results[item[1]],
        state_samples[item[1]], log_total_samples))

else:

move = random.choice(list(move_states.keys()))

otherwise, we make the selection randomly:

state that the move puts us in:

current_side))

if current_board_state not in state_samples:

arise is that the moves used in the evaluation are selected randomly when we know

The expected reward under a policy is:

$$E[R] = \sum_{a} P(a \mid s)\, R(s, a)$$

Taking the gradient with respect to the policy parameters:

$$\nabla E[R] = \sum_{a} \nabla P(a \mid s)\, R(s, a)$$

Using the identity $\nabla f(x) = f(x)\, \nabla \log f(x)$, we can rewrite this as:

$$\nabla E[R] = \sum_{a} P(a \mid s)\, \nabla \log P(a \mid s)\, R(s, a)$$
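To make the log-derivative trick concrete, here is a small NumPy sketch (names are illustrative, not the book's code) of the gradient of E[R] with respect to the logits of a softmax policy, for a single sampled action:

```python
import numpy as np

def reinforce_gradient(logits, action, reward):
    # softmax policy probabilities over the actions
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()
    # gradient of log P(action | s) w.r.t. the logits: one_hot(action) - probs
    grad_log_p = -probs
    grad_log_p[action] += 1.0
    # single-sample estimate of grad E[R]; the P(a|s) factor is absorbed
    # by having sampled the action from the policy itself
    return reward * grad_log_p
```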

2. Run a batch of episodes with our agent running in its environment, selecting its actions by sampling from the policy's output probabilities.

The per-step loss is the cross-entropy between the action taken and the network output:

$$c = -\sum_{j} y_j \log x_j$$

where $x_j$ is the network's output probability for action j and $y_j$ is the one-hot encoding of the action actually taken.

hidden_weights_3 = tf.Variable(

tf.truncated_normal((HIDDEN_NODES[1], HIDDEN_NODES[2]), stddev=1. /

np.sqrt(HIDDEN_NODES[1])))

output_weights = tf.Variable(tf.truncated_normal((HIDDEN_NODES[-1],

OUTPUT_NODES), stddev=1. / np.sqrt(OUTPUT_NODES)))

reward we receive from the environment, in this case, the result of our game of

tic-tac-toe. The other is meant for the actual action we will take at each time step.

In the make_move method, we do a few different things. First, we flatten board_state, which starts as a two-dimensional array, into the one-dimensional array that we need to

examples go into our mini-batches for training. Many different values of this work

well; 100 is one of these.

we have done. This will track when we need to kick off a mini-batch training:

reward =

a game using our old friend, the

Policy gradients in AlphaGo


front of a large audience with a $1,000,000 prize for the winner. Lee Sedol was very

runs a lot wider. Many of the problems that we encounter can be formulated in similar terms, such as

A supervised learning approach to games

$$V = \sum_{t} \gamma^{t} r_t$$

In this equation, V is the reward for a sequence of actions taken, $r_t$ is the reward received at time t in this sequence, and $\gamma$ is a constant, where $0 \le \gamma \le 1$, which means rewards further in the future are less valuable than rewards achieved sooner; this is often referred to as the discount factor. If we go back to our maze, this function
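The discounted sum can be computed with a single backward pass over an episode's rewards; a minimal sketch:

```python
def discounted_returns(rewards, gamma=0.9):
    # V_t = r_t + gamma * V_{t+1}, accumulated backwards from the last step
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    return returns

print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # -> [0.25, 0.5, 1.0]
```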

The Q-function gives the expected discounted reward of taking action a in state s and acting optimally afterwards. It satisfies the Bellman equation:

$$Q(s_t, a_t) = r_t + \gamma\, \max_{a} Q(s_{t+1}, a)$$
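In the tabular case, this equation turns directly into an incremental update rule; a minimal sketch (the dictionary-based storage and parameter names here are illustrative):

```python
from collections import defaultdict

q = defaultdict(float)  # maps (state, action) pairs to estimated values

def q_update(state, action, reward, next_state, next_actions,
             alpha=0.1, gamma=0.95):
    # move Q(s, a) a fraction alpha toward the target r + gamma * max_a' Q(s', a')
    best_next = max((q[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
```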

We create a state_batch, each item of which is one of the states in the game,

minus_action_q_value = DISCOUNT_FACTOR * \
    (states[minus_action_index] + rewards[minus_action_index])
if terminal[plus_action_index]:
    plus_action_q_value = DISCOUNT_FACTOR * states[plus_action_index]
else:
    plus_action_q_value = DISCOUNT_FACTOR * \
        (states[plus_action_index] + rewards[plus_action_index])

We use the tanh activation function in the hidden layers.


hits capacity will start removing items from the beginning of the queue. The deque here has a maxlen of 20,000, which means we will only store the last 20,000 observations.
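The behavior of a bounded deque is easy to see in isolation (a maxlen of 3 here, purely for illustration):

```python
from collections import deque

observations = deque(maxlen=3)
for step in range(5):
    observations.append(step)
# the two oldest items were silently dropped once capacity was reached
print(list(observations))  # -> [2, 3, 4]
```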

decided on from the previous main loop. It will be a one-hot vector:

env.render()

last_action = choose_next_action(last_state)

current_state, reward, terminal, _ =

env.step(np.argmax(last_action))

to take based on the

train on:

actions = [d[1] for d in mini_batch]

rewards = [d[2] for d in mini_batch]

current_states = [d[3] for d in mini_batch]

200 turns. After 400 games, we beat that comfortably, averaging well over 300 turns

cart pole task we just looked at. For cart pole, if a bad move is made that leads to the

1: Move left

2: Stay still

3: Move right

scores.append(reward_per_game)

see the path of the ball and the direction in which both paddles are moving:

that we need to map our three actions to. This sounds like a lot, but the problem

tf.nn.conv2d(hidden_convolutional_layer_2,

convolution_weights_3, strides=[1, 1, 1, 1],

padding="SAME") + convolution_bias_3)

convolutional layer,

our convolutional layer, which has dimensions (None, 9, 11, 64), into a

feed_forward_bias_1 = tf.Variable(tf.constant(0.01,

shape=[FLAT_HIDDEN_NODES]))

final_hidden_activations = tf.nn.relu(

tf.matmul(hidden_convolutional_layer_3_flat,

feed_forward_weights_1) + feed_forward_bias_1)

feed_forward_weights_2 =

tf.Variable(tf.truncated_normal([FLAT_HIDDEN_NODES,

feed_forward_bias_2 = tf.Variable(tf.constant(0.01,

shape=[ACTIONS_COUNT]))

output_layer = tf.matmul(final_hidden_activations,

feed_forward_weights_2) + feed_forward_bias_2

is the combination of multiple frames, for breakout

use it.

Another issue we may run into is that while the cart pole is trained in as little as a

couple of minutes, for Breakout, training will be measured in days. To guard against

random. This same Q-learning algorithm has been tried on a wide range of Atari

The network is trained to minimize the squared difference between its prediction and the target value:

$$L = \frac{1}{2}\bigl(U(s, a) - Q(s, a)\bigr)^2$$

where $U(s, a)$ is the one-step target built from the observed reward:

$$U(s, a) = r + \gamma \max_{a'} Q(s', a')$$

Q-learning for a computer game, neither technique

is limited to that type. Originally,

Model-based learning: In this approach, which will be discussed in more

at is the baseline actor critic. Here, the critic tries to learn the average performance of

the current state, $b(s_t)$, and the reward actually received is compared against that baseline, so the gradient estimate becomes:

$$\nabla E[R] \approx \sum_{t} \nabla \log P(a_t \mid s_t)\,\bigl(R_t - b(s_t)\bigr)$$

massively reducing the variance of training. If we run the cart pole task once using

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t)$$

The one-step estimate of Q in terms of V is:

$$Q(s_t, a_t) = r_t + \gamma V(s_{t+1})$$

We can substitute that in to give us our advantage function purely in terms of V:

$$A(s_t, a_t) = r_t + \gamma V(s_{t+1}) - V(s_t)$$
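This final form needs only the reward and the value estimates of two consecutive states; a minimal sketch:

```python
def advantage(reward, value_next, value_current, gamma=0.9):
    # A(s_t, a_t) = r_t + gamma * V(s_{t+1}) - V(s_t)
    return reward + gamma * value_next - value_current
```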

gives the next state given the current state and action pair:

$$s_{t+1} = f(s_t, a_t)$$

The first example will reuse the MNIST digit dataset that you have seen in Chapter 3, Deep

In a log of transactions, we observe that a particular customer spends an average of $10 for their lunch every weekday. Suddenly, one day they spend $120. This is certainly an outlier, but perhaps that day they decided to pay the whole bill with their credit card. If a few of those transactions are orders of magnitude higher than their expected amount, then we could identify an anomaly: for instance, transactions of $120 or higher over three consecutive orders. In this scenario, we are talking of anomalies because a pattern of repeated and linked outliers has been generated from outside the expected distribution, but still acceptable. The effect on the distributions

Even if systematic, the new pattern may have a reasonable explanation. The fraudulent activity would soon turn into the new expected behavior: the normality.

would do for many machine

but also the toughest. It is like a chess game. On one side, you have subject matter

unsupervised approaches. In addition, we also propose a semi-supervised schema:

Supervised

: Anomaly

models such as Multivariate Gaussian Distribution.


to working with R, Spark, and Python pandas data frames.

The backend can then be switched among different engines. It can run locally in your

machine or it can be deployed in a cluster on top of Spark or Hadoop MapReduce.

H2O will automatically handle the memory occupation and will optimize the

execution plan for most of the data operations and model learning.

It provides a very fast scoring of data points against a trained model; it is advertised

to run in nanoseconds.

In addition to traditional data analysis and machine learning algorithms, it features a

few very robust implementations of deep learning models.

The general API for building models is via the

train = train_with_label[predictors]

neurons using a hyperbolic tangent as an activation function and 100 epochs (100

df = digit.as_data_frame()

the left) and the "bad" (on the right). The rightmost tail (greater than 0.05) could be


If we look

bottom-left stroke of the digit, which is so long that it almost touches the corner.

score (~1.62) is probably explained by the abnormal handwriting style.

This shows that the model may mark an image as anomalous due to a licit property simply because the training data does not

test_ugly = sorted_test_with_error_df.tail(100)

plot_multi_digits(test_ugly, 10, 10, "ugly digits")

the easiest digit to write due to its simple structure of a straight line. Thus, digits of

The

handwriting. Thus, those are most likely to represent "anomalies". They are most

Please be aware that different runs may lead to different results, due to the randomness introduced for scalability reasons by the race conditions of the Hogwild! algorithm. Hence, we

reproducible=True)

model.train(

training_frame=train_ecg

)

indistinguishable, which means they will have a higher reconstruction error.

layer_column = 'DF.L{}.C'.format(layer + 1)

columns = [layer_column + '1', layer_column + '2']

ax.annotate(k, v, size=20, verticalalignment='bottom',

horizontalalignment='left')

fig.canvas.draw()

representation of data points seed {}".format(seed))

What is more interesting is that the outlier points (marked with 20, 21, and 22)

We could then use the auto-encoder for reducing the dimensionality followed by

an unsupervised approach (for example, density-based clustering) to group similar

A comprehensive list of techniques for adaptive learning

How to validate both in presence and absence of ground truth

For a hidden unit computing $h = \sum_{i=1}^{d} w_i x_i$ over d independent inputs, the expectation factorizes:

$$E[h] = \sum_{i=1}^{d} E[w_i]\, E[x_i]$$

With zero-mean weights and inputs, the mean will be zero and the variance becomes:

$$\mathrm{Var}(h) = d\, \sigma_w^2\, \sigma_x^2$$

The output of the hidden unit j will be transformed by the activation function, $y_j = a(h_j)$, and with the scaling above $E[h_j] = 0$.

Consider a deep auto-encoder with layers $L_0, L_1, L_2, L_3, L_4$, where $L_0$ is the input data and $L_2$ the central encoding layer. We can pre-train it as a stack of single-layer auto-encoders, one for $(L_0, L_1)$ and one for $(L_1, L_2)$, each trained to reconstruct its own input; these will be used to initialize the weights of the deep network. The decoding layers share the same initial weights and bias of the encoding layers, so the full stack $L_0, L_1, L_2, L_3, L_4$ decomposes into $(L_0, L_1)$, $(L_1, L_2)$, and the mirrored decoding path $(L_2, L_3, L_4)$.

In general, if the deep auto-encoder has N layers, we can treat it as a stack of single-layer auto-encoders $(L_i, L_{i+1})$ for $i = 0, \dots, \frac{N}{2}-1$, where each decoding layer $L_{N-i}$ mirrors the corresponding encoding layer $L_i$.

This mirroring is expressed by initializing the decoding layer $B(j)$, the mirror of encoding layer j, with the transposed weights and the same bias:

$$W_{B(j)} = W_{j}^{T}, \qquad b_{B(j)} = b_{j}$$

backpropagation and

While SGD is the de facto most popular training algorithm for many machine learning problems, several parallel variants have been proposed in the literature, but most of them are bottlenecked by locking and synchronization between processors, without taking advantage of the sparsity typical of many problems.

In the Hogwild! formulation, the cost function is separable into sparse components:

$$f(x) = \sum_{e \in E} f_e(x_e)$$

where e is a small subset of the indices of the input vector x, and $x_e$ is the portion of the vector x indexed by e. Each processor repeatedly samples e uniformly from E, computes the gradient $G_e(x)$ of $f_e$ at x, and updates a single component at a time:

$$x_v \leftarrow x_v - \gamma\, b_v^{T} G_e(x)$$

where $\gamma$ is the step size and $b_v$ is the standard basis vector that corresponds to a selected index v of e.

Because computing the gradient is not instantaneous, any processor may be working with a value of x that was read many clock cycles earlier and has since been overwritten by other processors. The key contribution of the Hogwild! paper lies in providing conditions under which this asynchronous, incremental gradient algorithm converges.

If $\theta_t$ represents the quantity we are updating at iteration t, the momentum update is:

$$\theta_{t+1} = \theta_t + v_t, \qquad v_t = \mu v_{t-1} - \lambda \nabla L(\theta_t)$$

where $\mu$ is the momentum coefficient and $\lambda$ the learning rate.

Too much momentum could lead to divergence, though. Suppose we are running SGD with momentum $\mu$ and that the gradient stays roughly constant; the velocity then accumulates geometrically, approaching $v_\infty = -\lambda \nabla L / (1 - \mu)$, so the effective step is amplified by a factor of $1/(1-\mu)$.

Moreover, at the beginning of the learning, there may already be large gradients (the

effect of weights initialization). Thus, we would like to start with a small momentum

(for example, 0.5); once the large gradients have disappeared, we can increase the momentum.
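A single momentum step, matching the equations above (a minimal scalar sketch; in practice theta, velocity, and grad would be arrays):

```python
def momentum_step(theta, velocity, grad, lr=0.01, mu=0.9):
    # v_t = mu * v_{t-1} - lr * grad(L);  theta_{t+1} = theta_t + v_t
    velocity = mu * velocity - lr * grad
    return theta + velocity, velocity
```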

Nesterov momentum computes the gradient at the look-ahead point $\theta_t + \mu v_{t-1}$:

$$v_t = \mu v_{t-1} - \lambda \nabla L(\theta_t + \mu v_{t-1})$$

which, if we omit the second derivative, can be approximated as:

$$v_t \approx \mu v_{t-1} - \lambda \nabla L(\theta_t)$$

Adagrad adapts the learning rate separately for each dimension by accumulating the history of squared gradients:

$$G_{t,i} = \sum_{\tau=1}^{t} \bigl(\nabla L(\theta_{\tau})_i\bigr)^2$$

$$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,i} + \epsilon}}\, \nabla L(\theta_t)_i$$
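One Adagrad step over a parameter vector, matching the equations above (a minimal sketch; the eta and eps values are illustrative):

```python
import numpy as np

def adagrad_step(theta, grad, accum, eta=0.01, eps=1e-8):
    # accumulate the squared gradient for each dimension...
    accum = accum + grad ** 2
    # ...and shrink that dimension's step by 1 / sqrt(accumulated history)
    theta = theta - eta * grad / (np.sqrt(accum) + eps)
    return theta, accum
```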

property that progress along each dimension evens out over time, just like second-

Adadelta replaces the full sum of squared gradients with an exponentially decaying average:

$$E[g^2]_t = \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t^2$$

$$RMS[g]_t = \sqrt{E[g^2]_t + \epsilon}$$

In order to make those equations correct, we shall ensure that the units are matching: the update should have the same units as the parameter. Adadelta therefore also tracks a decaying average of the squared updates:

$$E[\Delta\theta^2]_t = \rho\, E[\Delta\theta^2]_{t-1} + (1-\rho)\, \Delta\theta_t^2$$

and scales the gradient by the ratio of the two RMS terms:

$$\Delta\theta_t = -\frac{RMS[\Delta\theta]_{t-1}}{RMS[g]_t}\, g_t$$
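Putting the two running averages together gives one Adadelta step (a minimal sketch; the rho and eps values are illustrative):

```python
import numpy as np

def adadelta_step(theta, grad, eg2, edx2, rho=0.95, eps=1e-6):
    # decaying average of squared gradients: E[g^2]_t
    eg2 = rho * eg2 + (1 - rho) * grad ** 2
    # update scaled by RMS of past updates over RMS of gradients
    dx = -np.sqrt(edx2 + eps) / np.sqrt(eg2 + eps) * grad
    # decaying average of squared updates: E[dx^2]_t
    edx2 = rho * edx2 + (1 - rho) * dx ** 2
    return theta + dx, eg2, edx2
```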

distribution of computation

and data processing in a map/reduce fashion.

We can describe the high-level distributed algorithm as follows:

1. Initialize: set up the initial model on each node.
2. Shuffling: distribute (and possibly replicate) the training data across the nodes. We will cover this data replication problem at the end of the paragraph.
3. Map: each node trains a model on its local data.
4. Reduce: average the partial models into the final one. Implemented as a parameter sum plus a sample count, followed by a single division, averaging becomes an associative and commutative (monoidal) operation, which makes it a valid reduce step.
5. Validate: score the averaged model on the validation data.
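The Reduce step for model averaging can be sketched with NumPy (assuming each node reports its weight arrays in the same order):

```python
import numpy as np

def average_models(per_node_weights):
    # elementwise mean of corresponding weight arrays across all nodes;
    # the underlying summation is associative and commutative, so the
    # order in which partial models are reduced does not matter
    return [np.mean(np.stack(arrays), axis=0)
            for arrays in zip(*per_node_weights)]

node_a = [np.array([1.0, 2.0]), np.array([10.0])]
node_b = [np.array([3.0, 4.0]), np.array([20.0])]
averaged = average_models([node_a, node_b])
```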

points in each node and p the number of processors (the nodes). The linear term is

In the preceding formula, we are not considering the memory occupation and the

train_samples_per_iteration  replicate_training_data  Description
0                            False                    Only one epoch, averaging over N models
-1                           True                     Every node trains on the full replicated dataset in each iteration

Sparkling Water is an alternative to the ML and MLlib frameworks for doing machine learning, and one of the few alternatives for deep learning on top of Spark. Spark is designed and implemented in Scala. In order to understand the

SQL type      H2O type
BinaryType    Numeric
ByteType      Numeric

Testing

system that:

Takes some of the available data

Divides it into training and validation sets, making sure not to introduce biases or

In our context, we want to quantify the quality of our theory and verify that it is good enough and that it has an evident advantage with respect to a much simpler theory

Fruitfulness

Is it

connections and marking them as normal or suspicious. This will allow the business

Cross-fold validation introduces a problem with class imbalance. By

a scoring function $s : \mathbb{R}^d \rightarrow [0, 1]$: the higher the score, the higher the probability of being an anomaly. For auto-encoders, it could simply be the MSE computed on the reconstruction error and rescaled to be in the range [0, 1]. We can now validate using either the ROC or PR curve.
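Rescaling reconstruction errors into [0, 1] scores is a one-line min-max normalization; a minimal sketch:

```python
def rescale_scores(mse_values):
    # map raw reconstruction errors to [0, 1]; higher means more anomalous
    lo, hi = min(mse_values), max(mse_values)
    return [(m - lo) / (hi - lo) for m in mse_values]

print(rescale_scores([1.0, 2.0, 3.0]))  # -> [0.0, 0.5, 1.0]
```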

                    Predicted anomaly      Predicted non-anomaly
True anomaly        True Positive (TP)     False Negative (FN)
True non-anomaly    False Positive (FP)    True Negative (TN)

From these counts we can derive the standard measures of performance:

$$\text{True positive rate (Recall)} = \frac{TP}{TP + FN}$$

$$\text{False positive rate} = \frac{FP}{FP + TN}$$

The ROC curve is a curve consisting of the true positive rate plotted as a function of the false positive rate.

$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

More generally, the $F_\beta$ score weights recall $\beta$ times as much as precision:

$$F_\beta = (1+\beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
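Computed directly from the confusion-matrix counts, these measures are a few lines of Python (a minimal sketch):

```python
def f_measure(tp, fp, fn, beta=1.0):
    # precision = TP / (TP + FP); recall = TP / (TP + FN)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F_beta weights recall beta times as much as precision
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)
```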

We can also use the Mass-Volume (MV) curve. Given a scoring function $s : \mathbb{R}^d \rightarrow \mathbb{R}_+$, the scoring function will give an ordering of each observation. Let $f : \mathbb{R}^d \rightarrow \mathbb{R}_+$ be the probability density function of the normal distribution of the data X, and Leb the Lebesgue measure. For each threshold t, define the mass $\alpha(t) = P(s(X) \ge t)$ and the volume $\lambda(t) = Leb(\{x : s(x) \ge t\})$.

UMR LTCI No. 5141, Telecom ParisTech/CNRS

The MV curve of the scoring function s is defined, for a target mass $\alpha$, as the smallest volume achieving that mass:

$$MV_s(\alpha) = \inf_{t \ge 0} \; Leb(\{x : s(x) \ge t\}) \quad \text{such that} \quad P(s(X) \ge t) \ge \alpha$$

The performance criterion for MV is:

$$C^{MV}(s) = \lVert MV_s - MV_f \rVert_{L^1(I)}$$

for an interval I of mass values; since the volume may be unbounded as $\alpha \rightarrow 1$, one workaround is to choose the interval $I = [0.9, 0.999]$. The related Excess-Mass (EM) curve is:

$$EM_s(t) = \sup_{u \ge 0} \; P(s(X) \ge u) - t\, Leb(\{x : s(x) \ge u\})$$

with criterion:

$$C^{EM}(s) = \lVert EM_s - EM_f \rVert_{L^1(I)}, \qquad I = \left[0, EM_s^{-1}(\mu)\right], \quad EM_s^{-1}(\mu) = \inf\{t \ge 0 : EM_s(t) \le \mu\}$$


algorithms, where the result of one trial affects the choice of next trials and/or

Relevance performance

select in the top priority queue is important but the ranking also matters.

testing. A/B testing is a statistical hypothesis testing with two variants (the control


the two groups. We expect that the performance should be very similar, if not equivalent,

: The effectiveness of the advanced model alone

: The effectiveness of the advanced model in case of security

: The effectiveness of both advanced model and security

A summary of testing

that are both

Data Science Manifesto:

reading and reasoning around the listed principles.

easy.EasyPredictModelWrapper

entire H2O framework but only the h2o-genmodel interfaces. It can be read and used from anything that runs in a JVM.

The POJO object will contain the model class name corresponding to the model id

used in H2O (

    sum = sum + reconstrunctionError[i];
}
p.averageReconstructionError = sum / reconstrunctionError.length;

rawModel = (hex.genmodel.GenModel)

Class.forName(modelClassName).newInstance();

EasyPredictModelWrapper model = new

EasyPredictModelWrapper(rawModel);

RowData row = new RowData();

row.put("Feature1", "value1");

AutoEncoderModelPrediction p = model.predictAutoEncoder(row);

System.out.println("Reconstruction error is: " +

p.averageReconstructionError);

}

use it for scoring a mock data point. We can re-adapt this code to be integrated

Any browser using simple add-ons, such as Postman in Chrome

curl, one of the most popular tools for client-side URL transfers

servers.
