Understand End-To-End Memory Networks – Part 1 – A Simple Tutorial for NLP Beginners

By | May 30, 2019

End-to-End Memory Networks: https://arxiv.org/abs/1503.08895

End-to-End Memory Networks is often used to solve this kind of problem:

Two input: X and Q
One output: O

For example, in question and answer problem:
X represents sentences, such as “this milk is nice.“, “i do not like this food.
Q represents query phrases, such as “How about this milk?“,”Do you like this food?
O represetns the probility of one sentence to one query phrase.

The basic strucuture of one layer End-to-End Memory Networks is:

The Memory Network can map inputs X and Q to output O with a f function.

O = f(X,Q)

To understand Memory Network, we analysis this network step by step.

1. How to map inputs X into a vector?

Given xi represents each word in sentence “this milk is nice“, we can use a A(V * d) matrix to map each xi to a memory vector, such as

where V is the size of word vocabulary, d is the vector dimension, xi is the position index of each word in vocabulary.


(1) I checked some codes, here matrix A is a variable, it should be traind in model, which means the vector of each word xi is not pretrained( such as by word2vec or glove)

The vector of each word xi is learned by training model and they are sured after completing the model training

self.A = tf.Variable(tf.random_normal([self.nwords, self.edim], stddev=self.init_std))

(2) If you use pretraind vector of each word xi created by word2vec. you can define variable A matrix and compute:
mi = Axi
where xi is a pretrained vector of each word.

2. How to map Q to a vector?

Like input X, we also can use a B(V * d) matrix to get u vector

where B is the same with A, it is also variable and learned by training netwok.


(1) q is a phrase, it also contains some words, to get u vector, you can average all vectors of words in q

(2) words in q are pretrained. We also can use B matrix to map q to u like

u = Bq

where q are vectors of words in phrase Q. to convert u to a vector, we can average, concentrate result.

3.One sentence contains some words, each word contrubutes different weight to the same q, how to compute these weight?

We can use a softmax function to compute.

where pi represents the attention weight of  each word in sentence X to the same input phrase Q.