Understand Tree LSTM Model: A Completed Guide – LSTM Tutorial

By | July 29, 2020

Tree LSTM model is widely used in many deep learning fields. It is often used to process tree structure data.

For example:

Action recognition is tree structure data.

Social-media conversations are also tree structure data. One post contains some replies, a reply also contians some replies.

Non-tree structure data also can be converted to tree strucutre, for example, use dependency parsing technology to convert a sententce to a tree.

We have introduced the basic LSTM, Tree LSTM is much similar to it.

Understand Long Short-Term Memory Network(LSTM) – LSTM Tutorial

The structure of basic lstm likes:

We can find Ct and ht are only determined by one Ct-1 and ht-1. However, in Tree LSTM, each Ct and ht are determined by multiple Ct-1, Ct-2…. and ht-1, ht-2…, which is the key difference between classic lstm and tree lstm.

What is Tree LSTM?

A Tree LSTM Model can be defined as:

where C(j) denotes the set of children of node j and Ok is an operator that acts on the hidden vector hk of child k to  output .

Then we can define Tree LSTM equations as follows:

From equations above, we can find:

• The input gate and output gate are calculated base on x and
• The forget gate is calculated based on x and each hk, which is different from the input and forget gate.

How to get ?

We shoud get by its child nodes, there are some ways to get it.

Child Sum Tree Unit

The child-sum unit involves using sum of all hk vectors which means O = ∑. Therefore

where k this the number of child nodes of a parent node.

Limited: This method views weights of child nodes equally.

Child Max-Pooling Unit

The child max-pooling unit involves using the maximum of all hk vectors across a dimension.

Limited: This method only picks maximum value.

Child Convolve + MaxPooling Tree Unit

Child convolve uses convolution operation of the set of child hidden vectors i.e. O=⊗ where denotes vector convolution operation. As a normal tree node can have any number of child nodes, convolution operation using all child nodes requires a max-pooling operation to preserve the dimension of .

where denotes vector convolution operation and maxP denotes max pooling operation.

Limited: This method only picks the convoluted features with maximum value.