Tree LSTM model is widely used in many deep learning fields. It is often used to process tree structure data.

For example:

**Action recognition** is tree structure data.

**Social-media conversations** are also tree structure data. One post contains some replies, a reply also contians some replies.

Non-tree structure data also can be converted to tree strucutre, for example, use dependency parsing technology to convert a sententce to a tree.

We have introduced the basic LSTM, Tree LSTM is much similar to it.

Understand Long Short-Term Memory Network(LSTM) – LSTM Tutorial

The structure of basic lstm likes:

We can find C_{t} and h_{t} are only determined by one C_{t-1} and h_{t-1}. However, in Tree LSTM, each C_{t} and h_{t} are determined by multiple C_{t-1}, C_{t-2}…. and h_{t-1}, h_{t-2}…, which is the key difference between classic lstm and tree lstm.

**What is Tree LSTM?**

A Tree LSTM Model can be defined as:

where C(j) denotes the set of children of node j and O_{k }is an operator that acts on the hidden vector h_{k} of child k to output .

Then we can define Tree LSTM equations as follows:

From equations above, we can find:

- The input gate and output gate are calculated base on x and
- The forget gate is calculated based on x and each h
_{k}, which is different from the input and forget gate.

**How to get ?**

We shoud get by its child nodes, there are some ways to get it.

**Child Sum Tree Unit**

The child-sum unit involves using sum of all h_{k} vectors which means O = ∑. Therefore

where k this the number of child nodes of a parent node.

Limited: This method views weights of child nodes equally.

**Child Max-Pooling Unit**

The child max-pooling unit involves using the maximum of all h_{k} vectors across a dimension.

Limited: This method only picks maximum value.

**Child Convolve + MaxPooling Tree Unit**

Child convolve uses convolution operation of the set of child hidden vectors i.e. O=⊗ where ⊗ denotes vector convolution operation. As a normal tree node can have any number of child nodes, convolution operation using all child nodes requires a max-pooling operation to preserve the dimension of .

where ⊗ denotes vector convolution operation and max_{P }denotes max pooling operation.

Limited: This method only picks the convoluted features with maximum value.