Hierarchical Attention Networks are typically used for text-classification tasks, as shown in Yang et al. (2016), and are based on the idea that to extract a document’s meaning one has to start from its smaller components, the words, and work upwards.1

The network is hierarchical because the attention-weighted word encodings are fed directly into a “higher-up” sentence encoder, whose outputs are in turn attention-weighted to form the document representation.

The proposed model projects the raw document into a vector representation, on which we build a classifier to perform document classification. In the following, we will present how we build the document level vector progressively from word vectors by using the hierarchical structure.
– p. 1482
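The passage above can be sketched in miniature. The snippet below is a toy, pure-Python illustration of the two attention-pooling steps: word vectors are pooled into sentence vectors, which are then pooled into a document vector. The context vectors `u_word` and `u_sent` play the role of the learned word- and sentence-level context vectors in Yang et al. (2016), but their values here, the 3-d vectors, and the plain dot-product scoring are all simplifying assumptions; the real model scores a bi-GRU annotation through a learned MLP before the softmax.

```python
import math

def softmax(scores):
    # numerically stable softmax over a list of raw scores
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def attention_pool(vectors, context):
    # score each vector against a "context" vector, then return the
    # attention-weighted average and the weights themselves
    scores = [sum(v_i * c_i for v_i, c_i in zip(v, context)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights

# toy document: 2 sentences, each a list of 3-d word "encodings"
# (assumed values; a real HAN produces these with a bidirectional GRU)
doc = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]],
]
u_word = [0.5, 0.5, 0.0]   # hypothetical word-level context vector
u_sent = [0.0, 1.0, 1.0]   # hypothetical sentence-level context vector

# word level -> sentence vectors, then sentence level -> document vector
sent_vecs = [attention_pool(words, u_word)[0] for words in doc]
doc_vec, sent_weights = attention_pool(sent_vecs, u_sent)
```

The resulting `doc_vec` is the document-level vector on which a classifier (a softmax layer in the paper) would be trained; `sent_weights` shows how much each sentence contributed.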

Seo et al. (2016)2 introduce a CNN version of this architecture that carries attention estimates across iterations, progressively narrowing down to the parts of a sentence or document that are most useful for finding the correct classification.


  • GRU encoder (Bahdanau et al. 2014)
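For intuition on the GRU encoder mentioned above, here is a toy scalar version of one GRU update step: an update gate z blends the previous hidden state with a candidate state computed after a reset gate r partially erases the old state. The scalar parameters are hypothetical placeholders; a real encoder operates on vectors with learned weight matrices.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(h, x, p):
    # one scalar GRU update (toy sketch; real GRUs use vectors and matrices)
    z = sigmoid(p["Wz"] * x + p["Uz"] * h + p["bz"])                # update gate
    r = sigmoid(p["Wr"] * x + p["Ur"] * h + p["br"])                # reset gate
    h_cand = math.tanh(p["Wh"] * x + p["Uh"] * (r * h) + p["bh"])   # candidate state
    return (1.0 - z) * h + z * h_cand                               # blended new state

# hypothetical fixed parameters; a trained encoder would learn these
params = {"Wz": 1.0, "Uz": 0.5, "bz": 0.0,
          "Wr": 1.0, "Ur": 0.5, "br": 0.0,
          "Wh": 1.0, "Uh": 1.0, "bh": 0.0}

h = 0.0
for x in [0.5, -1.0, 2.0]:   # a toy "sentence" of scalar word features
    h = gru_step(h, x, params)
```

Running the recurrence over a word sequence leaves `h` as a summary of the sequence; in the HAN, a bidirectional pair of such encoders produces the word annotations that the attention layer then pools.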


  1. Yang et al. (2016), “Hierarchical Attention Networks for Document Classification”, NAACL-HLT 2016 (N16-1174.pdf)

  2. Seo et al. (2016), “Hierarchical Attention Networks”, arXiv:1606.02393v1