Basic Abstractive Single-Document Summarization Model [128]

Next, the thesis presents and analyzes the model of Nallapati et al. [128], which is used as the base model for developing the proposed summarization model. Then, two mechanisms from [43] are presented that overcome weaknesses of the base model [128]. Finally, the thesis proposes a solution to overcome the weaknesses of the model in [43] in order to improve the effectiveness of the summarization model.


4.2. Basic summarization model


The main components of the basic abstractive single-document summarization model [128] are shown in Figure 4.1 below.

Figure 4.1. Basic abstractive single-document summarization model [128]

4.2.1. The seq2seq model


In the seq2seq model, the encoder reads the input text x = (x_1, x_2, x_3, ..., x_j, ..., x_J) and encodes it into the encoded hidden states h^e = (h_1^e, h_2^e, h_3^e, ..., h_j^e, ..., h_J^e); these encoded hidden states are the input to the decoder, which generates the output summary y = (y_1, y_2, y_3, ..., y_t, ..., y_T), in which:

- x_j, y_t are the vectors of the words of the input text and the summary text, respectively.

- J, T are the numbers of words of the input text and the summary text, respectively.

The seq2seq model of the base model [128] uses a biLSTM network as the encoder and an LSTM network as the decoder. The biLSTM encoder encodes the sequence of input words x into the forward encoded hidden states →h^e = (→h_1^e, →h_2^e, →h_3^e, ..., →h_j^e, ..., →h_J^e) and the backward encoded hidden states ←h^e = (←h_1^e, ..., ←h_J^e), with h_j^e = →h_j^e ⊕ ←h_j^e; where →h_j^e, ←h_j^e are the context-dependent representations of the j-th input word in the forward and backward directions respectively, ⊕ is the concatenation operator, and the superscript e denotes the encoding stage. The LSTM decoder takes the encoded representation of the text (the hidden and memory states →h_J^e, ←h_1^e, →c_J^e, ←c_1^e) as input to generate the summary y. The encoded vectors are used to initialize the hidden and memory states of the decoder as follows:

h_0^d = tanh(W_e2d(→h_J^e ⊕ ←h_1^e) + b_e2d)    (4.3)

c_0^d = →c_J^e ⊕ ←c_1^e    (4.4)

where the superscript d denotes the decoding stage and W_e2d, b_e2d are learning parameters. At each decoding step t, the hidden state h_t^d is updated based on the previous hidden state h_{t-1}^d and the embedding E_{y_{t-1}} of the input word (in the training phase, y_{t-1} is taken from the reference summary):

h_t^d = LSTM(h_{t-1}^d, E_{y_{t-1}})    (4.5)

Then, the vocabulary distribution is calculated according to the formula:

P_{vocab,t} = softmax(W_d2v h_t^d + b_d2v)    (4.6)

where W_d2v, b_d2v are learning parameters and P_{vocab,t} is a vector whose size equals the size of the vocabulary V. We denote the probability of generating a target word w of the vocabulary V as P_{vocab,t}(w).
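As an illustration, the decoder initialization in (4.3)-(4.4) and the vocabulary distribution in (4.6) can be sketched with numpy. The dimensions and random weights below are invented for the example; they are not the trained parameters of the model:

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 4, 10                      # toy hidden size and vocabulary size

# Final forward / first backward encoder hidden states and memory cells
h_fwd_J, h_bwd_1 = rng.normal(size=H), rng.normal(size=H)
c_fwd_J, c_bwd_1 = rng.normal(size=H), rng.normal(size=H)

# (4.3): h_0^d = tanh(W_e2d(h_fwd_J concat h_bwd_1) + b_e2d)
W_e2d, b_e2d = rng.normal(size=(H, 2 * H)), rng.normal(size=H)
h0_d = np.tanh(W_e2d @ np.concatenate([h_fwd_J, h_bwd_1]) + b_e2d)

# (4.4): c_0^d is the concatenation of the encoder memory states
c0_d = np.concatenate([c_fwd_J, c_bwd_1])

# (4.6): vocabulary distribution from the decoder hidden state
W_d2v, b_d2v = rng.normal(size=(V, H)), rng.normal(size=V)
logits = W_d2v @ h0_d + b_d2v
P_vocab = np.exp(logits - logits.max())
P_vocab /= P_vocab.sum()          # softmax: a valid probability distribution
```

Note that tanh keeps the initial decoder state bounded, while the memory state is passed through unchanged.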

4.2.2. Attention mechanism applied in the model


When the seq2seq model uses an attention mechanism, the decoder not only takes the encoded representations (hidden states and memory states) but also selects the key parts of the input at each decoding step. Let the encoded hidden states be h^e = (h_1^e, h_2^e, h_3^e, ..., h_J^e), with h_j^e = →h_j^e ⊕ ←h_j^e. At each decoding step t, the attention score of each input word is computed from the encoded hidden state h_j^e and the decoding hidden state h_t^d as follows:

s_{tj}^e = (v_align)^T tanh(W_align(h_j^e ⊕ h_t^d) + b_align)    (4.7)

where W_align, v_align, b_align are learning parameters. Then, the attention distribution at step t is computed from the attention scores of all the words of the input text (s_{t1}^e, s_{t2}^e, s_{t3}^e, ..., s_{tJ}^e) by the formula:

α_{tj}^e = exp(s_{tj}^e) / Σ_{k=1..J} exp(s_{tk}^e),  j = 1..J    (4.8)

The context vector is computed from the attention distribution as follows:

c_t^e = Σ_{j=1..J} α_{tj}^e h_j^e    (4.9)

With the current hidden state h_t^d, the attentional hidden state is computed by the formula:

ĥ_t^d = W_c(c_t^e ⊕ h_t^d) + b_c    (4.10)

Finally, the vocabulary distribution is calculated according to the formula:

P_{vocab,t} = softmax(W_d2v ĥ_t^d + b_d2v)    (4.11)

where W_d2v, W_c, b_c and b_d2v are learning parameters. For t > 1, the hidden state h_t^d is updated according to the formula:

h_t^d = LSTM(h_{t-1}^d, E_{y_{t-1}} ⊕ ĥ_{t-1}^d)    (4.12)

in which E_{y_{t-1}} ⊕ ĥ_{t-1}^d is the decoder input.
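The attention computation in (4.7)-(4.10) can be sketched as follows; the sizes and random parameters are toy values chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
H, J = 4, 5                              # toy hidden size and input length
h_e = rng.normal(size=(J, H))            # encoder hidden states h_j^e
h_d = rng.normal(size=H)                 # current decoder state h_t^d

W_align = rng.normal(size=(H, 2 * H))
b_align = rng.normal(size=H)
v_align = rng.normal(size=H)

# (4.7): attention score of each input word
s = np.array([v_align @ np.tanh(W_align @ np.concatenate([h_e[j], h_d]) + b_align)
              for j in range(J)])

# (4.8): attention distribution (softmax over the input positions)
alpha = np.exp(s - s.max())
alpha /= alpha.sum()

# (4.9): context vector as the attention-weighted sum of encoder states
c = alpha @ h_e

# (4.10): attentional hidden state
W_c, b_c = rng.normal(size=(H, 2 * H)), rng.normal(size=H)
h_hat = W_c @ np.concatenate([c, h_d]) + b_c
```

Subtracting `s.max()` before exponentiating is a standard numerically stable way to evaluate the softmax in (4.8).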


4.2.3. Word-copy network


Pointer-generator networks allow both copying words from the source text and generating words from the vocabulary. This mechanism overcomes the problem of words that are not in the vocabulary. It does not affect the calculation of the attention distribution and the context vector, so they are still calculated according to formulas (4.8) and (4.9) respectively. The context vector c_t^e, the decoding hidden state h_t^d and the decoder input E_t are the inputs used to calculate the generation probability p_gen according to the following formula:

p_{gen,t} = σ(W_{s,c} c_t^e + W_{s,h} h_t^d + W_{s,E} E_t + b_s)    (4.13)

in which W_{s,c}, W_{s,h}, W_{s,E}, b_s are learning parameters and σ is the sigmoid function. p_{gen,t} is a real number with p_{gen,t} ∈ [0, 1]; it is considered a switch gate that decides between generating a word from the vocabulary and copying a word from the source text using the attention distribution (depending on whether the word is in the dictionary or not).

Let P(y_t) be the final distribution used to predict a word; we have:

P(y_t) = p_{gen,t} P_g(y_t) + (1 − p_{gen,t}) P_c(y_t)    (4.14)

The vocabulary distribution P_g(y_t) and the copy distribution P_c(y_t) are defined as follows:

P_g(y_t) = P_{vocab,t}(y_t) if y_t ∈ V; 0 if y_t ∉ V    (4.15)

P_c(y_t) = Σ_{j: x_j = y_t} α_{tj}^e if y_t ∈ V_1; 0 if y_t ∉ V_1    (4.16)

with: V is the dictionary, V_1 is the set of words of the source text.
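The mixture in (4.14)-(4.16) can be illustrated on a toy example. The vocabulary, attention values and p_gen below are invented for the example (in the model, p_gen would come from formula (4.13)):

```python
import numpy as np

vocab = ["the", "cat", "sat", "<unk>"]    # dictionary V
source = ["the", "zebra", "sat"]          # source words V_1; "zebra" is OOV
alpha = np.array([0.2, 0.5, 0.3])         # attention distribution (4.8)
P_vocab = np.array([0.4, 0.3, 0.2, 0.1])  # generator distribution (4.11)
p_gen = 0.7                               # generation probability (4.13)

def final_prob(w):
    # (4.14): P(y_t) = p_gen * P_g(y_t) + (1 - p_gen) * P_c(y_t)
    P_g = P_vocab[vocab.index(w)] if w in vocab else 0.0        # (4.15)
    P_c = sum(a for a, x in zip(alpha, source) if x == w)       # (4.16)
    return p_gen * P_g + (1 - p_gen) * P_c
```

Here the out-of-vocabulary word "zebra" still receives a nonzero probability (0.3 × 0.5 = 0.15) because it can be copied from the source, which is exactly how the mechanism handles unknown words.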

4.2.4. Coverage mechanism


The coverage mechanism, proposed by Tu et al. [133], was initially used in neural machine translation (NMT) to overcome a disadvantage of the attention mechanism, namely that it discards information about what has already been attended to and thus allows repetition. With this advantage, the coverage mechanism is applied to the text summarization problem to solve the word-duplication issue. In this model, the coverage vector u_t^e is defined as the sum of the attention distributions of the previous decoding steps, calculated by the formula:

u_{tj}^e = Σ_{t'=1..t−1} α_{t'j}^e    (4.17)

Therefore, the coverage vector contains, for each word of the input text, the attention information of the previous decoding steps. This coverage vector is used to recompute the attention scores according to the following formula:

s_{tj}^e = (v_align)^T tanh(W_align(h_j^e ⊕ h_t^d ⊕ u_{tj}^e) + b_align)    (4.18)

The reason for this change is that word repetition can occur because the current decoding step depends too much on the previous decoding step, so the coverage vector (computed as the sum of the attention distributions of the previous decoding steps) is given as an additional input to mitigate this problem. Then, the coverage loss value is calculated according to the formula:

covloss_t = Σ_j min(α_{tj}^e, u_{tj}^e)    (4.19)

The coverage loss is combined with the hyperparameter λ into the loss function:

loss_t = −log P(w_t*) + λ Σ_j min(α_{tj}^e, u_{tj}^e)    (4.20)

However, during the training phase the loss function is calculated only according to the following formula:

loss_t = −log P(w_t*)    (4.21)

In the testing phase, the coverage loss is calculated as in formula (4.19). Using the coverage loss only in the testing phase helps reduce the model training time [43].
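The coverage vector (4.17) and the coverage loss (4.19) can be sketched on toy attention distributions; the values below are invented for the example:

```python
import numpy as np

# Attention distributions of decoding steps 1..3 (each row sums to 1)
alphas = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.6, 0.2, 0.1, 0.1],
                   [0.1, 0.1, 0.7, 0.1]])

def coverage(t):
    # (4.17): u_t = sum of the attention distributions of steps before t
    return alphas[:t - 1].sum(axis=0)

def covloss(t):
    # (4.19): covloss_t = sum_j min(alpha_tj, u_tj)
    return np.minimum(alphas[t - 1], coverage(t)).sum()
```

Step 2 re-attends to word 0, which step 1 already covered heavily, so its coverage loss is large (0.9); step 3 attends mostly elsewhere, so its loss is smaller (0.5). Penalizing the overlap is what discourages repetition.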


4.3. PG_Feature_ASDS abstractive single-document summarization model


Pascanu et al. [134] pointed out that a weakness of summarization models built on RNNs is the vanishing gradient problem: when the input text is too long, the first part of the text is forgotten. The LSTM network does not completely solve this problem. Since the main content of news articles is often located at the beginning, See et al. [43] work around it by feeding only the first part of the article into the model. However, this solution reduces the flexibility of the summarization model, because not all types of documents have their important content located at the beginning of the text. To address this problem, the thesis proposes adding sentence-position information (POSI) as a feature to the model, which increases the weight of the sentences at the beginning of the text without truncating the input.

In addition, each output word is generated based on the attention distribution over all input words on the encoder side and on the previously generated output words. Since the entire text is used without cutting off its end, the attention assigned to each word decreases as the text grows, so the effectiveness of attention decreases. To overcome this problem, the proposed model adds a word-frequency (TF) feature to help the model focus on important words.

The proposed new features for the model are detailed below.

4.3.1. New proposed features for the model


4.3.1.1. Sentence position feature

With the input text x = (x_1, x_2, x_3, ..., x_J) consisting of k sentences, we can rewrite the vector as x = (x_{1,1}, x_{2,1}, x_{3,1}, ..., x_{J,k}), in which x_{j,k} represents the j-th word, which belongs to the k-th sentence. From the vector x, we can define a vector with length equal to that of x representing, for each word, the position of the sentence that contains it:

x_POSI = (1, 1, 1, ..., k, k)

Since the important information is concentrated at the beginning of the text, we increase the weight of the words at the beginning of the text. Therefore, x_POSI is used to recalculate the attention score according to formula (4.22) below; the attention distribution is then computed by formula (4.8) above.

s_{tj}^e = ((v_align)^T tanh(W_align(h_j^e ⊕ h_t^d) + b_align)) / x_{POSI,j}    (4.22)
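Building x_POSI can be sketched as follows; the sentence segmentation here is hypothetical example data:

```python
# Each word carries the index of the sentence that contains it,
# so words of the first sentence get the smallest position value.
sentences = [["the", "cat", "sat"], ["it", "slept"]]   # k = 2 sentences
x_posi = [k for k, sent in enumerate(sentences, start=1) for _ in sent]
```

Dividing the attention score by this value, as in the reading of (4.22) above, leaves the scores of the first sentence unchanged and shrinks those of later sentences.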


4.3.1.2. Word frequency feature

The frequency of a word in a document is a parameter for determining the important words of the document: the higher the frequency of a word, the higher the probability that the word is important. The TF feature is therefore added to increase the weight of important words. The TF feature of a word is calculated according to the formula:

TF(x_i, x) = f(x_i, x) / max{f(x_j, x) | j = 1..J}    (4.23)

in which:

- f(x_i, x) is the number of times x_i appears in the text.

- max{f(x_j, x) | j = 1..J} is the maximum number of times any word appears in the text.

From the vector x, we can determine a vector with length equal to that of x representing the TF feature as follows:

x_TF = (TF(x_1, x), TF(x_2, x), TF(x_3, x), ..., TF(x_J, x))    (4.24)

The vector x_TF is used to recalculate the attention score according to formula (4.25) below; the attention distribution is then computed by formula (4.8) above.

s_{tj}^e = ((v_align)^T tanh(W_align(h_j^e ⊕ h_t^d) + b_align)) · x_{TF,j} / x_{POSI,j}    (4.25)
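The TF feature (4.23)-(4.24) and the rescaling of formula (4.25) can be sketched as follows; the tokens and base scores are toy values, and the rescaling follows the reading of (4.22) and (4.25) given above:

```python
import numpy as np

# (4.23)-(4.24): each word's count normalized by the maximum count
tokens = ["the", "cat", "saw", "the", "dog"]
counts = {w: tokens.count(w) for w in tokens}
max_f = max(counts.values())
x_tf = np.array([counts[w] / max_f for w in tokens])

# (4.25): scale base attention scores by TF and divide by sentence position
x_posi = np.array([1, 1, 1, 2, 2])           # sentence index of each word
s_base = np.full(5, 0.4)                     # toy base scores from (4.7)
s_new = s_base * x_tf / x_posi
# frequent words and words in early sentences keep the highest scores
```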


4.3.2. Proposed abstractive single-document summarization model


The proposed model architecture includes a seq2seq model with a biLSTM encoder and an LSTM decoder, together with an attention mechanism that helps the model focus on the main information of the text. Although the seq2seq model uses an attention mechanism, it still has disadvantages such as word repetition, sentence repetition, and information loss. Therefore, the proposed model uses two mechanisms from [43] to solve these problems:

- Coverage mechanism: fixes word- and sentence-repetition errors.

- Copy mechanism (pointer network): addresses information loss.

However, during summarization tests for English (CNN/Daily Mail dataset) and Vietnamese (Baomoi dataset), the model did not give the expected results; many test samples produced inaccurate summaries. The thesis therefore proposes adding two new text features to the model: the position of the sentence in the text (POSI) and the frequency of the word in the text (TF).

The proposed model with newly added POSI and TF features is shown in Figure 4.2 below.


Figure 4.2. Proposed abstractive single-document summarization model PG_Feature_ASDS


4.4. Model testing


4.4.1. Test data sets


The proposed model is tested on two datasets: CNN/Daily Mail for English and Baomoi for Vietnamese. The purpose of testing on the CNN/Daily Mail dataset is to compare the results of the proposed model with recent abstractive text summarization systems for English on the same dataset. The experiment on the Baomoi dataset evaluates the effectiveness of the proposed model for another language, Vietnamese, to ensure the generality of the proposed abstractive summarization approach.

4.4.2. Data preprocessing


First, the input texts are word-segmented using the Stanford CoreNLP library for English texts and the UETsegmenter 14 library for Vietnamese texts. For the texts of the Baomoi dataset, meaningless tokens that occur in many texts (for example: vov.vn, dantri.vn, baodautu.vn, ...) are removed because they do not contribute to the content of the text; texts without a summary or without content are removed, and articles that are too short (fewer than 50 characters) are also eliminated.

Then, each data unit (consisting of one summary and one article body) is serialized into the binary format required by TensorFlow (for both datasets). This formatting is applied to all three subsets: the training set, the validation set, and the test set. At the same time, a vocabulary of 50,000 words is built from the training data.
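The cleaning step described above can be sketched as follows; the function name and the exact boilerplate token list are illustrative assumptions rather than the thesis's actual code:

```python
# Hypothetical sketch of the Baomoi cleaning step: drop boilerplate
# domain tokens and discard articles that are too short or incomplete.
BOILERPLATE = {"vov.vn", "dantri.vn", "baodautu.vn"}

def clean_example(summary, content, min_chars=50):
    """Return a cleaned (summary, content) pair, or None if discarded."""
    if not summary or not content:
        return None                       # no summary or no content
    words = [w for w in content.split() if w not in BOILERPLATE]
    content = " ".join(words)
    if len(content) < min_chars:
        return None                       # article too short
    return summary, content
```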

4.4.3. Experimental design


The thesis tested four different models on the CNN/Daily Mail and Baomoi datasets as follows:

(i) Model 1: Basic seq2seq model with attention mechanism [128].

(ii) Model 2: Pointer-Generator network with coverage mechanism [43].

(iii) Model 3: The proposed system is based on [43] and adds sentence position feature.

(iv) Model 4: The proposed system is based on [43] and adds sentence position and word frequency features.

Models 1 and 2 are tested using the source code of [43] on the two datasets CNN/Daily Mail and Baomoi. Models 3 and 4 are implemented by the thesis in order to select the proposed summarization model.

The input to the model is the sequence of words of the article, each word represented as a vector. The vocabulary size in the experiments is 50,000 words for both English and Vietnamese. The model has a 256-dimensional hidden state and 128-dimensional word embedding vectors; the batch size is 16, and the input text length is limited to 800 words for English and 550 words for Vietnamese (most English texts are shorter than 800 words and most Vietnamese texts are shorter than 550 words, so limiting the length to this extent is reasonable). The model uses the Adagrad optimizer [135] with a learning rate of 0.15 and an initial accumulator value of 0.1. When fine-tuning the model, the value of the loss function is used for early stopping. During the evaluation phase, the summary length is limited to a maximum of 100 words for both datasets.

In addition, the thesis also tested the model of See et al. [43] on the CNN/Daily Mail dataset to evaluate the effectiveness of using only the first 400 words of the text as input to the system.


14 https://github.com/phongnt570/UETsegmenter

4.5. Evaluation and comparison of results


Table 4.1 below shows the experimental results on the CNN/Daily Mail dataset. The R-1, R-2 and R-L measures are used to evaluate and compare the performance of the models.

Model | R-1 | R-2 | R-L
Model 1 (Seq2seq + attention) [128] | 27.21 | 10.09 | 24.48
Model 2 (Pointer-Generator + Coverage) [43] (*) | 29.71 | 12.13 | 28.05
Model 3 ((*) + POSI) | 31.16 | 12.66 | 28.61
Model 4 ((*) + POSI + TF) | 31.89 | 13.01 | 29.97

Table 4.1. Experimental results of the models on the CNN/Daily Mail dataset. The symbol '(*)' denotes the model of See et al. [43]

When the experiment of [43] was repeated using the first 400 words of each article as input, an R-1 score of 35.87% was obtained. However, when the entire article was used as input, the R-1 measure dropped to 29.71%. This is because, when a long text is fed to the model, the first part of the text is "forgotten" by the system, while the main content of an article is often located at the beginning. However, truncating articles in this way reduces the generality of the system, since important information may not be located in the first 400 words of the text.

Table 4.1 shows that when using the full text of the article as input, both proposed models (model 3 and model 4) outperform the systems in [128] and [43] in all three measures R-1, R-2, and RL. The experimental results show that the sentence position feature is important information in generating a quality summary and word frequency is a good indicator for text summarization tasks using deep learning techniques. When information about sentence position and word frequency are added to the model, the R-1 measure is significantly improved, 2.18% higher than the R-1 measure of the system in [43].

Table 4.2 below shows the experimental results on the Baomoi dataset.

Model | R-1 | R-2 | R-L
Model 1 (Seq2seq + attention baseline) [128] | 26.68 | 9.34 | 16.49
Model 2 (Pointer-Generator + Coverage) [43] (*) | 28.34 | 11.06 | 18.55
Model 3 ((*) + POSI) | 29.47 | 11.31 | 18.85
Model 4 ((*) + POSI + TF) | 30.59 | 11.53 | 19.45

Table 4.2. Experimental results of the models on the Baomoi dataset. The symbol '(*)' denotes the model of See et al. [43]

The results in Table 4.2 also show that both proposed models achieve higher R-1, R-2 and R-L scores than the other two systems. The best proposed model obtains an R-1 measure 2.25% higher than that of the model in [43] and 3.91% higher than that of the baseline model in [128].
