Basic Abstractive Single-Document Summarization Model [128]

Next, the thesis presents and analyzes the model of Nallapati et al. [128], which is used as the base model for developing the proposed summarization model. Then, two mechanisms from [43] are presented that overcome weaknesses of the base model [128]. Finally, the thesis proposes a solution to overcome the weaknesses of the model in [43] in order to improve the effectiveness of the summarization model.


4.2. Basic summarization model


The main components of the basic abstractive single-document summarization model [128] are shown in Figure 4.1 below.

Figure 4.1. Basic abstractive single-document summarization model [128]

4.2.1. The seq2seq model


In the seq2seq model, the encoder reads the input text x = (x_1, x_2, x_3, ..., x_j, ..., x_J) and encodes it into the encoded hidden states h^e = (h_1^e, h_2^e, h_3^e, ..., h_j^e, ..., h_J^e); these encoded hidden states are the input to the decoder, which generates the output summary y = (y_1, y_2, y_3, ..., y_t, ..., y_T), in which:

- x_j, y_t are the vectors of the words of the input text and the summary text, respectively.

- J, T are the numbers of words of the input text and the summary text, respectively.

The seq2seq model of the base model [128] uses a biLSTM network as the encoder and an LSTM network as the decoder. The biLSTM encoder encodes the sequence of input words x into the forward encoded hidden states →h^e = (→h_1^e, →h_2^e, →h_3^e, ..., →h_j^e, ..., →h_J^e) and the backward encoded hidden states ←h^e = (←h_1^e, ..., ←h_J^e), with h_j^e = →h_j^e ⊕ ←h_j^e; where →h_j^e, ←h_j^e are the context-dependent representations of the j-th input word in the forward and backward directions respectively, ⊕ is the concatenation operator, and the superscript e denotes the encoding stage. The LSTM decoder takes the encoded representation of the text (the hidden and memory states →h_J^e, ←h_1^e, →c_J^e, ←c_1^e) as input to generate the summary y. The encoded vectors are used to initialize the hidden and memory states of the decoder as follows:

h_0^d = tanh(W_e2d(→h_J^e ⊕ ←h_1^e) + b_e2d)    (4.3)

c_0^d = →c_J^e ⊕ ←c_1^e    (4.4)

where the superscript d denotes the decoding stage and W_e2d, b_e2d are learning parameters. At each decoding step t, the hidden state h_t^d is updated based on the previous hidden state h_{t-1}^d and the embedding E_{y_{t-1}} of the input word (in the training phase, y_{t-1} is taken from the reference summary):

h_t^d = LSTM(h_{t-1}^d, E_{y_{t-1}})    (4.5)

Then, the vocabulary distribution is calculated according to the formula:

P_{vocab,t} = softmax(W_d2v h_t^d + b_d2v)    (4.6)

where W_d2v, b_d2v are learning parameters and P_{vocab,t} is a vector whose size equals the size of the vocabulary V. We denote the probability of generating a target word w of the vocabulary V as P_{vocab,t}(w).
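As an illustration, the decoder initialization in (4.3)-(4.4) and the vocabulary distribution in (4.6) can be sketched with numpy. The dimensions and random weights below are invented for the example; they are not the trained parameters of the model:

```python
import numpy as np

rng = np.random.default_rng(0)
H, V = 4, 10                      # toy hidden size and vocabulary size

# Final forward / first backward encoder hidden states and memory cells
h_fwd_J, h_bwd_1 = rng.normal(size=H), rng.normal(size=H)
c_fwd_J, c_bwd_1 = rng.normal(size=H), rng.normal(size=H)

# (4.3): h_0^d = tanh(W_e2d(h_fwd_J concat h_bwd_1) + b_e2d)
W_e2d, b_e2d = rng.normal(size=(H, 2 * H)), rng.normal(size=H)
h0_d = np.tanh(W_e2d @ np.concatenate([h_fwd_J, h_bwd_1]) + b_e2d)

# (4.4): c_0^d is the concatenation of the encoder memory states
c0_d = np.concatenate([c_fwd_J, c_bwd_1])

# (4.6): vocabulary distribution from the decoder hidden state
W_d2v, b_d2v = rng.normal(size=(V, H)), rng.normal(size=V)
logits = W_d2v @ h0_d + b_d2v
P_vocab = np.exp(logits - logits.max())
P_vocab /= P_vocab.sum()          # softmax: a valid probability distribution
```

Note that tanh keeps the initial decoder state bounded, while the memory state is passed through unchanged.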

4.2.2. Attention mechanism applied in the model


When the seq2seq model uses an attention mechanism, the decoder not only takes the encoded representations (hidden states and memory states) but also selects the key parts of the input at each decoding step. Let the encoded hidden states be h^e = (h_1^e, h_2^e, h_3^e, ..., h_J^e), with h_j^e = →h_j^e ⊕ ←h_j^e. At each decoding step t, the attention score of each input word is computed from the encoded hidden state h_j^e and the decoding hidden state h_t^d as follows:

s_{tj}^e = (v_align)^T tanh(W_align(h_j^e ⊕ h_t^d) + b_align)    (4.7)

where W_align, v_align, b_align are learning parameters. Then, the attention distribution at step t is computed from the attention scores of all the words of the input text (s_{t1}^e, s_{t2}^e, s_{t3}^e, ..., s_{tJ}^e) by the formula:

α_{tj}^e = exp(s_{tj}^e) / Σ_{k=1..J} exp(s_{tk}^e),  j = 1..J    (4.8)

The context vector is computed from the attention distribution as follows:

c_t^e = Σ_{j=1..J} α_{tj}^e h_j^e    (4.9)

With the current hidden state h_t^d, the attentional hidden state is computed by the formula:

ĥ_t^d = W_c(c_t^e ⊕ h_t^d) + b_c    (4.10)

Finally, the vocabulary distribution is calculated according to the formula:

P_{vocab,t} = softmax(W_d2v ĥ_t^d + b_d2v)    (4.11)

where W_d2v, W_c, b_c and b_d2v are learning parameters. For t > 1, the hidden state h_t^d is updated according to the formula:

h_t^d = LSTM(h_{t-1}^d, E_{y_{t-1}} ⊕ ĥ_{t-1}^d)    (4.12)

in which E_{y_{t-1}} ⊕ ĥ_{t-1}^d is the decoder input.
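The attention computation in (4.7)-(4.10) can be sketched as follows; the sizes and random parameters are toy values chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
H, J = 4, 5                              # toy hidden size and input length
h_e = rng.normal(size=(J, H))            # encoder hidden states h_j^e
h_d = rng.normal(size=H)                 # current decoder state h_t^d

W_align = rng.normal(size=(H, 2 * H))
b_align = rng.normal(size=H)
v_align = rng.normal(size=H)

# (4.7): attention score of each input word
s = np.array([v_align @ np.tanh(W_align @ np.concatenate([h_e[j], h_d]) + b_align)
              for j in range(J)])

# (4.8): attention distribution (softmax over the input positions)
alpha = np.exp(s - s.max())
alpha /= alpha.sum()

# (4.9): context vector as the attention-weighted sum of encoder states
c = alpha @ h_e

# (4.10): attentional hidden state
W_c, b_c = rng.normal(size=(H, 2 * H)), rng.normal(size=H)
h_hat = W_c @ np.concatenate([c, h_d]) + b_c
```

Subtracting `s.max()` before exponentiating is a standard numerically stable way to evaluate the softmax in (4.8).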


4.2.3. Word-copy network


Pointer-generator networks allow both copying words from the source text and generating words from the vocabulary. This mechanism overcomes the problem of words that are not in the vocabulary. It does not affect the calculation of the attention distribution and the context vector, so they are still calculated according to formulas (4.8) and (4.9) respectively. The context vector c_t^e, the decoding hidden state h_t^d and the decoder input E_t are the inputs used to calculate the generation probability p_gen according to the following formula:

p_{gen,t} = σ(W_{s,c} c_t^e + W_{s,h} h_t^d + W_{s,E} E_t + b_s)    (4.13)

in which W_{s,c}, W_{s,h}, W_{s,E}, b_s are learning parameters and σ is the sigmoid function. p_{gen,t} is a real number with p_{gen,t} ∈ [0, 1]; it is considered a switch gate that decides between generating a word from the vocabulary and copying a word from the source text using the attention distribution (depending on whether the word is in the dictionary or not).

Let P(y_t) be the final distribution used to predict a word; we have:

P(y_t) = p_{gen,t} P_g(y_t) + (1 − p_{gen,t}) P_c(y_t)    (4.14)

The vocabulary distribution P_g(y_t) and the copy distribution P_c(y_t) are defined as follows:

P_g(y_t) = P_{vocab,t}(y_t) if y_t ∈ V; 0 if y_t ∉ V    (4.15)

P_c(y_t) = Σ_{j: x_j = y_t} α_{tj}^e if y_t ∈ V_1; 0 if y_t ∉ V_1    (4.16)

with: V is the dictionary, V_1 is the set of words of the source text.
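The mixture in (4.14)-(4.16) can be illustrated on a toy example. The vocabulary, attention values and p_gen below are invented for the example (in the model, p_gen would come from formula (4.13)):

```python
import numpy as np

vocab = ["the", "cat", "sat", "<unk>"]    # dictionary V
source = ["the", "zebra", "sat"]          # source words V_1; "zebra" is OOV
alpha = np.array([0.2, 0.5, 0.3])         # attention distribution (4.8)
P_vocab = np.array([0.4, 0.3, 0.2, 0.1])  # generator distribution (4.11)
p_gen = 0.7                               # generation probability (4.13)

def final_prob(w):
    # (4.14): P(y_t) = p_gen * P_g(y_t) + (1 - p_gen) * P_c(y_t)
    P_g = P_vocab[vocab.index(w)] if w in vocab else 0.0        # (4.15)
    P_c = sum(a for a, x in zip(alpha, source) if x == w)       # (4.16)
    return p_gen * P_g + (1 - p_gen) * P_c
```

Here the out-of-vocabulary word "zebra" still receives a nonzero probability (0.3 × 0.5 = 0.15) because it can be copied from the source, which is exactly how the mechanism handles unknown words.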

4.2.4. Coverage mechanism


The coverage mechanism, proposed by Tu et al. [133], was initially used in neural machine translation (NMT) to overcome a disadvantage of the attention mechanism, namely that it discards information about what has already been attended to and thus allows repetition. With this advantage, the coverage mechanism is applied to the text summarization problem to solve the word-duplication issue. In this model, the coverage vector u_t^e is defined as the sum of the attention distributions of the previous decoding steps, calculated by the formula:

u_{tj}^e = Σ_{t'=1..t−1} α_{t'j}^e    (4.17)

Therefore, the coverage vector contains, for each word of the input text, the attention information of the previous decoding steps. This coverage vector is used to recompute the attention scores according to the following formula:

s_{tj}^e = (v_align)^T tanh(W_align(h_j^e ⊕ h_t^d ⊕ u_{tj}^e) + b_align)    (4.18)

The reason for this change is that word repetition can occur because the current decoding step depends too much on the previous decoding step, so the coverage vector (computed as the sum of the attention distributions of the previous decoding steps) is given as an additional input to mitigate this problem. Then, the coverage loss value is calculated according to the formula:

covloss_t = Σ_j min(α_{tj}^e, u_{tj}^e)    (4.19)

The coverage loss is combined with the hyperparameter λ into the loss function:

loss_t = −log P(w_t*) + λ Σ_j min(α_{tj}^e, u_{tj}^e)    (4.20)

However, during the training phase the loss function is calculated only according to the following formula:

loss_t = −log P(w_t*)    (4.21)

In the testing phase, the coverage loss is calculated as in formula (4.19). Using the coverage loss only in the testing phase helps reduce the model training time [43].
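The coverage vector (4.17) and the coverage loss (4.19) can be sketched on toy attention distributions; the values below are invented for the example:

```python
import numpy as np

# Attention distributions of decoding steps 1..3 (each row sums to 1)
alphas = np.array([[0.7, 0.1, 0.1, 0.1],
                   [0.6, 0.2, 0.1, 0.1],
                   [0.1, 0.1, 0.7, 0.1]])

def coverage(t):
    # (4.17): u_t = sum of the attention distributions of steps before t
    return alphas[:t - 1].sum(axis=0)

def covloss(t):
    # (4.19): covloss_t = sum_j min(alpha_tj, u_tj)
    return np.minimum(alphas[t - 1], coverage(t)).sum()
```

Step 2 re-attends to word 0, which step 1 already covered heavily, so its coverage loss is large (0.9); step 3 attends mostly elsewhere, so its loss is smaller (0.5). Penalizing the overlap is what discourages repetition.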


4.3. PG_Feature_ASDS abstractive single-document summarization model


Pascanu et al. [134] pointed out that a weakness of summarization models built on RNNs is the vanishing gradient problem: when the input text is too long, the first part of the text is forgotten. The LSTM network does not completely solve this problem. Since the main content of news articles is often located at the beginning, See et al. [43] work around it by feeding only the first part of the article into the model. However, this solution reduces the flexibility of the summarization model, because not all types of documents have their important content located at the beginning of the text. To address this problem, the thesis proposes adding sentence-position information (POSI) as a feature to the model, which increases the weight of the sentences at the beginning of the text without truncating the input.

In addition, each output word is generated based on the attention distribution over all input words on the encoder side and on the previously generated output words. Since the entire text is used without cutting off its end, the attention assigned to each word decreases as the text grows, so the effectiveness of attention decreases. To overcome this problem, the proposed model adds a word-frequency (TF) feature to help the model focus on important words.

The proposed new features for the model are detailed below.

4.3.1. New proposed features for the model


4.3.1.1. Sentence position feature

With the input text x = (x_1, x_2, x_3, ..., x_J) consisting of k sentences, we can rewrite the vector as x = (x_{1,1}, x_{2,1}, x_{3,1}, ..., x_{J,k}), in which x_{j,k} represents the j-th word, which belongs to the k-th sentence. From the vector x, we can define a vector with length equal to that of x representing, for each word, the position of the sentence that contains it:

x_POSI = (1, 1, 1, ..., k, k)

Since the important information is concentrated at the beginning of the text, we increase the weight of the words at the beginning of the text. Therefore, x_POSI is used to recalculate the attention score according to formula (4.22) below; the attention distribution is then computed by formula (4.8) above.

s_{tj}^e = ((v_align)^T tanh(W_align(h_j^e ⊕ h_t^d) + b_align)) / x_{POSI,j}    (4.22)
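Building x_POSI can be sketched as follows; the sentence segmentation here is hypothetical example data:

```python
# Each word carries the index of the sentence that contains it,
# so words of the first sentence get the smallest position value.
sentences = [["the", "cat", "sat"], ["it", "slept"]]   # k = 2 sentences
x_posi = [k for k, sent in enumerate(sentences, start=1) for _ in sent]
```

Dividing the attention score by this value, as in the reading of (4.22) above, leaves the scores of the first sentence unchanged and shrinks those of later sentences.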


4.3.1.2. Word frequency feature

The frequency of a word in a document is a parameter for determining the important words of the document: the higher the frequency of a word, the higher the probability that the word is important. The TF feature is therefore added to increase the weight of important words. The TF feature of a word is calculated according to the formula:

TF(x_i, x) = f(x_i, x) / max{f(x_j, x) | j = 1..J}    (4.23)

in which:

- f(x_i, x) is the number of times x_i appears in the text.

- max{f(x_j, x) | j = 1..J} is the maximum number of times any word appears in the text.

From the vector x, we can determine a vector with length equal to that of x representing the TF feature as follows:

x_TF = (TF(x_1, x), TF(x_2, x), TF(x_3, x), ..., TF(x_J, x))    (4.24)

The vector x_TF is used to recalculate the attention score according to formula (4.25) below; the attention distribution is then computed by formula (4.8) above.

s_{tj}^e = ((v_align)^T tanh(W_align(h_j^e ⊕ h_t^d) + b_align)) · x_{TF,j} / x_{POSI,j}    (4.25)
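The TF feature (4.23)-(4.24) and the rescaling of formula (4.25) can be sketched as follows; the tokens and base scores are toy values, and the rescaling follows the reading of (4.22) and (4.25) given above:

```python
import numpy as np

# (4.23)-(4.24): each word's count normalized by the maximum count
tokens = ["the", "cat", "saw", "the", "dog"]
counts = {w: tokens.count(w) for w in tokens}
max_f = max(counts.values())
x_tf = np.array([counts[w] / max_f for w in tokens])

# (4.25): scale base attention scores by TF and divide by sentence position
x_posi = np.array([1, 1, 1, 2, 2])           # sentence index of each word
s_base = np.full(5, 0.4)                     # toy base scores from (4.7)
s_new = s_base * x_tf / x_posi
# frequent words and words in early sentences keep the highest scores
```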


4.3.2. Proposed abstractive single-document summarization model


The proposed model architecture includes a seq2seq model with a biLSTM encoder and an LSTM decoder, together with an attention mechanism that helps the model focus on the main information of the text. Although the seq2seq model uses an attention mechanism, it still has disadvantages such as word repetition, sentence repetition, and information loss. Therefore, the proposed model uses two mechanisms from [43] to solve these problems:

- Coverage mechanism: fixes word- and sentence-repetition errors.

- Copy mechanism (pointer network): addresses information loss.

However, during summarization tests for English (CNN/Daily Mail dataset) and Vietnamese (Baomoi dataset), the model did not give the expected results; many test samples produced inaccurate summaries. The thesis therefore proposes adding two new text features to the model: the position of the sentence in the text (POSI) and the frequency of the word in the text (TF).

The proposed model with newly added POSI and TF features is shown in Figure 4.2 below.


Figure 4.2. Proposed abstractive single-document summarization model PG_Feature_ASDS


4.4. Model testing


4.4.1. Test data sets


The proposed model is tested on two datasets: CNN/Daily Mail for English and Baomoi for Vietnamese. The purpose of testing on the CNN/Daily Mail dataset is to compare the results of the proposed model with recent abstractive text summarization systems for English on the same dataset. The experiment on the Baomoi dataset evaluates the effectiveness of the proposed model for another language, Vietnamese, to ensure the generality of the proposed abstractive summarization approach.

4.4.2. Data preprocessing


First, the input texts are word-segmented using the Stanford CoreNLP library for English texts and the UETsegmenter 14 library for Vietnamese texts. For the texts of the Baomoi dataset, meaningless tokens that occur in many texts (for example: vov.vn, dantri.vn, baodautu.vn, ...) are removed because they do not contribute to the content of the text; texts without a summary or without content are removed, and articles that are too short (fewer than 50 characters) are also eliminated.

Then, each data unit (consisting of one summary and one article body) is serialized into the binary format required by TensorFlow (for both datasets). This formatting is applied to all three subsets: the training set, the validation set, and the test set. At the same time, a vocabulary of 50,000 words is built from the training data.
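The cleaning step described above can be sketched as follows; the function name and the exact boilerplate token list are illustrative assumptions rather than the thesis's actual code:

```python
# Hypothetical sketch of the Baomoi cleaning step: drop boilerplate
# domain tokens and discard articles that are too short or incomplete.
BOILERPLATE = {"vov.vn", "dantri.vn", "baodautu.vn"}

def clean_example(summary, content, min_chars=50):
    """Return a cleaned (summary, content) pair, or None if discarded."""
    if not summary or not content:
        return None                       # no summary or no content
    words = [w for w in content.split() if w not in BOILERPLATE]
    content = " ".join(words)
    if len(content) < min_chars:
        return None                       # article too short
    return summary, content
```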

4.4.3. Experimental design


The thesis tested four different models on the CNN/Daily Mail and Baomoi datasets as follows:

(i) Model 1: Basic seq2seq model with attention mechanism [128].

(ii) Model 2: Pointer-Generator network with coverage mechanism [43].

(iii) Model 3: The proposed system is based on [43] and adds sentence position feature.

(iv) Model 4: The proposed system is based on [43] and adds sentence position and word frequency features.

Models 1 and 2 are tested using the source code of [43] on the two datasets CNN/Daily Mail and Baomoi. Models 3 and 4 are implemented by the thesis in order to select the proposed summarization model.

The input to the model is the sequence of words of the article, each word represented as a vector. The vocabulary size in the experiments is 50,000 words for both English and Vietnamese. The model has a 256-dimensional hidden state and 128-dimensional word embedding vectors; the batch size is 16, and the input text length is limited to 800 words for English and 550 words for Vietnamese (most English texts are shorter than 800 words and most Vietnamese texts are shorter than 550 words, so limiting the length to this extent is reasonable). The model uses the Adagrad optimizer [135] with a learning rate of 0.15 and an initial accumulator value of 0.1. When fine-tuning the model, the value of the loss function is used for early stopping. During the evaluation phase, the summary length is limited to a maximum of 100 words for both datasets.

In addition, the thesis also tested the model of See et al. [43] on the CNN/Daily Mail dataset to evaluate the effectiveness of using only the first 400 words of the text as input to the system.


14 https://github.com/phongnt570/UETsegmenter

4.5. Evaluation and comparison of results


Table 4.1 below shows the experimental results on the CNN/Daily Mail dataset. The R-1, R-2 and R-L measures are used to evaluate and compare the performance of the models.

Model | R-1 | R-2 | R-L
Model 1 (Seq2seq + attention) [128] | 27.21 | 10.09 | 24.48
Model 2 (Pointer-Generator + Coverage) [43] (*) | 29.71 | 12.13 | 28.05
Model 3 ((*) + POSI) | 31.16 | 12.66 | 28.61
Model 4 ((*) + POSI + TF) | 31.89 | 13.01 | 29.97

Table 4.1. Experimental results of the models on the CNN/Daily Mail dataset. The symbol '(*)' denotes the model of See et al. [43]

When the experiment of [43] was repeated using the first 400 words of each article as input, an R-1 score of 35.87% was obtained. However, when the entire article was used as input, the R-1 measure dropped to 29.71%. This is because, when a long text is fed to the model, the first part of the text is "forgotten" by the system, while the main content of an article is often located at the beginning. However, truncating articles in this way reduces the generality of the system, since important information may not be located in the first 400 words of the text.

Table 4.1 shows that when using the full text of the article as input, both proposed models (model 3 and model 4) outperform the systems in [128] and [43] in all three measures R-1, R-2, and RL. The experimental results show that the sentence position feature is important information in generating a quality summary and word frequency is a good indicator for text summarization tasks using deep learning techniques. When information about sentence position and word frequency are added to the model, the R-1 measure is significantly improved, 2.18% higher than the R-1 measure of the system in [43].

Table 4.2 below shows the experimental results on the Baomoi dataset.

Model | R-1 | R-2 | R-L
Model 1 (Seq2seq + attention baseline) [128] | 26.68 | 9.34 | 16.49
Model 2 (Pointer-Generator + Coverage) [43] (*) | 28.34 | 11.06 | 18.55
Model 3 ((*) + POSI) | 29.47 | 11.31 | 18.85
Model 4 ((*) + POSI + TF) | 30.59 | 11.53 | 19.45

Table 4.2. Experimental results of the models on the Baomoi dataset. The symbol '(*)' denotes the model of See et al. [43]

The results in Table 4.2 also show that both proposed models achieve higher R-1, R-2 and R-L scores than the other two systems. The best proposed model obtains an R-1 measure 2.25% higher than that of the model in [43] and 3.91% higher than that of the baseline model in [128].
