Summary examples including the reference summary, the summary of the model in [43], and the summary of the proposed model

Table 4.3 shows a test sample including the reference summary, the summary of the model in [43], and the summary of the proposed model PG_Feature_ASDS. The source text of this test sample can be found in Appendix C.4.


Reference summary

Mary Todd Lowrance, teacher at Moises e Molina high school, turned herself into Dallas independent school district police on Thursday morning. Dallas isd police said she had been in a relationship with student, who is older than 17 years old, for a couple of months. She confirmed in coworker who alerted authorities and police eventually got arrest warrant. Lowrance was booked into county jail on $5,000 bond and has been released from the Dallas county jail, according to county records. She has been on leave for several weeks while researchers worked on the case, police said.

Summary of the Pointer-Generator model, Coverage [43]

Lowrance Lowrance was accused of a male school on a $5,000 bond. Lowrance has been on leave for several weeks while researchers worked on the case. The student has been offered counseling warrant.

Summary of the PG_Feature_ASDS model

Mary Todd Lowrance, 49, arrested after being accused of having an improper relationship with a male student older than 17 years old. Miller said the teacher had been in a relationship with the student for a couple of months. Officers learned of the meaningful relationship after Lowrance disclosed details to a coworker who then alerted a Dallas isd officer.


Table 4.3. Summary examples including the reference summary, the summary of the model in [43], and the summary of the proposed model on the CNN/Daily Mail dataset

As can be seen in Table 4.3, the summary of the proposed system provides more information than the summary of the model in [43], and its words are not repeated.

A sample summary with an article of the Baomoi dataset is shown in Table 4.4 below. The source text of this test sample can be found in Appendix C.4.


Reference summary

An illegal factory in Xi Mo, Qingdao, China was recently discovered using 1,000 kg of dirty oil to make moon cakes.

Summary of the Pointer-Generator model, Coverage [43]

The incident was exposed on August 30. The food safety agency, the industry and trade department, the police and the town government coordinated to investigate the incident that was exposed on August 30.

Summary of the PG_Feature_ASDS model

1,000 kg of dirty oil has just been delivered by a cargo truck. According to a reliable source, the oil originated from a processing plant in Ma Diem, Giao Chau. On August 31, a working group consisting of the food safety department, police and town authorities coordinated to investigate this incident.

Table 4.4. Summary examples including the reference summary, the summary of the model in [43], and the summary of the proposed model on the Baomoi dataset

In Table 4.4, the main information of the document can be seen as: “1,000 kg of dirty oil has just been delivered by a cargo truck. According to a reliable source, this oil originated from a processing plant in Ma Diem, Giao Chau. On August 31, a working group consisting of the food safety department, the police and the town government coordinated to investigate this incident.” The reference summary contains most of this information. The summary generated by the model in [43] does not contain the important information “1,000 kg of dirty oil” and only provides part of the necessary information. In addition, although the output summary of the model in [43] is short and lacks key information, the phrase “the incident was exposed on August 30” is repeated twice. Meanwhile, the summary generated by the proposed model PG_Feature_ASDS provides more information than that of the model in [43] and contains no repeated phrases.

Thus, it can be seen that the output summary of the proposed model is easy to understand and has no grammatical errors on both the English and Vietnamese datasets.


4.6. Conclusion of chapter 4


In this chapter, the thesis proposed the development of an effective summary-oriented single-document summarization model for summarizing English and Vietnamese texts, using deep learning techniques, other effective techniques, and text features combined into the summarization model. The specific results achieved are as follows:

- Vectorize input text using word2vec method.

- Using seq2seq network with encoder using biLSTM network and decoder using LSTM network combining attention mechanism, word generation - word copy mechanism and coverage mechanism for summarization model.

- Incorporating sentence position and word frequency features into the summarization model.

- Testing and evaluating the proposed summarization model PG_Feature_ASDS for summarizing English and Vietnamese texts using the CNN/Daily Mail and Baomoi datasets, respectively.

The results of the chapter have been published in the work [CT2]. In the next chapter, the thesis will study and propose an extraction-oriented multi-text summarization model and summary-oriented multi-text summarization models for summarizing English and Vietnamese texts.

Chapter 5. DEVELOPING MULTI-DOCUMENT SUMMARIZATION METHODS


In this chapter, the thesis first proposes the development of an extraction-oriented multi-document summarization model, Kmeans_Centroid_EMDS, for English and Vietnamese summarization, using the K-means clustering technique, the centroid-based method, MMR, and the sentence position feature to generate summaries. The Kmeans_Centroid_EMDS model is tested on the DUC 2007 (English) and Corpus_TMV (Vietnamese) datasets. Then, the thesis proposes the development of a summary-oriented multi-document summarization model, PG_Feature_AMDS, based on the pre-trained summary-oriented single-document summarization model developed in chapter 4, refining this single-document model by further training on the corresponding multi-document summarization datasets so that the proposed PG_Feature_AMDS model achieves better performance. The PG_Feature_AMDS model is tested using the DUC 2007 and DUC 2004 datasets (English) and the ViMs and Corpus_TMV datasets (Vietnamese). Finally, the thesis proposes the development of a summary-oriented multi-document summarization model, Ext_Abs_AMDS-mds-mmr, based on the mixed summarization model built from the pre-trained single-document summarization models developed in chapters 3 and 4, refining this mixed model by further training on the corresponding multi-document summarization datasets so that the proposed Ext_Abs_AMDS-mds-mmr model gives better results. The Ext_Abs_AMDS-mds-mmr model is also tested using the DUC 2007 and DUC 2004 datasets (English) and the ViMs and Corpus_TMV datasets (Vietnamese).


5.1. Introduction to multi-document summarization problem and approach


Nowadays, the volume of news provided on the Internet is huge, and many news articles cover the same topic with some modified details. The need to summarize all these news articles into concise information about the topic arises, and multi-document summarization is a solution to this problem. Multi-document summarization aims to create a single summary that contains the information of all source documents while avoiding duplication of information between documents with the same content. In addition, the lack of test data for the multi-document summarization problem also causes many difficulties. It can be said that the challenge of multi-document summarization is much greater than that of single-document summarization. The multi-document summarization problem can be divided into two types, as follows:

Extraction-oriented multi-document summarization problem: Given a multi-document set of G related documents on the same topic, represented as D_mul = (D_1, D_2, ..., D_i, ..., D_G), where D_i is the i-th document in the set. Each document D_i consists of H sentences, represented as D_i = (s_i1, s_i2, ..., s_ij, ..., s_iH), where s_ij is the j-th sentence of document D_i in the multi-document set D_mul, and H varies from document to document. The task of extraction-oriented multi-document summarization is to generate a concise summary S from the set D_mul, consisting of M sentences and represented as S = (s'_1, s'_2, ..., s'_i, ..., s'_M) (with M < the total number of sentences in D_mul), where s'_i ∈ D_j, j = 1..G. To solve this problem, the thesis reduces the extraction-oriented multi-document summarization problem to a text clustering problem and addresses the challenges of multi-document summarization. The proposed extraction-oriented multi-document summarization method is presented in detail in section 5.2 below.

Summary-oriented multi-document summarization problem: Given a multi-document set D_mul of G documents related to the same topic, represented as D_mul = (D_1, D_2, ..., D_i, ..., D_G), where D_i is the i-th document in the set, represented as D_i = (x_i1, x_i2, ..., x_ij, ..., x_iL); x_ij is the j-th word of document D_i, and L, the number of words of D_i, varies from document to document. The summary of the multi-document set D_mul is generated as a sequence of T words, represented as Y = (y_1, y_2, ..., y_i, ..., y_T), where for i = 1..T, either y_i ∈ D_mul or y_i ∉ D_mul (in the latter case, the word is taken from the vocabulary). To solve the summary-oriented multi-document summarization problem, the thesis deploys two approaches:

- Method 1: Convert the summary-oriented multi-document summarization problem into the summary-oriented single-document summarization problem by combining the texts in the multi-document set into a "hypertext"; this hypertext is considered as a single document, and the proposed summary-oriented single-document summarization techniques are applied to generate the final summary.

- Method 2: Convert the summary-oriented multi-document summarization problem into the summary-oriented single-document summarization problem by first summarizing each single document of the multi-document set to obtain a summary, then combining these summaries into a "hypertext". This hypertext is considered as a single document, and the proposed summary-oriented single-document summarization techniques are applied to generate the final summary.
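The two conversion methods above can be illustrated with a minimal sketch; `summarize_single` is a hypothetical placeholder standing in for the proposed single-document summarization model, and for illustration it simply truncates the text:

```python
def summarize_single(text: str, max_words: int = 50) -> str:
    # Placeholder for the proposed single-document summarizer
    # (PG_Feature_ASDS); for illustration it simply truncates.
    return " ".join(text.split()[:max_words])

def method1(documents: list) -> str:
    """Method 1: merge all documents into one 'hypertext',
    then summarize it as a single document."""
    hypertext = " ".join(documents)
    return summarize_single(hypertext)

def method2(documents: list) -> str:
    """Method 2: summarize each document separately, merge the
    partial summaries into a 'hypertext', then summarize that."""
    partial_summaries = [summarize_single(d) for d in documents]
    hypertext = " ".join(partial_summaries)
    return summarize_single(hypertext)
```

Method 2 trades one long input for several short ones, which can suit summarization models with a limited input length.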

These two summary-oriented multi-document summarization methods will be presented in section 5.3.


5.2. Extraction-oriented multi-text summarization model Kmeans_Centroid_EMDS


5.2.1. Model introduction


Studies on extraction-oriented multi-text summarization often group similar sentences from input multi-texts into clusters and select the center sentences of each cluster to include in the summary [136,137]. Cosine similarity is often used to calculate the similarity between a pair of sentences (sentences are represented as TF-IDF weighted vectors). The most frequently occurring sentence is considered the center of the cluster. However, this method does not consider the semantics of each word in the text, so the resulting summary may not be semantically good. Another problem with this approach is that some clusters may contain unimportant information from the input texts.

Some studies have applied the centroid-based method to generate summaries, such as [138,139]. This approach generates cluster centroids containing words that are central to all input texts; the summary is generated by collecting sentences containing the centroid words. The disadvantage of this approach is that it does not prevent information redundancy in the summary. To address this, Carbonell and Goldstein [116] proposed the MMR method for generating summaries. However, this approach does not eliminate unimportant sentences in the summary. It can be said that creating a summary that best describes the input texts while containing as little redundant information as possible is a major challenge of the multi-document summarization problem. To solve these problems, this thesis proposes an extraction-oriented multi-document summarization approach that uses the K-means clustering algorithm to cluster the sentences of the input texts. To handle the selection of representative sentences and the problem of unimportant clusters, the centroid-based method is used to find the most central sentences and remove the clusters containing little information. In addition, the MMR method is applied to remove duplicate information between sentences in the summary. The summary is generated with a reasonable time sequence based on the sentence position feature added to the model.

The method is described as follows. First, the input multi-document set D_mul = (D_1, D_2, ..., D_i, ..., D_G) is merged into a single large document of N sentences, represented as D = (s_1, s_2, ..., s_i, ..., s_N), where N is the total number of sentences of all documents in D_mul. Next, a clustering technique is applied to D to obtain K clusters, C = (C_1, C_2, ..., C_i, ..., C_K), i = 1..K, where cluster C_i = (s_i1, s_i2, ..., s_in_i) consists of n_i sentences and has a corresponding cluster centroid c_i determined by the algorithm. The centroid-based method is then used to find the central sentences and eliminate clusters that contain little information. The sentence s*_i with the greatest similarity to the cluster centroid c_i is chosen to represent cluster C_i, and the set S* = (s*_1, s*_2, ..., s*_K) of K sentences corresponding to the K clusters is collected. Finally, the MMR method, based on the similarity and position features of the sentences, is applied to select sentences from the set S* to include in the summary S.
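The final MMR-based selection step can be illustrated with a minimal sketch, assuming a simple bag-of-words cosine similarity; the function names, the λ value, and the similarity choice are illustrative assumptions, not the exact implementation used in the thesis:

```python
import math

def bow(sentence: str) -> dict:
    # Bag-of-words term-frequency vector of a sentence.
    vec = {}
    for w in sentence.lower().split():
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine_sim(a: dict, b: dict) -> float:
    # Cosine similarity between two sparse vectors.
    num = sum(a[w] * b.get(w, 0) for w in a)
    den = math.sqrt(sum(v * v for v in a.values())) * \
          math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def mmr_select(candidates: list, centroid: dict, k: int, lam: float = 0.7) -> list:
    """Greedy MMR: each step picks the sentence that balances relevance
    to the cluster centroid against redundancy with sentences already chosen."""
    remaining = list(candidates)
    chosen = []
    while remaining and len(chosen) < k:
        def score(s):
            relevance = cosine_sim(bow(s), centroid)
            redundancy = max((cosine_sim(bow(s), bow(c)) for c in chosen), default=0.0)
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        chosen.append(best)
        remaining.remove(best)
    return chosen
```

With λ close to 1 the selection favors relevance to the centroid; smaller λ penalizes overlap with sentences already selected.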


5.2.2. Main components of the model


5.2.2.1. Sentence vectorization

The set of words extracted from the input text needs to be converted into vectors; the length of each vector depends on the size of the vocabulary or a chosen size. The proposed model uses the word2vec method to vectorize the input text for the clustering model using the K-means algorithm.
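As an illustration of this vectorization step, the following sketch forms a sentence vector by averaging pre-trained word2vec vectors; the tiny three-dimensional vector table is a toy assumption standing in for a real trained model:

```python
def sentence_vector(sentence: str, word_vectors: dict, dim: int) -> list:
    """Represent a sentence as the average of the word2vec vectors of
    its words; out-of-vocabulary words are skipped."""
    acc = [0.0] * dim
    count = 0
    for word in sentence.lower().split():
        vec = word_vectors.get(word)
        if vec is None:
            continue
        for i in range(dim):
            acc[i] += vec[i]
        count += 1
    if count:
        acc = [v / count for v in acc]
    return acc

# Toy 3-dimensional "word2vec" table, for illustration only.
wv = {"dirty": [1.0, 0.0, 0.0], "oil": [0.0, 1.0, 0.0], "factory": [0.0, 0.0, 1.0]}
vec = sentence_vector("dirty oil factory", wv, 3)
```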


5.2.2.2. K-means for clustering problems

a) Clustering problem

Input:

+ N data points X = (x_1, x_2, ..., x_N), x_i ∈ R^d, where each data point belongs to exactly one cluster;

+ K, the number of clusters to find (K < N);

Output:

+ The centroids of the clusters: m_1, m_2, ..., m_K ∈ R^d;

+ The label of each data point: for each data point x_i, we call y_i = (y_i1, y_i2, ..., y_iK) its label vector, where if x_i is assigned to cluster k then y_ik = 1 and y_ij = 0 for all j ≠ k (meaning that the element of y_i corresponding to the cluster of x_i is 1, and all other elements are 0).

With this condition on the label vectors, we can write:

y_ij ∈ {0, 1}, ∀i, j;   Σ_{j=1}^{K} y_ij = 1, ∀i.   (5.1)

If the centroid m_k represents the k-th cluster and a data point x_i is assigned to cluster k, the error made by approximating x_i with m_k is x_i − m_k. We want this error to be close to zero, that is, x_i close to m_k. This can be expressed by minimizing the squared Euclidean distance ||x_i − m_k||₂². Since x_i is assigned to cluster k, the expression ||x_i − m_k||₂² can be rewritten as:

||x_i − m_k||₂² = y_ik ||x_i − m_k||₂² = Σ_{j=1}^{K} y_ij ||x_i − m_j||₂²   (because y_ik = 1 and y_ij = 0 for all j ≠ k)

The error over the entire dataset is:

L(Y, M) = Σ_{i=1}^{N} Σ_{j=1}^{K} y_ij ||x_i − m_j||₂²

where Y = (y_1, y_2, ..., y_N) and M = (m_1, m_2, ..., m_K) are the matrices formed by the label vectors of the data points and by the cluster centroids, respectively. The loss function of the K-means clustering problem is L(Y, M) subject to the conditions in formula (5.1).

Thus, we need to solve the optimization problem:

(Y, M) = argmin_{Y,M} Σ_{i=1}^{N} Σ_{j=1}^{K} y_ij ||x_i − m_j||₂²   (5.2)

subject to the constraints: y_ij ∈ {0, 1}, ∀i, j;   Σ_{j=1}^{K} y_ij = 1, ∀i.

To solve problem (5.2), we solve the following two sub-problems:

- Problem 1: Fix M, find Y (the centroids are known; find the label vectors) so that the loss function is minimized.

+ With the centroids known, the problem of finding the label vectors for all data points reduces to finding the label vector of each data point x_i separately:

y_i = argmin_{y_i} Σ_{j=1}^{K} y_ij ||x_i − m_j||₂²   (5.3)

subject to the conditions: y_ij ∈ {0, 1}, ∀i, j;   Σ_{j=1}^{K} y_ij = 1, ∀i.

+ Because exactly one element of the label vector y_i is equal to 1, problem (5.3) reduces to finding the cluster j whose centroid is nearest to the point x_i:

j = argmin_j ||x_i − m_j||₂²   (5.4)

+ Because ||x_i − m_j||₂² is the squared Euclidean distance from the point x_i to the centroid m_j, we can conclude that each point x_i belongs to the cluster whose centroid is nearest to it. From this, the label vector of each data point can be inferred.

- Problem 2: Fix Y, find M (the cluster of each point is known; find the new centroid of each cluster) so that the loss function reaches its minimum value.

+ With the label vector of each data point known, the problem of finding the centroid of each cluster becomes:

m_j = argmin_{m_j} Σ_{i=1}^{N} y_ij ||x_i − m_j||₂²   (5.5)

+ Because the function to be optimized is continuous and has a derivative defined at every point, we can find its minimum by setting the derivative to zero. Let l(m_j) = Σ_{i=1}^{N} y_ij ||x_i − m_j||₂² (l(m_j) is the function inside the argmin); its derivative is:

∂l(m_j)/∂m_j = 2 Σ_{i=1}^{N} y_ij (m_j − x_i)   (5.6)

Setting the derivative equal to 0 gives:

m_j Σ_{i=1}^{N} y_ij = Σ_{i=1}^{N} y_ij x_i   (5.7)

We have:

m_j = ( Σ_{i=1}^{N} y_ij x_i ) / ( Σ_{i=1}^{N} y_ij )   (5.8)

In formula (5.8), Σ_{i=1}^{N} y_ij is the number of data points in cluster j, so m_j is the average of the points in cluster j.

b) K-means clustering algorithm

The K-means algorithm for the data clustering problem [120,121,140] is an unsupervised learning algorithm and one of the most popular clustering methods. The algorithm is summarized as follows.


Algorithm 5.1: K-means clustering algorithm

Input: Data set X, K clusters to find;

Output: Centroids M, label vector Y for each data point;

Algorithm:

1: Randomly initialize K data points as the initial centroids of the K clusters;

2: Repeat the following steps until the convergence condition is met:
2.1: For each data point, assign it to the cluster with the closest centroid;

2.2: For each cluster, recalculate the cluster centroid based on the data points belonging to that cluster;

3: Return;
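Algorithm 5.1 can be sketched in plain Python as follows; this is a minimal illustration in which the random initialization and the convergence test (labels no longer changing) are simplifying assumptions:

```python
import random

def kmeans(points: list, k: int, max_iter: int = 100, seed: int = 0) -> tuple:
    """Plain K-means following Algorithm 5.1: assign each point to the
    nearest centroid (Eq. 5.4), then recompute each centroid as the mean
    of its cluster (Eq. 5.8), until the labels stop changing."""
    rng = random.Random(seed)
    # Step 1: K random data points become the initial centroids.
    centroids = [p[:] for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Step 2.1: assign each point to the cluster with the closest centroid.
        new_labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_labels == labels:  # convergence: no label changed
            break
        labels = new_labels
        # Step 2.2: recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, lbl in zip(points, labels) if lbl == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, labels
```

On two well-separated groups of points, the returned labels split the groups cleanly regardless of which points were drawn as initial centroids.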


5.2.2.3. Center-based text summarization

Centroid-based methods [13] are commonly used in text summarization to identify the central sentences in a corpus, that is, the sentences that contain the necessary amount of information and are highly relevant to the main topic of the corpus. A sentence vector is represented based on the TF-IDF values of the words in the sentence. A word is a centroid word if its TF-IDF value is greater than a certain threshold. Sentences containing many centroid words are selected for inclusion in the summary. The model uses the BoW model with TF-IDF weights for the centroid-based text summarization problem.

The center-based algorithm for text summarization is described below.


Algorithm 5.2: Centroid-based algorithm for text summarization

Input: Set of sentences;

Output: A summary of the set of input sentences;

Algorithm:

1: The set of sentences extracted from the input text is represented as vectors (with size equal to the size of the vocabulary) using the BoW model with TF-IDF weights.

2: Calculate the centroid vector v: the size of the centroid vector is equal to the size of the vocabulary. Each element a_w of v represents a word w in the vocabulary, calculated by the formula a_w = Σ_{s∈S} TF_IDF(w, s), where S is the set of sentences and TF_IDF(w, s) is the TF-IDF value of word w in sentence s.

3: Calculate the centrality of each sentence as the similarity between the sentence vector and the centroid vector; if the centrality of a sentence is less than a threshold value, its centrality is reset to 0. The similarity between a sentence vector s and the centroid vector v is computed as sim(s, v) = (1 − cosine(s, v) + 1) / 2, where cosine(s, v) = 1 − (s · v) / (||s||₂ ||v||₂) is the cosine distance between s and v.

4: Sort the set of sentences in descending order of the calculated centrality.

5: The summary is generated by selecting sentences one by one from the sorted sentence set and adding them to the summary (skipping sentences whose information overlaps with the sentences already included in the summary).
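The scoring part of Algorithm 5.2 (steps 1 to 4) can be sketched as follows; the whitespace tokenization and the exact TF-IDF weighting are simplifying assumptions, and no centrality threshold or redundancy filtering is applied in this sketch:

```python
import math

def tfidf_vectors(sentences: list) -> list:
    """Step 1: represent each sentence as a TF-IDF weighted
    bag-of-words vector over the corpus vocabulary."""
    docs = [s.lower().split() for s in sentences]
    n = len(docs)
    df = {}  # document frequency of each word
    for words in docs:
        for w in set(words):
            df[w] = df.get(w, 0) + 1
    vectors = []
    for words in docs:
        vec = {}
        for w in words:
            tf = words.count(w) / len(words)
            vec[w] = tf * math.log(n / df[w])
        vectors.append(vec)
    return vectors

def centroid_rank(sentences: list) -> list:
    """Steps 2-4: build the centroid vector by summing TF-IDF weights
    over all sentences, score each sentence by its similarity to the
    centroid, and return the sentences in descending centrality order."""
    vecs = tfidf_vectors(sentences)
    centroid = {}
    for vec in vecs:
        for w, val in vec.items():
            centroid[w] = centroid.get(w, 0.0) + val

    def sim(a, b):
        num = sum(a[w] * b.get(w, 0.0) for w in a)
        den = math.sqrt(sum(v * v for v in a.values())) * \
              math.sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    order = sorted(range(len(sentences)),
                   key=lambda i: sim(vecs[i], centroid), reverse=True)
    return [sentences[i] for i in order]
```

A summary would then be produced by taking sentences from the front of the ranked list, applying the threshold and overlap checks of steps 3 and 5.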
