Table 4.3 shows a test sample including the reference summary, the summary generated by the model in [43], and the summary generated by the proposed PG_Feature_ASDS model. The source text of this test sample is given in Appendix C.4.
Reference summary
Mary Todd Lowrance, teacher at Moises e Molina high school, turned herself into Dallas independent school district police on Thursday morning. Dallas isd police said she had been in a relationship with student, who is older than 17 years old, for a couple of months. She confirmed in coworker who alerted authorities and police eventually got arrest warrant. Lowrance was booked into county jail on $5,000 bond and has been released from the Dallas county jail, according to county records. She has been on leave for several weeks while researchers worked on the case, police said.
Summary of the Pointer-Generator with Coverage model [43]
Lowrance Lowrance was accused of a male school on a $5,000 bond. Lowrance has been on leave for several weeks while researchers worked on the case. The student has been offered counseling warrant.
Summary of the PG_Feature_ASDS model |
Mary Todd Lowrance, 49, arrested after being accused of having an improper relationship with a male student older than 17 years old. Miller said the teacher had been in a relationship with the student for a couple of months. Officers learned of the meaningful relationship after Lowrance disclosed details to a coworker who then alerted a Dallas isd officer. |

Table 4.3. A test sample including the reference summary, the summary of the model in [43], and the summary of the proposed model on the CNN/Daily Mail dataset
As can be seen in Table 4.3, the summary of the proposed system provides more information than the summary of the model in [43] and does not contain repeated words.
A sample summary for an article from the Baomoi dataset is shown in Table 4.4 below. The source text of this test sample can be found in Appendix C.4.
Reference summary
An illegal factory in Xi Mo, Qingdao, China was recently discovered using 1,000 kg of dirty oil to make moon cakes.
Summary of the Pointer-Generator with Coverage model [43]
The incident was exposed on August 30. The food safety agency, the industry and trade department, the police and the town government coordinated to investigate the incident that was exposed on August 30. August 30th.
Summary of the PG_Feature_ASDS model
1,000 kg of dirty oil has just been delivered by a cargo truck. According to a reliable source, the oil originated from a processing plant in Ma Diem, Giao Chau. On August 31, a working group consisting of the food safety department, police and town authorities coordinated to investigate this incident.
Table 4.4. A test sample including the reference summary, the summary of the model in [43], and the summary of the proposed model on the Baomoi dataset
In Table 4.4, the main information of the document is: “1,000 kg of dirty oil has just been delivered by a cargo truck. According to a reliable source, this oil originated from a processing plant in Ma Diem, Giao Chau. On August 31, a working group consisting of the food safety department, the police and the town government coordinated to investigate this incident”. The reference summary contains most of this information. The summary generated by the model in [43] omits the important information “1,000 kg of dirty oil” and provides only part of the necessary information. Moreover, although that summary is short and lacks key information, the phrase “the incident was exposed on August 30” is repeated twice. Meanwhile, the summary generated by the proposed PG_Feature_ASDS model provides more information than that of the model in [43] and contains no repeated phrases.
Thus, it can be seen that the output summaries of the proposed model are easy to understand and grammatically correct on both the English and Vietnamese datasets.
4.6. Conclusion of chapter 4
In this chapter, the thesis proposed an effective abstractive single-document summarization model for English and Vietnamese texts, using deep learning techniques together with other effective techniques and text features. The specific results achieved are as follows:
- Vectorized the input text using the word2vec method.
- Built the summarization model on a seq2seq network with a biLSTM encoder and an LSTM decoder, combined with the attention mechanism, the word generation/copy (pointer) mechanism, and the coverage mechanism.
- Incorporated sentence position and word frequency features into the summarization model.
- Tested and evaluated the proposed PG_Feature_ASDS summarization model on English and Vietnamese texts using the CNN/Daily Mail and Baomoi datasets, respectively.
The results of this chapter have been published in [CT2]. In the next chapter, the thesis studies and proposes an extractive multi-document summarization model and abstractive multi-document summarization models for English and Vietnamese texts.
Chapter 5. DEVELOPING MULTI-DOCUMENT SUMMARIZATION METHODS
In this chapter, the thesis first proposes an extractive multi-document summarization model, Kmeans_Centroid_EMDS, for English and Vietnamese summarization, which uses the K-means clustering technique, the centroid-based method, MMR, and the sentence position feature to generate summaries. The Kmeans_Centroid_EMDS model is tested on the DUC 2007 (English) and Corpus_TMV (Vietnamese) datasets. Next, the thesis proposes an abstractive multi-document summarization model, PG_Feature_AMDS, based on the pre-trained abstractive single-document summarization model developed in Chapter 4; this single-document model is refined by further training on the corresponding multi-document summarization datasets so that PG_Feature_AMDS achieves better performance. The PG_Feature_AMDS model is tested on the DUC 2007 and DUC 2004 datasets (English) and the ViMs and Corpus_TMV datasets (Vietnamese). Finally, the thesis proposes an abstractive multi-document summarization model, Ext_Abs_AMDS-mds-mmr, based on a hybrid summarization model built from the pre-trained single-document summarization models developed in Chapters 3 and 4; this hybrid model is refined by further training on the corresponding multi-document summarization datasets so that Ext_Abs_AMDS-mds-mmr gives better results. The Ext_Abs_AMDS-mds-mmr model is also tested on the DUC 2007 and DUC 2004 datasets (English) and the ViMs and Corpus_TMV datasets (Vietnamese).
5.1. Introduction to the multi-document summarization problem and approaches
Nowadays, the volume of news provided on the Internet is huge, and many news articles cover the same topic with some modified details. This raises the need to summarize all of these articles into concise information about the topic, and multi-document summarization is a solution to this problem. Multi-document summarization aims to create a single summary that covers the information of all source documents while avoiding duplication of information between documents with the same content. In addition, the lack of test data for the multi-document summarization problem causes further difficulties. It can be said that multi-document summarization is considerably more challenging than single-document summarization. The multi-document summarization problem can be divided into two types, stated as follows:
Extractive multi-document summarization problem: Given a multi-document set of $G$ related documents on the same topic, represented as $D_{mul} = (D_1, D_2, \ldots, D_i, \ldots, D_G)$, where $D_i$ is the $i$-th document in the set. Each document $D_i$ consists of $H$ sentences, $D_i = (s_{i1}, s_{i2}, \ldots, s_{ij}, \ldots, s_{iH})$, where $s_{ij}$ is the $j$-th sentence of document $D_i$ and $H$ varies from document to document. The task of extractive multi-document summarization is to generate a concise summary $S$ of the set $D_{mul}$ consisting of $M$ sentences, represented as $S = (s'_1, s'_2, \ldots, s'_i, \ldots, s'_M)$ (with $M <$ the total number of sentences in $D_{mul}$), where $s'_i \in D_j$, $j = \overline{1, G}$. To solve this problem, the thesis converts extractive multi-document summarization into a sentence clustering problem and addresses the challenges of multi-document summarization. The proposed extractive multi-document summarization method is presented in detail in Section 5.2 below.
Abstractive multi-document summarization problem: Given a multi-document set $D_{mul}$ of $G$ related documents on the same topic, represented as $D_{mul} = (D_1, D_2, \ldots, D_i, \ldots, D_G)$, where $D_i$ is the $i$-th document in the set. Each document $D_i$ is represented as $D_i = (x_{i1}, x_{i2}, \ldots, x_{ij}, \ldots, x_{iL})$, where $x_{ij}$ is the $j$-th word of document $D_i$ and $L$ is the number of words of $D_i$, which varies from document to document. The summary of the multi-document set $D_{mul}$ is generated consisting of $T$ words, represented as $Y = (y_1, y_2, \ldots, y_i, \ldots, y_T)$, with $i = \overline{1, T}$ and $y_i \in D_{mul}$ or $y_i \notin D_{mul}$ (in the latter case the word is taken from the vocabulary). To solve the abstractive multi-document summarization problem, the thesis deploys two approaches:
- Method 1: Convert the abstractive multi-document summarization problem into an abstractive single-document summarization problem by concatenating the documents in the multi-document set into one large combined document; this combined document is treated as a single document, and the proposed abstractive single-document summarization techniques are applied to generate the final summary.
- Method 2: Convert the abstractive multi-document summarization problem into an abstractive single-document summarization problem by first summarizing each document of the multi-document set to obtain individual summaries, then concatenating these summaries into one combined document. This combined document is treated as a single document, and the proposed abstractive single-document summarization techniques are applied to generate the final summary.
These two abstractive multi-document summarization methods are presented in Section 5.3.
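The two conversion strategies above can be sketched as follows. Note that `summarize_single` is a hypothetical placeholder standing in for the proposed abstractive single-document summarizer (here it simply truncates the text), so this is a sketch of the control flow only, not the thesis's implementation:

```python
def summarize_single(text: str, max_words: int = 50) -> str:
    # Hypothetical stand-in for the proposed abstractive single-document
    # summarizer; here it just keeps the first max_words words.
    return " ".join(text.split()[:max_words])

def method_1(documents: list[str]) -> str:
    # Method 1: concatenate all documents into one combined document,
    # then summarize it as a single document.
    combined = " ".join(documents)
    return summarize_single(combined)

def method_2(documents: list[str]) -> str:
    # Method 2: summarize each document first, concatenate the partial
    # summaries into one combined document, then summarize once more.
    partial = [summarize_single(d) for d in documents]
    combined = " ".join(partial)
    return summarize_single(combined)
```

Method 2 trades one long input for several short ones, which can matter when the single-document model has a limited input length.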
5.2. Extractive multi-document summarization model Kmeans_Centroid_EMDS
5.2.1. Model introduction
Studies on extractive multi-document summarization often group similar sentences from the input documents into clusters and select the central sentence of each cluster for inclusion in the summary [136,137]. Cosine similarity is often used to calculate the similarity between a pair of sentences (sentences are represented as TF-IDF weighted vectors), and the most frequently occurring sentence is considered the center of the cluster. However, this method does not consider the semantics of each word in the text, so the resulting summary may not be semantically good. Another problem with this approach is that some clusters may contain unimportant information from the input documents.
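As a concrete illustration of the conventional TF-IDF/cosine approach described above (an illustrative sketch with a naive whitespace tokenizer, not the thesis's implementation), sentence similarity can be computed as:

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    # Tokenize naively on whitespace; real systems use proper tokenizers.
    docs = [s.lower().split() for s in sentences]
    vocab = sorted({w for d in docs for w in d})
    n = len(docs)
    # df(w) = number of sentences containing w; idf(w) = log(N / df(w))
    df = {w: sum(1 for d in docs if w in d) for w in vocab}
    vectors = []
    for d in docs:
        tf = Counter(d)
        vectors.append([tf[w] / len(d) * math.log(n / df[w]) for w in vocab])
    return vectors

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```

Sentences sharing content words receive a higher cosine score than sentences with disjoint vocabularies, which is exactly the weakness noted above: similarity is purely lexical, not semantic.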
Some studies have applied the centroid-based method to generate summaries, such as [138,139]. This approach generates cluster centroids containing words that are central to all input documents, and the summary is generated by collecting sentences containing the centroid words. The disadvantage of this approach is that it does not prevent information redundancy in the summary. To address this, Carbonell and Goldstein [116] proposed the MMR method for generating summaries; however, that approach does not eliminate unimportant sentences from the summary. It can be said that creating a summary that best describes the input documents while containing minimal redundant information is a major challenge of the multi-document summarization problem. To solve these problems, this thesis proposes an extractive multi-document summarization approach that uses the K-means clustering algorithm to cluster the sentences of the input documents. To handle the selection of representative sentences and the removal of unimportant clusters, the centroid-based method is used to find the most central sentences and to discard clusters containing little information. In addition, the MMR method is applied to remove duplicate information between sentences in the summary. The summary is generated with a reasonable temporal order based on the sentence position feature added to the model. The method is described as follows. First, the input multi-document set $D_{mul} = (D_1, D_2, \ldots, D_i, \ldots, D_G)$ is merged into a single large document of $N$ sentences, represented as $D = (s_1, s_2, \ldots, s_i, \ldots, s_N)$, where $N$ equals the total number of sentences of all documents in $D_{mul}$. Next, a clustering technique is applied to document $D$ to obtain $K$ clusters, represented as $C = (C_1, C_2, \ldots, C_i, \ldots, C_K)$ with $i = \overline{1, K}$, where cluster $C_i = (s_{i1}, s_{i2}, \ldots, s_{in_i})$ consists of $n_i$ sentences and has a corresponding cluster center $c_i$ determined by the algorithm. The centroid-based method is used to find the central sentences and to eliminate clusters containing little information. The sentence $s_i^*$ with the greatest similarity to the cluster center $c_i$ is chosen to represent cluster $C_i$, and the set $S^*$ of $K$ representative sentences corresponding to the $K$ clusters is $S^* = (s_1^*, s_2^*, \ldots, s_K^*)$. Finally, the MMR method, based on the similarity and position features of sentences, is used to select sentences from $S^*$ to include in the summary $S$.
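A minimal sketch of MMR-style sentence selection follows. The trade-off weight `lam` and the similarity inputs are illustrative choices, not the thesis's exact configuration:

```python
def mmr_select(candidates, query_sim, pair_sim, m, lam=0.7):
    # candidates: list of sentence ids
    # query_sim[i]: similarity of sentence i to the overall document centroid
    # pair_sim[i][j]: similarity between sentences i and j
    # m: number of sentences to pick for the summary
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < m:
        def score(i):
            # MMR: reward relevance, penalize similarity to picked sentences
            redundancy = max((pair_sim[i][j] for j in selected), default=0.0)
            return lam * query_sim[i] - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two near-duplicate relevant sentences, MMR picks one of them and then prefers a less relevant but novel sentence over the duplicate, which is the redundancy-removal behavior described above.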
5.2.2. Main components of the model
5.2.2.1. Sentence vectorization
The set of words extracted from the input text needs to be converted into vectors; the length of each vector depends on the vocabulary size or a chosen size. The proposed model uses the word2vec method to vectorize the input text for the clustering model based on the K-means algorithm.
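Once word vectors are available (e.g., from a trained word2vec model), one common way to obtain a sentence vector is to average the vectors of its words. The tiny two-dimensional embedding table below is a toy stand-in for real word2vec output, used only for illustration:

```python
# Toy embedding table standing in for a trained word2vec model's vectors.
word_vectors = {
    "tourism": [0.9, 0.1], "beach": [0.8, 0.2],
    "forest": [0.1, 0.9], "research": [0.2, 0.8],
}

def sentence_vector(sentence, vectors, dim=2):
    # Average the vectors of known words; unknown words are skipped.
    words = [w for w in sentence.lower().split() if w in vectors]
    if not words:
        return [0.0] * dim
    return [sum(vectors[w][k] for w in words) / len(words) for k in range(dim)]
```

The resulting fixed-length sentence vectors are what the K-means clustering step operates on.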
5.2.2.2. K-means for clustering problems
a) Clustering problem
Input:
+ $N$ data points $X = \{x_1, x_2, \ldots, x_N\}$, where each data point $x_i \in \mathbb{R}^d$ belongs to exactly one cluster;
+ $K$: the number of clusters to find ($K < N$).
Output:
+ The centroids of the clusters: $m_1, m_2, \ldots, m_K \in \mathbb{R}^{d \times 1}$.
+ The label of each data point: for each data point $x_i$, we call $y_i = (y_{i1}, y_{i2}, \ldots, y_{iK})$ its label vector; if $x_i$ is assigned to cluster $k$ then $y_{ik} = 1$ and $y_{ij} = 0, \forall j \neq k$ (i.e., the element of $y_i$ corresponding to the cluster of $x_i$ is 1, and all other elements are 0).
With this condition, the constraint on the label vectors can be rewritten as:
$$y_{ij} \in \{0, 1\}, \ \forall i, j; \qquad \sum_{j=1}^{K} y_{ij} = 1, \ \forall i \quad (5.1)$$
If we consider the centroid mk to represent the kth cluster and a data point x i
is assigned to cluster k . The error vector if x i is replaced byequal to m kis x im kI want to draw
This error vector is close to zero, that is,
x i close to
m k . This can be done
present by minimizing the square of the Euclidean distance || x m || 2 .
ik 2
Since x is classified into cluster k , the expression || x m || 2
is rewritten as:
i ik 2
K
|| x m || 2 = y || x m || 2 y || x m || 2 (because y
1, y
0, j k )
ik 2
yes yes 2
j 1
ij ij 2
The error over the entire data set is:

$L(Y, M) = \sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij}\|x_i - m_j\|_2^2$

in which $Y = [y_1, y_2, \dots, y_N]$ and $M = [m_1, m_2, \dots, m_K]$ are the matrices formed by the label vectors of the data points and the centroids of the clusters, respectively. The loss function of the K-means clustering problem is $L(Y, M)$ with the conditions in formula (5.1).
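The loss $L(Y, M)$ can be checked numerically with a short sketch; the toy data in the test are illustrative only.

```python
import numpy as np

def kmeans_loss(X, Y, M):
    """L(Y, M) = sum_i sum_j y_ij * ||x_i - m_j||^2,
    with data X (N x d), one-hot labels Y (N x K), centroids M (K x d)."""
    # Squared Euclidean distance between every point and every centroid.
    dists = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    # The one-hot Y keeps only each point's own cluster distance.
    return float((Y * dists).sum())
```

Because each row of $Y$ is one-hot, only the distance from each point to its own centroid contributes to the sum.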
Thus, we need to solve the optimization problem:

$(Y, M) = \arg\min_{Y, M} \sum_{i=1}^{N} \sum_{j=1}^{K} y_{ij}\|x_i - m_j\|_2^2$ (5.2)

subject to the constraints: $y_{ij} \in \{0, 1\}, \forall i, j$; $\sum_{j=1}^{K} y_{ij} = 1, \forall i$.
To solve problem (5.2), we solve the following two sub-problems:
- Problem 1: Fix $M$, find $Y$ (the centroids are known; the label vectors must be found) so that the loss function is minimized.
+ With the centroids known, the problem of finding the label vectors for all the data reduces to finding the label vector of each data point $x_i$:

$y_i = \arg\min_{y_i} \sum_{j=1}^{K} y_{ij}\|x_i - m_j\|_2^2$ (5.3)

subject to: $y_{ij} \in \{0, 1\}, \forall j$; $\sum_{j=1}^{K} y_{ij} = 1$.
+ Because exactly one element of the label vector $y_i$ equals 1, problem (5.3) reduces to finding the cluster $j$ whose centroid is closest to the point $x_i$:

$j = \arg\min_{j} \|x_i - m_j\|_2^2$ (5.4)

+ Because $\|x_i - m_j\|_2^2$ is the squared Euclidean distance from the point $x_i$ to the centroid $m_j$, we conclude that each point $x_i$ belongs to the cluster whose centroid is nearest to it. From this, the label vector of each data point can be inferred.
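The assignment step of Problem 1 (formula (5.4)) can be sketched as follows, returning the one-hot label matrix $Y$ for fixed centroids.

```python
import numpy as np

def assign_labels(X, M):
    """For fixed centroids M, label each point with its nearest
    centroid (formula (5.4)); returns one-hot Y of shape (N, K)."""
    # Squared Euclidean distance of every point to every centroid.
    dists = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)  # (N, K)
    nearest = dists.argmin(axis=1)
    Y = np.zeros((X.shape[0], M.shape[0]))
    Y[np.arange(X.shape[0]), nearest] = 1.0
    return Y
```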
- Problem 2: Fix $Y$, find $M$ (the cluster of each point is known; the new centroid of each cluster must be found) so that the loss function attains its minimum value.
+ With the label vector of each data point known, the problem of finding the centroid of each cluster becomes:

$m_j = \arg\min_{m_j} \sum_{i=1}^{N} y_{ij}\|x_i - m_j\|_2^2$ (5.5)

+ Because the function to be optimized is continuous and differentiable at every point, we can find the solution by setting its derivative to zero.
Let $l(m_j) = \sum_{i=1}^{N} y_{ij}\|x_i - m_j\|_2^2$ ($l(m_j)$ is the function inside the $\arg\min$); we have the derivative:

$\dfrac{\partial l(m_j)}{\partial m_j} = 2\sum_{i=1}^{N} y_{ij}(m_j - x_i)$ (5.6)

Setting the derivative equal to 0:

$m_j \sum_{i=1}^{N} y_{ij} = \sum_{i=1}^{N} y_{ij} x_i$ (5.7)

We have:

$m_j = \dfrac{\sum_{i=1}^{N} y_{ij} x_i}{\sum_{i=1}^{N} y_{ij}}$ (5.8)

In formula (5.8), $\sum_{i=1}^{N} y_{ij}$ is the number of data points in cluster $j$, so $m_j$ is the average of the points in cluster $j$.
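The centroid update of formula (5.8) can be sketched as follows; for brevity the sketch assumes no cluster is empty (otherwise the division by the cluster size would fail).

```python
import numpy as np

def update_centroids(X, Y):
    """For fixed one-hot labels Y, move each centroid to the mean
    of the points assigned to it (formula (5.8)).
    Assumes every cluster contains at least one point."""
    counts = Y.sum(axis=0)                # number of points per cluster
    return (Y.T @ X) / counts[:, None]    # (K, d) matrix of means
```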
b) K-means clustering algorithm
The K-means algorithm for the data clustering problem [120,121,140] is an unsupervised learning algorithm and one of the most popular clustering methods. The algorithm is summarized as follows.
Algorithm 5.1: K-means clustering algorithm
Input: Data set X; number of clusters K to find;
Output: Centroids M and the label vector Y of each data point;
Algorithm:
1: Randomly select K data points as the initial centroids of the K clusters;
2: Repeat the following steps until the convergence condition is met:
2.1: Assign each data point to the cluster with the closest centroid;
2.2: For each cluster, recalculate its centroid based on the data points belonging to that cluster;
3: Return M, Y;
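Algorithm 5.1 can be sketched end to end as follows; the convergence test used here (labels no longer changing) and the handling of empty clusters (keep the old centroid) are implementation choices made for this sketch.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """K-means per Algorithm 5.1: random initial centroids, then
    alternate label assignment and centroid update until the labels
    stop changing."""
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct data points as the initial centroids.
    M = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(max_iter):
        # Step 2.1: assign each point to the nearest centroid.
        dists = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # convergence condition
            break
        labels = new_labels
        # Step 2.2: recompute each centroid as the mean of its points
        # (keep the old centroid if a cluster becomes empty).
        for k in range(K):
            if (labels == k).any():
                M[k] = X[labels == k].mean(axis=0)
    # Step 3: return centroids and labels.
    return M, labels
```

Because the loss in (5.2) decreases (or stays equal) at every assignment and update step and there are finitely many label configurations, the loop always terminates.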
5.2.2.3. Centroid-based text summarization
Centroid-based methods [13] are commonly used in text summarization to identify the central sentences of a corpus, i.e., sentences that contain the necessary amount of information and are highly relevant to the main topic of the corpus. A sentence vector is represented based on the TF-IDF of the words in the sentence. A word is a centroid word if its TF-IDF value is greater than a given threshold. Sentences containing many centroid words are selected for inclusion in the summary. The model uses the BoW model with TF-IDF weights for the centroid-based text summarization problem.
The centroid-based algorithm for text summarization is described below.
Algorithm 5.2: Centroid-based algorithm for text summarization
Input: Set of sentences;
Output: A summary of the set of input sentences;
Algorithm:
1: Represent each sentence extracted from the input text as a vector (of size equal to the vocabulary size) using the BoW model with TF-IDF weights.
2: Compute the centroid vector $v$: its size equals the vocabulary size, and each element $a_w$ of $v$, representing a word $w$ in the vocabulary, is computed by $a_w = \sum_{s \in S} TF\_IDF(w, s)$, where $S$ is the set of sentences and $TF\_IDF(w, s)$ is the TF-IDF of word $w$ in sentence $s$.
3: Compute the centrality of each sentence as the similarity between its sentence vector and the centroid vector; if the centrality of a sentence is below a given threshold, its centrality is reset to 0. The similarity between a sentence vector $s$ and the centroid vector $v$ is $sim(s, v) = 1 - cosine(s, v)$, where $cosine(s, v) = 1 - \dfrac{s \cdot v}{\|s\|_2 \|v\|_2}$ is the cosine distance between $s$ and $v$.
4: Sort the set of sentences in descending order of centrality.
5: Generate the summary by selecting sentences one by one from the sorted set and adding them to the summary; sentences whose information overlaps with sentences already in the summary are left out.
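The steps of Algorithm 5.2 can be sketched as follows. The TF-IDF formula (raw term frequency times $\log(N/df)$) is one common variant and an assumption of this sketch, and the redundancy check of step 5 is omitted for brevity; the top-scoring sentences are simply returned in their original order.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Step 1: BoW sentence vectors with TF-IDF weights."""
    docs = [s.lower().split() for s in sentences]
    vocab = sorted({w for d in docs for w in d})
    df = Counter(w for d in docs for w in set(d))  # document frequency
    N = len(docs)
    vecs = []
    for d in docs:
        tf = Counter(d)
        vecs.append([tf[w] / len(d) * math.log(N / df[w])
                     for w in vocab])
    return vecs, vocab

def summarize(sentences, n_select=2, threshold=0.0):
    vecs, vocab = tfidf_vectors(sentences)
    # Step 2: centroid vector = per-word sum of TF-IDF over all sentences.
    v = [sum(vec[i] for vec in vecs) for i in range(len(vocab))]
    def cos(a, b):
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return sum(x * y for x, y in zip(a, b)) / (na * nb) if na and nb else 0.0
    # Step 3: centrality = similarity to the centroid, zeroed below threshold.
    scores = [cos(vec, v) if cos(vec, v) >= threshold else 0.0 for vec in vecs]
    # Steps 4-5: take the top-scoring sentences, restore original order.
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])[:n_select]
    return [sentences[i] for i in sorted(ranked)]
```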