Research on methods of indexing and searching for applied text information in libraries  Do Quang Vinh  2
3.3.1 Coordinate Coordinate
One way to offer more flexibility than a simple binary yesorno is to count the number of query terms that appear in each document. The more terms that appear, the more likely the document is to be relevant. The approach is called coordinate matching. The query becomes a hybrid query, intermediate between an AND assembly query and a select OR query: a document containing any of the terms is considered a potential answer, but preference is given to the documents containing all or most of them. All the necessary information is in the IF and the setup is relatively easy.
3.3.2 Product in similarity
The process is formalized as an internal product of a query vector with a set of document vectors.
The similarity of query Q with document Dd is expressed as follows:
S(Q, Dd) = Q . Dd (3.1)
where the math . is the inner product.
Table 3.1 – Vectors for inner product calculation:
(a) Document vector; (b) Query vector.
(a) 
d 

inf 
ret 
sea 
indexing 
bui 
index 
inv 
file 

first 
first 
first 
first 
first 
0 
0 
0 
0 

2 
0 
0 
0 
first 
first 
first 
0 
0 
Maybe you are interested!
 Research on methods of indexing and searching for applied text information in libraries  Do Quang Vinh  1
Wd,t document vector
3 
0 
0 
0 
0 
0 
first 
first 
first 

4 
0 
0 
0 
first 
first 
0 
first 
first 

(b) 
searching 
0 
0 
first 
0 
0 
0 
0 
0 
indexing 
0 
0 
0 
first 
0 
0 
0 
0 
The first problem can be solved by replacing the binary "yes" or "no" evaluation with an integer indicating how many times the term occurs in the document. This occurrence count is called the withindocument frequency of the term fd,t More generally, the term t in document d can be assigned a documentterm weight, denoted wd,t and other weight wq,t in the query vector. Similarity is the internal product of the two weights wd,t and wq,t – the sum of the product of the weights of the query terms and the corresponding document terms:
S(Q, Dd
) Q . Dd
n
w
t1
q,t
wd,t
(3.3)
The second problem does not emphasize hardtofind terms. Indeed, a document with enough occurrences of a common term is always ranked first if the query contains that term, regardless of other words. This can be done by weighting the term according to its inversion document frequency (IDF). The assumption is consistent with Zipf's observations. [82], [83]. Zipf observes that the frequency of an item tends to be inversely proportional to its rank. That is, if rank is considered a measure of importance, then the weight wt of a term t is calculated as follows:
w t 1
f t
(3.5)
where: ft is the number of documents containing the term t.
Then the document vectors are calculated as follows:
wd,t = rd,t (3.8)
or wd,t = rd,t . wt (TF x IDF)
The following method for assigning termdocument weights is known as the TF x IDF rule: term frequency times inverted document frequency.
The queryterm weights wq,t are calculated similarly.
Assume that the document and query vectors are described by
wt = loge(1 + N / ft)
rd,t = 1 + logefd,t rq,t = 1 (3.9) wd,t = rd,t wq,t = rq,t . wt
Therefore, it is common to rely on a normalization factor to ignore the contribution of long documents. Therefore, another variation of the product rule in the similarity assessment is equal to
S(Q, Dd
) tQ wq,t wd,t
(3.10)
Dd
in there
Dd i fd,i
is the length of the document Dd obtained by counting the number of index terms.
3.3.3 Vector space model
Any term weight wt and the relative term frequencies rd,t and the assigned document rq,t and any of the documentterm wd,t and queryterm weights wq,t generated by this assignment, the result is the same – each document is represented by a vector in ndimensional space and the query is also represented by an ndimensional vector.
The similarity for a pair of vectors is the Euclidean distance:
n
t 1
w q, t  wd, t
2
(3.11)
S(Q, Dd )
Then the document vectors are calculated as follows:
wd,t = rd,t (3.8)
or wd,t = rd,t . wt (TF x IDF)
The following method for assigning termdocument weights is known as the TF x IDF rule: term frequency times inverted document frequency.
The queryterm weights wq,t are calculated similarly.
Assume that the document and query vectors are described by
wt = loge(1 + N / ft)
rd,t = 1 + logefd,t rq,t = 1 (3.9) wd,t = rd,t wq,t = rq,t . wt
Therefore, it is common to rely on a normalization factor to ignore the contribution of long documents. Therefore, another variation of the product rule in the similarity assessment is equal to
S(Q, Dd
) tQ wq,t wd,t
(3.10)
Dd
in there
Dd i fd,i
is the length of the document Dd obtained by counting the number of index terms.
3.3.3 Vector space model
Any term weight wt and the relative term frequencies rd,t and the assigned document rq,t and any of the documentterm wd,t and queryterm weights wq,t generated by this assignment, the result is the same – each document is represented by a vector in ndimensional space and the query is also represented by an ndimensional vector.
The similarity for a pair of vectors is the Euclidean distance:
n
t 1
w q, t  wd, t
2
(3.11)
S(Q, Dd )
What really matters is the direction indicated by the two vectors or more precisely the difference in direction, regardless of length.
XY
Angle is calculated from
cos
X Y
(3.14)
Law of cosine for ratings:
cos(Q, Dd )
Q Dd 1
Q Dd
WW
n
∑ wq,t wd,t
(3.15)
qdt1
where: Wd is the Euclidean length – weight – of the document d; Wq is the weight of the query.
This rule can be used with any of the term weighting methods described above. For example, suppose that the variant described in equation (3.9) is used. Then compute the similarity described by (3.18):
1 N
W
cos(Q, Dd )
d Wq
tQDd
(1 loge fd,t ) loge 1 ⎟⎟
f
t
3.4 ASSESSMENT OF SEARCH PERFORMANCE
3.4.1 Accuracy and recovery
Evaluation of search performance is based on the following two main parameters [45], [82], [83], [86], [122], [145], [159].
The precision P of a ranking method for some cutoff point r is the fraction of the highestranking documents r relevant to the query:
P compare the relevant cardiac data
compare the results of the heart failure
(3.19)
At what value r is the recovery (recall) R of a method?
which is the highest proportion of total relevant documents searched in r:
R Compare the relevant data with the related data
3.4.2 Recoveryaccuracy curve
(3.20)
Due to the fact that we have a good score
CHEAP
P
CHEAP
200
100
0
PR Curve
200
CHEAP
P
P
100
0
CHEAP
Figure 3.1 – PR curve for class of table 3.2
3.5 COSIN MEASUREMENT
The author investigates the cosine measure. Obviously, more information is required than BQ processing and making decisions about this information should be structured to make ranking processing efficient in limited time and memory. request. The techniques developed here allow RQs to be evaluated on large databases using no more memory space and
CPU time compared to that required by the BQ rating.
3.5.1 Frequency within the document
3.5.2 Calculating the measure of cosine
The author evaluates the cosine measure using the weighting rule TFxIDF. The simplest strategy is to read each document in the database, compute a cosine value for it, and maintain a sorted list.
rank of the highest cosine r values found to the extent along with the text of the corresponding document.
3.5.3 Memory for document weights
3.5.4 Arrangement
The final component of the ranking process is sorting.
Conclusion of chapter 3
▪ Detailed analysis of the canonical information search model based on Boolean BQ query currently used in most library systems, pointing out the disadvantages of BQ query
▪ Propose a text search model based on RQ ranking query with performance evaluation based on accuracy P and recovery R.
▪ Detailed survey of cosine measure.
CHAPTER 4  IFID CONSTRUCTION ARTICLE
4.1 INTRODUCTION
The author investigates the problem of building an IFID inverted file index, because this is the most practical index for both BQ and RQ queries.
Table 4.1  Frequency matrix for text of table 2.2
Terms 

inf 
ret 
sea 
ind 
bui 
index 
inv 
file 

first 
first 
first 
 
first 
 
 
 
 
2 
 
 
 
first 
first 
first 
 
 
3 
 
 
 
 
 
first 
first 
first 
4 
 
 
 
first 
first 
 
first 
first 
Table 4.2  Equivalent displacement of frequency matrix
Number 
Terms 
Document 

first 
2 
3 
4 

first 
information 
first 
 
 
 
2 
retrieval 
first 
 
 
 
3 
searching 
 
 
 
 
4 
indexing 
first 
first 
 
first 
5 
building 
 
first 
 
first 
6 
index 
 
first 
first 
 
7 
inverted 
 
 
first 
first 
8 
file 
 
 
first 
first 
4.2 KITCHEN LIST INSTALLATION ALTERNATIVE
In fact, a crossreference is just another name for an island index, where each term of a text is listed in alphabetical order, along with a list of line numbers that appear in it. The inversion time T is:
T = Btr + Ftp + (read and parse text) I(td + tr) (write IF compressed) seconds, where symbols are defined in table 4.3.
For GBsized databases, the linked list approach is not suitable because it requires either too much memory or too much time. However, it is the best method for small databases.
4.3 ORDERBASED ALTERNATIVE
The main problem with the algorithm discussed above is that it requires too much memory and uses a mostly random data access sequence, which prevents an efficient mapping from memory to disk. Sequential access is the only efficient processing method for large disk files because transfer rates are often high and random searches take time. Furthermore, disk usage seems unavoidable for the amount of data under consideration and as such, the inversion algorithm should perform sequential processing on any requested disk files. The consideration leads to a sortbased inversion algorithm [4], [10], [29], [81].
The execution time is:
T = Btr + Ftp + 10ftr + (read and parse, write files) 20ftr + R(1.2k log k)tc + (sort programs)
[log R] (20ftr + ftc)+ (mix programs)
10ftr + I(td + tr) (write IF compressed)
Huge disk space requirements, meaning that although allowing simple sortbased inversion is the best algorithm for medium sized databases of about 10100 MB, it is not suitable for really large databases of GB size.
4.4 DIRECT Index Compression Algorithm
4.4.1 Algorithm for mixing multiple paths
Now, the mixing is more processor oriented than disk oriented and further reduction in time can be achieved by using multipath merge, resulting in the proposed arrangement based multipath merge algorithm. by Moffat and Bell [108].
The approach can be taken more deeply. Assuming all R programs are written to a temporary file, it next performs a single Rpath merge. Execution time:
T = Btr + Ftp + (read and parse) R(1.2k log k)tc + I'(tr + td) + (sort, compress and write) f [log R]tc + I'(ta) /b + tr + td) + (mix)
I(tr + td) (recompressed) seconds, where b ≤ M/R is the size of the input buffer allocated for each program and k, R and I' as above.
4.4.2 Algorithm to mix multiple paths in place
During the Rline merge described above, 1 block b B from each program is in memory, providing recruitment into the heap. At the start of mixing, the first block from each program is read. Every time the last triple from any particular block is put on the heap, a replacement block is read. Assume the last block in each program is so stuffed that it is too exact long b. Padding slightly increases the size of the temporary file but means that each compressed program takes up an integer block; As we'll soon notice, this allows for significant space savings elsewhere.
The execution time is:
T = Btr + Ftp + (read and parse) R(1.2k log k)tc + I'(tr + td) + (sort, compress and write) f [ log R]tc + (I' + I)( ts/b + tr + td) + (mix and reencode) 2I'( ts/b + tr) (permutation) seconds, where k = (M  L)/10, R = [f /k], b < M / (R + 1) and I' is the maximum size of IF, assuming I' = 1.35 I.
4.5 INTERNAL MEMORY INSIDE Compression Algorithm
4.5.1 Large memory inversion algorithm
Assume a machine with very large main memory. If for each term t the document frequency ft is well known at the start of the inversion, a large inmemory array can be allocated exactly the appropriate size to store the list of d and d documents. frequency fd,t. Island times are:
T = Btr + Ftp + (first pass, read and parse)
Btr + Ftp + 2I' td + I(tr + td) + (second turn, island)
4.5.2 Vocabularybased division algorithm
Like the simple sortbased inversion algorithm, the "large memory" algorithm is only suitable for mediumsized databases. The required time is:
T = Btr + Ftp + (read and parse) l(Btr + Ftp) + 2I'td + I(tr + td) (load processing) seconds, where l is the number of loads and I' = 1.05I .
4.5.3 Textbased division algorithm
The basis for the breakdown of work, assuming the text divides itself rather than the vocabulary. First, an IF is generated for an initial burst, then, for a second burst and so on, merge all the partial IFs into a final IF. The author sees a case where inplace mixing can be performed and here a similar application where a different inplace mixing strategy can be used. The execution time is:
T = Btr + Ftp + (read and parse) Btr + Ftp + 3I'td + 2cI'(ts/b+ tr) (inplace) (I' + I) (ts/b+ tr + td) ( solid concatenation) seconds, where c = I'/(M – L/3) is the number of text bursts to be cut into and as before, I'1.05I and b are a suitable block size.
4.6 COMPARISON OF ISLAND Algorithms
The algorithms that best deal with a large database are the sort, multiple, merge, inplace algorithm in section 4.4.2 and the textbased partitioning algorithm in 4.5.3.
4.7 DYNAMIC DATABASE
Above, the author investigates indexing algorithms with the assumption that the database is static. However, a database is rarely truly static. Therefore, the problem of dynamic database cannot be ignored. A database can be dynamic in one of two ways: text expansion or index expansion.
Conclusion Chapter 4
▪ Detailed analysis of the classic algorithms: the concatenated list inversion algorithm and the sortbased inversion algorithm, thereby showing their limitation that they are only suitable for small and mediumsized text document databases. .
▪ Propose two inplace multipath merging algorithms based on sorting and textbased partitioning algorithms.
▪ Compare the inverse algorithms, then draw conclusions that the two inplace multipath merge algorithms based on sorting and the textbased partitioning algorithm are suitable for largesized text document databases in digital libraries.
▪ Examine the dynamic database problem in two ways: text expansion and index expansion.
CONCLUDE
The conclusions drawn from the thesis include:
1. The thesis proposes a formal model for digital libraries based on modern algebra: A digital library is a set of four (R, MC, DV, XH), in which:
▪ R is a warehouse;
▪ MC is a metadata index;
▪ DV is a set of services containing at least indexing, searching, and browsing services;
▪ XH is a community of digital library users.
2. The thesis analyzes in detail the methods of indexing text documents in digital libraries: the IFID inversion file indexing method and the SFID digitally signed indexing method, comparing the two indexing methods, drawing rules Document indexes in digital libraries are: In most applications, IF performs better than SF in the range of both index size and query speed. Compressed IF is undoubtedly the most useful indexing method of a large database of variable length text documents. The thesis analyzes global compression models and hyperbolic local compression models, thereby, proposing Bernoulli local compression models and interpolation compression for IFID based on mathematical and statistical methods. encryption method, data compression method.
3. The thesis analyzes in detail the classic information search model based on the Boolean BQ query currently used in most of the library systems, pointing out the disadvantages of the BQ query. From there, the thesis proposes a text search model based on RQ ranking query with performance evaluation based on accuracy P and recovery R.
4. The thesis analyzes in detail the classic algorithms: the concatenated linked list and the sortbased inversion algorithm, pointing out their limitations, which are only suitable for smallsized text document databases and fit. From that, the thesis proposes two algorithms for inplace multipath merge based on sorting and textbased partitioning algorithm suitable for largesized text document databases in digital libraries.
Future research directions
The author intends to conduct further research in the future:
1. Research on image search and indexing methods;
2. Researching video search and indexing methods;
3. Research the problem of summarizing and extracting text documents in digital libraries.
LIST OF WORKS
1. Do Quang Vinh, Quach Tuan Ngoc (2001), “A temporal objectoriented data model for structured documents”, Journal of Posts and Telecommunications & Information Technology, 160(6), p. . 2932.
2. Do Quang Vinh (2005), "The model of compression indexing of island files in digital libraries", Proceedings of the 8th National Conference on Selected Issues of Information and Communication Technology, Hai Phong, p. 666674.
3. Do Quang Vinh (2005), “Method of indexing documents in digital libraries”, Journal of Posts and Telecommunications & Information Technology, 265, p. 4047.
4. Do Quang Vinh (2005), “Summary and extraction of textual documents in digital libraries”, Science and Technology Journal  Vietnam Academy of Science and Technology, vol. 43, no. 4, p. 614.
5. Do Quang Vinh (2006), “A method of finding information based on BCH codes in digital libraries”, Journal of Science and Technology  Vietnam Academy of Science and Technology, vol. 44, no. 1, p.1118.
6. Do Quang Vinh (2006), "Querying and ranking text documents in digital libraries", Proceedings of the Ninth National Conference on Selected Issues of Information and Communication Technology, Da Lat .
Related post
 Research on the impact of loan portfolios on profitability of joint stock commercial banks  1
 Research on the impact of real estate prices on bank credit  1
 Evaluation of geographical conditions and resources serving tourism territorial organization of Vinh Phuc province  1
 Agricultural development in Nam Giang district, Quang Nam province  1
 Research on the application of exhaust heat for refrigeration and air conditioning  1
Send a message
Popular
 What is the Binance Stop limit order? How to set a stop limit order on Binance
 Developing tourism products in Da Nang city
 Is StormGain legit or a scam? StormGain reviews
 Managing the teaching of Cham characters to Cham people in Ham Thuan Bac district, Binh Thuan province
 Solutions to promote the efficiency of lending activities for individual customers at Military Commercial Joint Stock Bank
 Factors affecting liquidity risk of Vietnamese commercial banks
 How to register for a Binance account?
 How to view saved wifi password on phone without root
 How to get wifi password on Samsung without root
Latest post
 Completing the credit rating system of Bank for Agriculture and Rural Development of Vietnam  1
 The transmission from the interest rate policy of the State bank to the deposit and lending rates at Bank for Agriculture and Rural Development of Vietnam  1
 Bank lending channel and monetary policy transmission in Vietnam  1
 Developing card services at Viet A Commercial Joint Stock Bank  1
 Developing card services at Asia Commercial Joint Stock Bank  1
 Solution to complete international payment activities by documentary credit method at Joint Stock Commercial Bank for Foreign Trade of Vietnam, Vung Tau branch  1
 Ownership structure and business performance of Vietnamese commercial banks  1
 The influence of corporate social responsibility on the loyalty of customers depositing money at Vietnamese commercial banks  1
 Agricultural cooperative development in Dak Lak province  1
Message