Research on methods of indexing and searching for applied text information in libraries - Do Quang Vinh - 2

3.3.1 Coordinate matching

One way to offer more flexibility than a simple binary yes-or-no decision is to count the number of query terms that appear in each document. The more query terms a document contains, the more likely it is to be relevant. This approach is called coordinate matching. The query becomes a hybrid, intermediate between a conjunctive AND query and a disjunctive OR query: a document containing any of the terms is considered a potential answer, but preference is given to documents containing all or most of them. All the necessary information is in the IF and the implementation is relatively easy.
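As an illustration, a minimal sketch of coordinate matching over a small inverted file (the term lists mirror those of table 4.2 below; the scoring function itself is a hypothetical sketch, not code from the thesis):

```python
from collections import Counter

# Toy inverted file: term -> list of documents containing it (illustrative data).
inverted_file = {
    "information": [1],
    "retrieval": [1],
    "indexing": [1, 2, 4],
    "building": [2, 4],
    "index": [2, 3],
    "inverted": [3, 4],
    "file": [3, 4],
}

def coordinate_match(query_terms, inverted_file):
    """Rank documents by how many query terms they contain."""
    scores = Counter()
    for term in query_terms:
        for doc in inverted_file.get(term, []):
            scores[doc] += 1
    # Documents containing more of the query terms rank higher.
    return scores.most_common()

print(coordinate_match(["indexing", "inverted", "file"], inverted_file))
# e.g. [(4, 3), (3, 2), (1, 1), (2, 1)] - document 4 contains all three terms
```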

3.3.2 Inner product similarity

The process is formalized as an inner product of a query vector with a set of document vectors.

The similarity of query Q with document Dd is expressed as follows:

S(Q, Dd) = Q · Dd (3.1)

where · denotes the inner product.

Table 3.1 – Vectors for inner product calculation: (a) document vectors; (b) query vectors.

(a) Document vectors wd,t

d | inf | ret | sea | indexing | bui | index | inv | file
1 |  1  |  1  |  1  |    1     |  0  |   0   |  0  |  0
2 |  0  |  0  |  0  |    1     |  1  |   1   |  0  |  0
3 |  0  |  0  |  0  |    0     |  0  |   1   |  1  |  1
4 |  0  |  0  |  0  |    1     |  1  |   0   |  1  |  1

(b) Query vectors wq,t

query     | inf | ret | sea | indexing | bui | index | inv | file
searching |  0  |  0  |  1  |    0     |  0  |   0   |  0  |  0
indexing  |  0  |  0  |  0  |    1     |  0  |   0   |  0  |  0

The first problem can be solved by replacing the binary "yes" or "no" evaluation with an integer indicating how many times the term occurs in the document. This occurrence count is called the within-document frequency of the term, fd,t. More generally, the term t in document d can be assigned a document-term weight, denoted wd,t, and another weight wq,t in the query vector. The similarity is then the inner product of the two sets of weights wd,t and wq,t, that is, the sum of the products of the weights of the query terms and the corresponding document terms:

S(Q, Dd) = Q · Dd = Σ(t=1..n) wq,t · wd,t (3.3)
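A minimal sketch of the inner-product similarity of equation (3.3); the toy weights below are illustrative, not taken from the thesis:

```python
def inner_product(query_weights, doc_weights):
    """S(Q, D_d) = sum over terms t of w_{q,t} * w_{d,t}  (equation 3.3)."""
    return sum(w_q * doc_weights.get(term, 0.0)
               for term, w_q in query_weights.items())

# Hypothetical weights: the query uses w_{q,t} = 1, the documents use
# within-document frequencies f_{d,t} as w_{d,t}.
query = {"searching": 1.0, "indexing": 1.0}
documents = {
    1: {"information": 1, "retrieval": 1, "indexing": 1},
    2: {"indexing": 1, "building": 1, "index": 1},
}

for d, vec in documents.items():
    print(d, inner_product(query, vec))
# Both documents score 1.0: each contains exactly one of the two query terms.
```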

The second problem is that hard-to-find (rare) terms are not emphasized. Indeed, a document with enough occurrences of a common term will always be ranked first if the query contains that term, regardless of the other words. This can be remedied by weighting each term according to its inverse document frequency (IDF). The assumption is consistent with Zipf's observations [82], [83]: Zipf observed that the frequency of an item tends to be inversely proportional to its rank. That is, if rank is taken as a measure of importance, then the weight wt of a term t can be calculated as:

wt = 1 / ft (3.5)

where: ft is the number of documents containing the term t.

Then the document vectors are calculated as follows:

wd,t = rd,t (3.8)

or wd,t = rd,t · wt (TF x IDF)

This method of assigning term-document weights is known as the TF x IDF rule: term frequency times inverse document frequency.

The query-term weights wq,t are calculated similarly.

Assume that the document and query vectors are described by

wt = loge(1 + N/ft)
rd,t = 1 + loge fd,t,  rq,t = 1 (3.9)
wd,t = rd,t,  wq,t = rq,t · wt
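A short sketch of the weighting scheme in (3.9), assuming the collection size N and the document frequencies ft are already known; the numbers used are illustrative only:

```python
import math

N = 4  # illustrative collection size

def term_weight(f_t, N=N):
    """IDF component: w_t = log_e(1 + N / f_t)."""
    return math.log(1 + N / f_t)

def doc_term_weight(f_dt):
    """w_{d,t} = r_{d,t} = 1 + log_e f_{d,t}."""
    return 1 + math.log(f_dt)

def query_term_weight(f_t, N=N):
    """w_{q,t} = r_{q,t} * w_t with r_{q,t} = 1."""
    return term_weight(f_t, N)

# A term occurring twice in a document, and appearing in 3 of the N = 4 documents:
print(doc_term_weight(2))    # 1 + ln 2      ~ 1.693
print(query_term_weight(3))  # ln(1 + 4/3)   ~ 0.847
```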

Long documents would otherwise dominate, so it is common to rely on a normalization factor that discounts their contribution. Another variation of the inner product rule for similarity assessment is therefore

S(Q, Dd) = (1 / |Dd|) Σ(t∈Q) wq,t · wd,t (3.10)

where |Dd| = Σi fd,i is the length of the document Dd, obtained by counting its index terms.

3.3.3 Vector space model

Whatever term weights wt, relative term frequencies rd,t and rq,t, and document-term and query-term weights wd,t and wq,t are produced by this assignment, the result is the same: each document is represented by a vector in n-dimensional space, and the query is likewise represented by an n-dimensional vector.

The similarity for a pair of vectors is the Euclidean distance:

n

t 1

w q, t - wd, t

2

(3.11)

S(Q, Dd )


What really matters is the direction of the two vectors, or more precisely the difference in their directions, regardless of their lengths.

The angle θ between two vectors X and Y is calculated from

cos θ = (X · Y) / (|X| |Y|) (3.14)

The cosine rule for ranking:

cos(Q, Dd) = (Q · Dd) / (|Q| |Dd|) = (1 / (Wq Wd)) Σ(t=1..n) wq,t · wd,t (3.15)

where Wd is the Euclidean length (the weight) of document d, and Wq is the weight of the query.

This rule can be used with any of the term-weighting methods described above. For example, suppose that the variant described in equation (3.9) is used. Then the similarity is computed as in (3.18):

cos(Q, Dd) = (1 / (Wd Wq)) Σ(t ∈ Q∩Dd) (1 + loge fd,t) · loge(1 + N/ft) (3.18)
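As an illustration, a minimal sketch of the cosine of (3.18) for a single document, assuming hypothetical within-document frequencies fd,t, document frequencies ft and collection size N:

```python
import math

def cosine(query_terms, doc_tf, doc_freq, N):
    """cos(Q, Dd) per (3.18): sum over t in Q and Dd of
    (1 + ln f_{d,t}) * ln(1 + N / f_t), divided by W_d * W_q."""
    w_t = {t: math.log(1 + N / f) for t, f in doc_freq.items()}
    # W_d: Euclidean length of the document vector with w_{d,t} = 1 + ln f_{d,t}.
    W_d = math.sqrt(sum((1 + math.log(f)) ** 2 for f in doc_tf.values()))
    # W_q: Euclidean length of the query vector with w_{q,t} = w_t.
    W_q = math.sqrt(sum(w_t[t] ** 2 for t in query_terms if t in w_t))
    if W_d == 0 or W_q == 0:
        return 0.0
    dot = sum((1 + math.log(doc_tf[t])) * w_t[t]
              for t in query_terms if t in doc_tf)
    return dot / (W_d * W_q)

# Illustrative data only: one document and an N = 4 document collection.
doc_tf = {"indexing": 2, "inverted": 1, "file": 1}                    # f_{d,t}
doc_freq = {"indexing": 3, "inverted": 2, "file": 2, "searching": 1}  # f_t
print(cosine(["searching", "indexing"], doc_tf, doc_freq, N=4))
```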

3.4 ASSESSMENT OF SEARCH PERFORMANCE

3.4.1 Precision and recall

Evaluation of search performance is based on the following two main parameters [45], [82], [83], [86], [122], [145], [159].

The precision P of a ranking method at some cutoff point r is the fraction of the r highest-ranked documents that are relevant to the query:

P = (number of relevant documents retrieved) / (total number of documents retrieved) (3.19)

At the same cutoff r, the recall R of a method is the proportion of all relevant documents that appear among the top r documents retrieved:

R = (number of relevant documents retrieved) / (total number of relevant documents) (3.20)
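A short sketch of (3.19) and (3.20) at a cutoff r, assuming a hypothetical ranked answer list and a hypothetical set of relevance judgements:

```python
def precision_recall_at(r, ranking, relevant):
    """P = relevant retrieved / retrieved; R = relevant retrieved / total relevant."""
    top_r = ranking[:r]
    hits = sum(1 for d in top_r if d in relevant)
    precision = hits / r
    recall = hits / len(relevant)
    return precision, recall

ranking = [4, 3, 1, 2, 7, 5]   # hypothetical ranked answer list
relevant = {3, 4, 5}           # hypothetical relevance judgements
print(precision_recall_at(3, ranking, relevant))   # (0.666..., 0.666...)
```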

3.4.2 Recall-precision curve

Figure 3.1 – Recall-precision (P-R) curve for the ranking of table 3.2 (precision P plotted against recall R).

3.5 THE COSINE MEASURE

The author investigates the cosine measure. Clearly, more information is required than for BQ processing, and decisions must be made about how this information should be structured so that ranking can be carried out efficiently within limited time and memory. The techniques developed here allow RQs to be evaluated on large databases using no more memory space and CPU time than is required for BQ evaluation.

3.5.1 Within-document frequency

3.5.2 Computing the cosine measure

The author evaluates the cosine measure using the TF x IDF weighting rule. The simplest strategy is to read each document in the database, compute a cosine value for it, and maintain a sorted list of the r highest cosine values found so far, together with the corresponding documents.
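A minimal sketch of that strategy, keeping only the r largest cosine values seen while scanning the collection; a bounded min-heap stands in for the sorted list, and the scores below are hypothetical stand-ins for real cosine values:

```python
import heapq

def top_r_by_cosine(documents, cosine_of, r):
    """Scan every document, keep the r largest cosine scores seen so far."""
    heap = []  # min-heap of (score, doc_id), size <= r
    for doc_id, doc in documents.items():
        score = cosine_of(doc)
        if len(heap) < r:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    return sorted(heap, reverse=True)  # best first

# Hypothetical precomputed scores instead of a real cosine computation:
docs = {1: 0.12, 2: 0.57, 3: 0.33, 4: 0.91}
print(top_r_by_cosine(docs, cosine_of=lambda s: s, r=2))
# [(0.91, 4), (0.57, 2)]
```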

3.5.3 Memory for document weights

3.5.4 Sorting

The final component of the ranking process is sorting.

Conclusion of chapter 3

▪ Detailed analysis of the classic information search model based on Boolean BQ queries currently used in most library systems, pointing out the disadvantages of BQ queries.

▪ Propose a text search model based on RQ ranking queries, with performance evaluation based on precision P and recall R.

▪ Detailed survey of cosine measure.

CHAPTER 4 - BUILDING THE INVERTED FILE INDEX IFID

4.1 INTRODUCTION

The author investigates the problem of building an inverted file index (IFID), because this is the most practical index for both BQ and RQ queries.

Table 4.1 - Frequency matrix for the text of table 2.2

Document | inf | ret | sea | ind | bui | index | inv | file
    1    |  1  |  1  |  -  |  1  |  -  |   -   |  -  |  -
    2    |  -  |  -  |  -  |  1  |  1  |   1   |  -  |  -
    3    |  -  |  -  |  -  |  -  |  -  |   1   |  1  |  1
    4    |  -  |  -  |  -  |  1  |  1  |   -   |  1  |  1

Table 4.2 - Transpose of the frequency matrix

Number | Term        | Doc 1 | Doc 2 | Doc 3 | Doc 4
   1   | information |   1   |   -   |   -   |   -
   2   | retrieval   |   1   |   -   |   -   |   -
   3   | searching   |   -   |   -   |   -   |   -
   4   | indexing    |   1   |   1   |   -   |   1
   5   | building    |   -   |   1   |   -   |   1
   6   | index       |   -   |   1   |   1   |   -
   7   | inverted    |   -   |   -   |   1   |   1
   8   | file        |   -   |   -   |   1   |   1

4.2 LINKED-LIST-BASED INVERSION

In fact, a cross-reference listing is just another name for an inverted index, in which each term of a text is listed in alphabetical order together with a list of the line numbers in which it appears. The inversion time T is:

T = Btr + Ftp (read and parse text)
    + I(td + tr) (write compressed IF)

seconds, where the symbols are defined in table 4.3.

For GB-sized databases, the linked list approach is not suitable because it requires either too much memory or too much time. However, it is the best method for small databases.
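A minimal sketch of this in-memory inversion, with a Python dictionary of lists standing in for the linked lists; the two-document collection and the whitespace tokenization are hypothetical:

```python
from collections import defaultdict, Counter

def invert_in_memory(collection):
    """Build term -> [(doc_id, f_{d,t}), ...] lists in one pass over the text."""
    index = defaultdict(list)
    for doc_id, text in collection.items():
        for term, f_dt in sorted(Counter(text.lower().split()).items()):
            index[term].append((doc_id, f_dt))
    return dict(sorted(index.items()))   # terms in alphabetical order

collection = {
    1: "information retrieval indexing",
    2: "indexing building index",
}
for term, postings in invert_in_memory(collection).items():
    print(term, postings)
# building [(2, 1)] ... indexing [(1, 1), (2, 1)] ... retrieval [(1, 1)]
```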

4.3 SORT-BASED INVERSION

The main problem with the algorithm discussed above is that it requires too much memory and uses a mostly random data-access pattern, which prevents an efficient mapping from memory to disk. Sequential access is the only efficient processing method for large disk files, because transfer rates are high but random seeks are slow. Furthermore, disk usage seems unavoidable for the amount of data under consideration and, as such, the inversion algorithm should process any disk files it requires sequentially. These considerations lead to a sort-based inversion algorithm [4], [10], [29], [81].

The execution time is:

T = Btr + Ftp + 10ftr (read and parse text, write runs)
    + 20ftr + R(1.2k log k)tc (sort runs)
    + ⌈log R⌉(20ftr + ftc) (merge runs)
    + 10ftr + I(td + tr) (write compressed IF)

The huge disk space requirement means that although its simplicity makes sort-based inversion the best algorithm for medium-sized databases of about 10-100 MB, it is not suitable for really large databases of GB size.
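For contrast with the linked-list method, a sketch of the sort-based idea under the simplifying assumption that all (t, d, fd,t) triples fit in memory; a real implementation would write sorted runs to disk and merge them:

```python
from collections import Counter
from itertools import groupby

def sort_based_invert(collection):
    """Emit (term, doc, f_{d,t}) triples, sort them, then group them into an IF."""
    triples = [(term, doc_id, f_dt)
               for doc_id, text in collection.items()
               for term, f_dt in Counter(text.lower().split()).items()]
    triples.sort()                                 # sort by (term, doc)
    return {term: [(d, f) for _, d, f in group]
            for term, group in groupby(triples, key=lambda t: t[0])}

collection = {1: "information retrieval indexing",
              2: "indexing building index"}
print(sort_based_invert(collection))
```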

4.4 COMPRESSED INVERSION ALGORITHM

4.4.1 Multiway merging algorithm

Now the merging is processor-bound rather than disk-bound, and a further reduction in time can be achieved by using a multiway merge, which results in the sort-based multiway merge algorithm proposed by Moffat and Bell [108].

The approach can be taken further. Assuming that all R runs are written to a temporary file, a single R-way merge is then performed. The execution time is:

T = Btr + Ftp (read and parse)
    + R(1.2k log k)tc + I'(tr + td) (sort, compress and write runs)
    + f⌈log R⌉tc + I'(ts/b + tr + td) (merge)
    + I(tr + td) (recompress)

seconds, where b ≤ M/R is the size of the input buffer allocated to each run, and k, R and I' are as above.

4.4.2 In-place multiway merging algorithm

During the R-way merge described above, one block of b bytes from each run is held in memory, providing candidates for insertion into the heap. At the start of the merge, the first block of each run is read. Whenever the last triple from a particular block has been placed on the heap, a replacement block is read. Assume that the last block of each run is padded so that it is exactly b bytes long. The padding slightly increases the size of the temporary file, but means that each compressed run occupies an integral number of blocks; as will soon become apparent, this allows significant space savings elsewhere.

The execution time is:

T = Btr + Ftp (read and parse)
    + R(1.2k log k)tc + I'(tr + td) (sort, compress and write runs)
    + f⌈log R⌉tc + (I' + I)(ts/b + tr + td) (merge and re-encode)
    + 2I'(ts/b + tr) (in-place permutation)

seconds, where k = (M − L)/10, R = ⌈f/k⌉, b < M/(R + 1) and I' is the maximum size of the IF, taken to be I' = 1.35I.
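A sketch of the R-way merge step itself, assuming each run is already a sorted sequence of (term, document, fd,t) triples; heapq.merge plays the role of the heap of candidates, and the block buffering and padding described above are omitted:

```python
import heapq

def r_way_merge(runs):
    """Merge R sorted runs of (term, doc, f_{d,t}) triples into one sorted stream."""
    return heapq.merge(*runs)   # maintains a heap with one candidate per run

# Hypothetical runs, each sorted by (term, doc):
run1 = [("index", 2, 1), ("indexing", 1, 1)]
run2 = [("building", 2, 1), ("indexing", 2, 1)]
for triple in r_way_merge([run1, run2]):
    print(triple)
# ('building', 2, 1), ('index', 2, 1), ('indexing', 1, 1), ('indexing', 2, 1)
```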

4.5 IN-MEMORY COMPRESSED INVERSION ALGORITHMS

4.5.1 Large memory inversion algorithm

Assume a machine with a very large main memory. If, for each term t, the document frequency ft is known at the start of the inversion, a large in-memory array of exactly the right size can be allocated to store the lists of document numbers d and within-document frequencies fd,t. The inversion time is:

T = Btr + Ftp (first pass: read and parse)
    + Btr + Ftp + 2I'td + I(tr + td) (second pass: invert)

seconds.
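A sketch of the two-pass idea: the first pass counts the document frequencies ft so that posting arrays of exactly the right size can be allocated, and the second pass fills them; presized Python lists stand in for the in-memory arrays, and the collection is hypothetical:

```python
from collections import Counter

def two_pass_invert(collection):
    # Pass 1: document frequencies f_t, so each posting array can be presized.
    f_t = Counter()
    for text in collection.values():
        f_t.update(set(text.lower().split()))
    postings = {t: [None] * n for t, n in f_t.items()}  # exact-size arrays
    filled = Counter()
    # Pass 2: fill in (d, f_{d,t}) entries in document order.
    for doc_id in sorted(collection):
        for term, f_dt in Counter(collection[doc_id].lower().split()).items():
            postings[term][filled[term]] = (doc_id, f_dt)
            filled[term] += 1
    return postings

collection = {1: "information retrieval indexing",
              2: "indexing building index"}
print(two_pass_invert(collection)["indexing"])   # [(1, 1), (2, 1)]
```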

4.5.2 Lexicon-based partitioning algorithm

Like the simple sort-based inversion algorithm, the "large memory" algorithm is only suitable for medium-sized databases. The required time is:

T = Btr + Ftp (read and parse)
    + l(Btr + Ftp) + 2I'td + I(tr + td) (process loads)

seconds, where l is the number of loads and I' = 1.05I.

4.5.3 Text-based partitioning algorithm

The same partitioning idea is the basis for this method, except that it is the text rather than the lexicon that is divided. First, an IF is generated for an initial chunk of the text, then for a second chunk, and so on; finally, all the partial IFs are merged into one final IF. The author has already seen a case in which in-place merging can be performed, and here is a similar situation in which a different in-place merging strategy can be used. The execution time is:

T = Btr + Ftp (read and parse)
    + Btr + Ftp + 3I'td + 2cI'(ts/b + tr) (in-place merge)
    + (I' + I)(ts/b + tr + td) (final concatenation)

seconds, where c = I'/(M − L/3) is the number of chunks the text is cut into and, as before, I' ≈ 1.05I and b is a suitable block size.
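A sketch of the text-partitioning idea: each chunk of the text is inverted separately and the partial IFs are then merged; the chunking and the simple concatenating merge below are simplifications of the compressed, in-place version analyzed above:

```python
from collections import Counter, defaultdict

def invert_chunk(chunk):
    """Partial IF for one chunk of documents: term -> [(doc, f_{d,t}), ...]."""
    partial = defaultdict(list)
    for doc_id in sorted(chunk):
        for term, f in Counter(chunk[doc_id].lower().split()).items():
            partial[term].append((doc_id, f))
    return partial

def merge_partial_ifs(partials):
    """Concatenate the partial lists term by term (chunks cover disjoint, increasing document ranges)."""
    final = defaultdict(list)
    for partial in partials:            # chunks processed in document order
        for term, postings in partial.items():
            final[term].extend(postings)
    return dict(final)

chunk1 = {1: "information retrieval indexing"}
chunk2 = {2: "indexing building index"}
print(merge_partial_ifs([invert_chunk(chunk1), invert_chunk(chunk2)])["indexing"])
# [(1, 1), (2, 1)]
```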

4.6 COMPARISON OF INVERSION ALGORITHMS

The algorithms that deal best with a large database are the sort-based in-place multiway merge algorithm of section 4.4.2 and the text-based partitioning algorithm of section 4.5.3.

4.7 DYNAMIC DATABASE

Above, the author investigates indexing algorithms under the assumption that the database is static. However, a database is rarely truly static, so the problem of dynamic databases cannot be ignored. A database can be dynamic in one of two ways: text expansion or index expansion.

Conclusion of chapter 4

▪ Detailed analysis of the classic algorithms, the linked-list inversion algorithm and the sort-based inversion algorithm, showing their limitation that they are only suitable for small and medium-sized text document databases.

▪ Propose two algorithms: the sort-based in-place multiway merging algorithm and the text-based partitioning algorithm.

▪ Compare the inversion algorithms and conclude that the sort-based in-place multiway merging algorithm and the text-based partitioning algorithm are suitable for large text document databases in digital libraries.

▪ Examine the dynamic database problem in two ways: text expansion and index expansion.

CONCLUSION

The conclusions drawn from the thesis include:

1. The thesis proposes a formal model for digital libraries based on modern algebra: a digital library is a quadruple (R, MC, DV, XH), in which:

▪ R is a warehouse;

▪ MC is a metadata index;

▪ DV is a set of services containing at least indexing, searching, and browsing services;

▪ XH is a community of digital library users.

2. The thesis analyzes in detail the methods of indexing text documents in digital libraries, the inverted file index (IFID) method and the signature file index (SFID) method, compares the two indexing methods, and draws conclusions for document indexing in digital libraries: in most applications, IF performs better than SF in terms of both index size and query speed, and a compressed IF is undoubtedly the most useful indexing method for a large database of variable-length text documents. The thesis analyzes global compression models and the hyperbolic local compression model and, from there, proposes the Bernoulli local compression model and interpolative compression for the IFID, based on mathematical statistics, coding methods and data compression methods.

3. The thesis analyzes in detail the classic information search model based on the Boolean BQ query currently used in most library systems, pointing out the disadvantages of the BQ query. From there, the thesis proposes a text search model based on RQ ranking queries, with performance evaluation based on precision P and recall R.

4. The thesis analyzes in detail the classic algorithms, the linked-list inversion algorithm and the sort-based inversion algorithm, pointing out their limitation that they are only suitable for small and medium-sized text document databases. From that, the thesis proposes two algorithms, the sort-based in-place multiway merging algorithm and the text-based partitioning algorithm, which are suitable for large text document databases in digital libraries.

Future research directions

The author intends to conduct further research in the future:

1. Research on image search and indexing methods;

2. Researching video search and indexing methods;

3. Research the problem of summarizing and extracting text documents in digital libraries.

LIST OF WORKS

1. Do Quang Vinh, Quach Tuan Ngoc (2001), “A temporal object-oriented data model for structured documents”, Journal of Posts and Telecommunications & Information Technology, 160(6), pp. 29-32.

2. Do Quang Vinh (2005), “A model of compressed inverted file indexing in digital libraries”, Proceedings of the 8th National Conference on Selected Issues of Information and Communication Technology, Hai Phong, pp. 666-674.

3. Do Quang Vinh (2005), “Method of indexing documents in digital libraries”, Journal of Posts and Telecommunications & Information Technology, 265, pp. 40-47.

4. Do Quang Vinh (2005), “Summary and extraction of textual documents in digital libraries”, Journal of Science and Technology - Vietnam Academy of Science and Technology, vol. 43, no. 4, pp. 6-14.

5. Do Quang Vinh (2006), “A method of finding information based on BCH codes in digital libraries”, Journal of Science and Technology - Vietnam Academy of Science and Technology, vol. 44, no. 1, pp. 11-18.

6. Do Quang Vinh (2006), “Querying and ranking text documents in digital libraries”, Proceedings of the Ninth National Conference on Selected Issues of Information and Communication Technology, Da Lat.
