MINISTRY OF EDUCATION AND TRAINING

**HANOI UNIVERSITY OF TECHNOLOGY**

---------------------------------------

**DO QUANG VINH**

**RESEARCH METHODS OF DIGITALIZATION AND FINDING TEXT INFORMATION APPLICATION IN DIGITAL LIBRARY**

Specialization: Ensuring Mathematics for Computers

and calculation system

Code: 1.01.10

SUMMARY OF THESIS DOCTOR OF MATHEMATICS

HANOI - 2006

The work was completed at:

Hanoi University of Science and Technology

Science instructor:

1.Dr. QUALITY TUAN NGOC

2. Assoc. FEMALE PHUONG XUAN

Reviewer 1: Assoc.Prof.Dr. HO THUAN

Information Technology Institute

Reviewer 2: Assoc.Prof.Dr. DO TRUNG TUAN

Ha Noi national university

Reviewer 3: TSKH. NGUYEN MINH HAI

Post and Telecommunications Institute of Technology

The thesis will be defended before the State Thesis Judging Committee meeting at: Hanoi University of Science and Technology at the hour of May 2006.

Thesis can be found at the library:

1. Vietnam National Library.

2. Library of Hanoi University of Science and Technology.

**PREAMBLE**

**1. RESEARCH TASKS AND METHODS**

Urgency, theoretical and practical significance of the topic Today, the World Wide Web has penetrated into daily life, and, over the years, the interface for the Web has evolved from browsing to searching. Millions of people around the world perform Web searches every day, but the technology for searching large document databases has changed little since the 1980s. The general awareness of the Net ushered in a new technological revolution. searching for information in digital libraries (DL), which followed the hardware revolution in personal computers.

Currently, DL is one of the main research directions on information technology in the world.

The task of the thesis: Research methods of indexing and searching for applied text information in digital libraries.

Research methods: Multimedia database system; index methods; encryption methods; data compression methods; methods of information search; mathematical statistics and probability methods.

**2. THESIS STRUCTURE**

▪ Introduction: presents the tasks, objects, research methods and summarizes the main contributions of the thesis.

▪ Chapter 1 presents an overview of digital libraries, proposes a formal model for digital libraries based on modern algebra.

▪ Chapter 2 presents two main methods of indexing text documents in digital libraries, analyzing in detail the IFID inverted file indexing method, global compression models and compression models.

local hyperbolic IFID, proposed Bernoulli local compression model and IFID interpolation compression.

▪ Chapter 3 presents the classic information search model: Boolean query model BQ, proposes a query model to rank RQ documents in digital libraries, evaluates search performance based on two parameters: accuracy P and recovery R.

▪ Chapter 4 presents the classic algorithms: memory-based inversion, sort-based inversion, proposes in-place multipath merge algorithms based on sorting and text-based partitioning algorithm, compares Inverse algorithm, presenting dynamic database indexing problem.

▪ Conclusion: present the conclusions of the thesis and future research directions.

**CHAPTER 1 - OVERVIEW OF DIGITAL LIBRARY**

**1.1 INTRODUCTION**

Definition 1.1 (Arms WY) [31]: A digital library is an organized repository of information with associated services in which information is stored in digital form and accessible over a network.

Definition 1.2 (Chen H., Houston AL) [43]: A digital library is an entity concerned with the creation of resources and the operation of information over global networks. DL is an organized repository of digital information.

Definition 1.3 (Reddy R., Wladawsky-Berger I.) [121]: Digital libraries are network data stores for digital text documents, images, sounds, scientific data, and the software that is the core of the Internet. today and universally accessible digital repositories of all future human knowledge.

Definition 1.4 (Sun Microsystems) [135]: A digital library is an electronic extension of the functions typically performed by users and the resources users access in a traditional library. Information resources are converted to digital form, stored in multimedia repositories, and made available through Web services.

Definition 1.5 (Witten IH, Bainbridge D.) [154]: Digital libraries are repositories of digital objects, consisting of text, video, and audio, together with methods for accessing and searching, selecting, organizing and maintenance

In a nutshell, a digital library is a huge, organized repository of digital information with associated services over the network.

**1.2 BASIC CONCEPTS**

The author presents basic concepts in DL: Document databases, computers and networks.

**1.3 INFORMATION RESEARCH IN DIGITAL LIBRARY**

The author presents the main topics of informatics research in

DL: Object model, user interface, information retrieval, database management and maintenance, interoperability.

**1.4 FORM MODEL FOR DIGITAL LIBRARY**

**1.4.1 Mathematical foundations**

The author considers the necessary mathematical basis to develop a formal model for DL. Concepts include sets, relations, functions, sequences, tuples, strings, graphs, and grammars [1], [3], [4], [7], [8], [9],

[13], [144], [147], [150].

**1.4.2 Line**

Definition 1.14: A row is a sequence whose domain is a non-empty set.

**1.4.3 Structure**

Definition 1.15: A structure is a tuple (G,L,F), where G=(V,E) is a directed graph with a set of vertices V and an edge set E, L being a set of label values and F is a function labeled F:(VE)L.

**1.4.4 Space**

Definition 1.23: A space is a measure space, measure space, probability space, vector space, or a topological space.

**1.4.5 Scenario**

Definition 1.26: A scenario is a sequence of related state transition events (e1, e2, ... , en) on the state set S such that ek = (sk, sk+1) for 1 k n.

**1.4.6 Community**

Definition 1.29: A community is a set (C, R), where:

1. C = {c1 , c2, ... , cn} is a set of conceptual communities, each of which refers to a set of instances of the same class or type;

2. R = {r1 , r2, ... , rn} is a set of relations, each relation is

a set rj = (ej, ij) where ej is a product of Cartesian ck1 x ck2 x ... x

n

ck

j

, 1 k1 < k2 < ... < knj n, specify the communities involved

into a relationship, and ij is an operation (see definition 1.26) that describes interaction or communication between instances.

**1.4.7 Definition of digital library form**

Here, the author approaches the problem by defining a "minimum" digital library, that is, the minimum set of elements that make up a digital library.

Definition 1.35: Let C be a database with controller H. An MC metadata index for C is a pair set {(h,

{mc1, mc2, ... , mckn})}, where h H and mci are descriptive metadata.

Definition 1.36: Let C be a database with controller H. A repository is a tuple (R, gt, st, dl), where R 2C is a database family (including C ) and functions gt, sto and dl satisfy:

1. gt:LEARNING maps a controller h to a numeric object gt(h).

2. sto:CxRR maps (do, C ) to the extended database {do}C .

3. dl:HxRR maps (h, C ) to a database smaller than C -{gt(h)}.

Definition 1.37: An index I : T 2H is a function, where T is an index space on an index term set and H is a set of controllers. An index service implements an index.

Definition 1.38: Let Q be a set of information needs of the user, often called a query. Let MI : Q x C R is a matching function, defined by an index I, that associates a real number with a query q Q and a numeric object do C, which represents the query representation. how well it matches the numeric object, both in terms of structure and content. A search service is a set of search scenarios {sc1, sc2, ... , sct}, where for each query q Q there is a search scenario sck = such that e0 is the start event The first is caused by a q query, and the en event is the last one that returns the matching function values MI(q, d) for all d C.

Definition 1.40: A browsing service is a script set {sc1,

... , scn} on a hypertext (i.e. events defined by the edges of the hypertext graph (VH,EH)), such that the browsing association event ei is associated with a function TL: VH x EH

ND, given a node and a search link for the target node's contents, that is, TL(vk,eki)=P(vt) for eki=(vk,vt)EH.

Definition 1.41: A digital library is a set of four (R, MC, DV, XH) , where:

▪ R is a warehouse;

▪ MC is a metadata index;

▪ DV is a set of services containing at least indexing, searching, and browsing services;

▪ XH is a community of digital library users.

**Conclusion Chapter 1**

▪ Presenting an overview of tourism and informal definitions of tourism by different authors in the world.

▪ Propose a formal model for DL based on algebra

modern: a digital library is a set of four (R, MC, DV, XH).

**CHAPTER 2 - INDICATORS OF TEXTILE DOCUMENTS**

**2.1 INTRODUCTION**

For DL, we're talking big data, millions of pages of unstructured text. Without an available, accurate, and complete index, information retrieval is likely to fail.

The author tested on TREC database (Text REtrieval

Conference). This is a very large document database, with a total of more than 2070.29 MB of documents and 741856 documents.

**2.2 IFID ISLAND FILE INSTALLATION**

Definition 2.2 (Do Trung Tuan [17]): Index/Index is a data table or data structure used to determine the position of rows in a file under certain conditions.

Definition 2.3 (Folk MJ, Zoellick B., Riccardi G. [6]): An index is a way of finding information.

Definition 2.4: An index is a mechanism for locating a given term in a document [22].

In text applications, the simplest suitable structure is an inversion file (IF)/table of contents file.

Definition 2.5 (IFID inversion file index): For each term in the dictionary, an IF contains an inversion list **(** IL **) ** which stores a list of pointers to all occurrences of that term in the text. main, where each pointer is in fact the number of documents in which the term appears. IL is sometimes thought of as a list of contents and pointers as the table of contents.

Table 2.2 - Sample documents; Each line is a document.

DOCUMENT | DOCUMENT |

first | Information retrieval is searching and indexing |

2 | Indexing is building an index |

3 | An inverted file is an index |

4 | Building an inverted file is indexing |

Maybe you are interested!

- Research on methods of indexing and searching for applied text information in libraries - Do Quang Vinh - 2

Table 2.3 - IF for text of table 2.2

Number | Terms | IL(document; location) |

first | an | (2;4), (3;1), (3;5), (4;2) |

2 | and | (1;5) |

3 | building | (2;3), (4;1) |

4 | file | (3;3), (4;4) |

5 | index | (2;5), (3;6) |

6 | indexing | (1;6), (2;1), (4;6) |

7 | information | (1;1) |

8 | inverted | (3;2), (4;3) |

9 | is | (1,3), (2,2), (3,4), (4,5) |

ten | retrieval | (1;2) |

11 | searching | (1;4) |

**2.3 DIGITAL SIGNED FILE INDICATORS SFID**

SFID is another indexing method.

▪ Digital signature file: SF is a probabilistic method for text indexing. Each document has an associated digit/or descriptor, a bit string that captures the document's content in some sense.

▪ Bitlice digital signature file: SF access can be increased faster by using bitlicing technique, i.e. bit-matrix transposition technique

**2.4 COMPARISON OF BINDING METHOD**

The author compares two methods of document primary indexing in DL: IFID inversion file indexing and SFID digitally signed file indexing. From this, the author derives the rule of document indexing in DL that: In most applications, IF performs better than SF in the range of both index size and query speed. Compressed IF is undoubtedly the most useful indexing method of a large database of variable length text documents.

**2.5 IFID . COMPRESSION MODELS**

**2.5.1 Raise the problem**

Compressed IF is the most useful indexing method of large databases of variable length text documents in DL. The size of an IF is significantly reduced by compression. Here, the author investigates models and encryption methods to compress IFID database of documents in DL.

The key to the compression problem is to consider every possible IL

stored as an increasing sequence of integers, with no loss of generality.

**2.5.2 Global compression models**

**2.5.2.1 Non-parametric model**

The simplest global code is a fixed representation of positive integers. Shannon's relationship between the ideal code length lx and the probability P[x] is as follows [144]:

lx = - log P[x] (2.3)

allows the probability distribution implied by the particular coding method to be determined.

**2.5.2.2 Global Bernoulli Model**

An obvious way to parameterize the model and possibly get better compression is to use the actual density of the pointers in the IF. Assume the total number of pointers f to be stored is known in advance. Divide f by the number of index terms and then by the number of documents, considering a probability of f /(Nn) being any randomly selected document containing any randomly selected term. Then, the pointer occurrence can be modeled as a Bernoulli process with this probability, assuming that the pointers f of IF are randomly selected from

nN possible word-document pairs in the database.

**2.5.3 Local compression models**

**2.5.3.1 Local hyperbolic model**

Another probability for a local model is to use a hyperbolic distribution [124], where the probability P[x] of a gap x is

P[x] = / x, for x = 1, 2, ..., m. (2.10)

The typical method offers better performance than the Bernoulli model but is more complicated to implement and requires the use of arithmetic encoding, so it does not offer the same decoding performance [124].

**2.5.3.2 Local Bernoulli Model**

If the frequency ft of the term t is known in advance, a Bernoulli model per separate IL can be used. The Golomb cipher is again less computationally demanding than the arithmetic encoding and for analog compression.

To mine the pattern, it is necessary to store the parameter ft with each IL, so that the exact value of b can be used during decoding. The total exercise price is small. Each compressed IL is easily prefixed with a code for ft – the code is a good choice because most frequencies can be expected to be small.

**2.5.3.3 Deviated Bernoulli model**

As with the code, the vector for the Golomb code is VG = and because the bucket sizes are all used up, a large amount of skewed symmetry of the distribution is lost. Therefore, the local Golomb code only performs at the edge better than the global and code.

**2.5.3.4 Interpolation compression model**

Although promoted as a coping mechanism for word clustering, the VT code is still a static code and is equivalent to a zero-order model for d-gap. Using a higher order model also allows clustering-sensitive compression because a small d-gap sequence is clear evidence that the next d-gap is also small. A mechanism is assumed that the parameter b used for each d-gap is equal to the average of some number of previously decoded d-gap. While attractive in theory, the secondary compression benefit is often small, and because there are more cases to be controlled, the implementation is more complex. A more subtle way in which it is possible to compress each subgroup-sensitive IL.

**2.5.4 Performance of index compression models**

Local models tend to perform better compression than global models and are not more efficient in the processing time required during decoding, as they tend to be more complex implementations. For practical purposes, the most suitable index compression model is the local Bernoulli model, implemented using the Golomb encryption technique.

Table 2.7 - IF compression by number of bits/pointers for TREC

Paradigm | Number of bits/pointers |

Global models | |

Monad | 1918 |

Binary | 20.00 |

Bernoulli | 12.30 |

| 6.63 |

| 6.38 |

Local models | |

Hyperbola | 5.89 |

Bernoulli | 5.84 |

Bernoulli deviated | 5.44 |

interpolate | 5.18 |

**2.6 EFFECTS**

The author considers the effects that affect the index of text documents in the DL: font aggregation, word origin, omitted words [31], [94],

[102], [154].

**Conclusion of chapter 2**

▪ Analyze in detail the two main methods of indexing text documents in DL: IFID inversion file indexing and SFID digitally signed indexing

▪ Compare the two indexing methods IFID and SFID, from which, the rule of document indexing in DL is drawn.

▪ Analyze two global compression models: non-parametric compression model and Bernoulli global compression model. Next, the thesis analyzes in detail the local hyperbolic compression model, thereby proposing Bernoulli local compression and interpolation compression models for IFID.

▪Analysis of effects affecting index size

IFID island file: Combine text, trace word origin, omit word.

**CHAPTER 3 - FINDING INFORMATION**

**3.1 INTRODUCTION**

The author investigates two types of queries. The first is a traditional Boolean (BQ) query. The second is the rank query (RQ).

**3.2 BOOLE QUESTIONS**

The simplest query type is the BQ, where the terms are combined with the AND, OR, and NOT operations [31], [45], [48], [74], [82], [83], [86], [102], [126], [130], [145], [154],

[159]. The query process using an IFID is relatively direct. Vocabulary searched for each term; each IL is searched and decoded; and the lists are merged, intersected, matched, or offset as appropriate. Finally, such indexed documents are searched and displayed to the user as a list of answers.

**3.2.1 Query the Board of Directors**

The author investigates in detail the process of the council's committee. Assume the query is an association, consisting of terms connected to the AND operation as follows: t1 AND t2 AND ... AND tr and an assembly with r terms being processed.

**3.2.2 Unmatched BQ query**

So far, the author has only considered the type of board of directors. The other common form is an association of selections, where a selection number is specified for each of its components, essentially a union: (text OR data OR information) AND

(search OR seek) AND

(retrieval OR indexing)

**3.3 RQR RANKED QUESTIONS**

To date, most of the IR information retrieval systems available in the library use Boolean BQ queries, but incorrectly handle complex, non-assembled Boolean queries. BQ is not the only method of finding information. If a certain subset of the exact documents being searched is known in advance, then the BQ is certainly relevant, which is why the BQ is successful in bibliographic search systems. However, information requests are often less precise. Therefore, it is sometimes useful to be able to specify a term list that indicates well the documents involved, even though they do not need to be all present in the document search. Here, the study author assigns a similarity to each document in a way that requires a close match of a query.