Integrating word morphology information into English-Vietnamese statistical machine translation system - 1



HO CHI MINH CITY NATIONAL UNIVERSITY

UNIVERSITY OF NATURAL SCIENCES



NGUYEN THI NGOC MAI



INTEGRATING MORPHOLOGICAL INFORMATION INTO ENGLISH - VIETNAMESE STATISTICAL MACHINE TRANSLATION SYSTEM


MASTER'S THESIS IN COMPUTER SCIENCE


Ho Chi Minh City - 2010


TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

CHAPTER 1: INTRODUCTION

1.1. Problem statement

1.2. Approach of the thesis

1.3. Content of the thesis

CHAPTER 2: OVERVIEW

2.1. Statistical machine translation

2.1.1. Word-based statistical machine translation

2.1.2. Phrase-based statistical machine translation model

2.1.3. Factored statistical translation model (Factored SMT)

2.1.4. Syntax-based statistical machine translation model

2.2. Metrics for evaluating translation quality

2.2.1. BLEU (Bilingual Evaluation Understudy)

2.2.2. NIST

2.2.3. TER (Translation Error Rate)

CHAPTER 3: DIRECTIONS FOR INTEGRATING LINGUISTIC KNOWLEDGE INTO STATISTICAL MACHINE TRANSLATION

3.1. Using linguistic knowledge for preprocessing

3.1.1. Using syntactic information

3.1.2. Using part-of-speech information

3.1.3. Using word morphology transformation rules

3.2. Integrating knowledge into machine translation systems

3.2.1. Integrating morphological information into the translation model

3.2.2. Integrating syntactic information into the translation model

3.2.3. Integrating into the language model

CHAPTER 4: MODEL OF THE THESIS

4.1. Integrating English word morphology information

4.1.1. Part-of-speech information

4.1.2. Word inflection information

4.1.3. Using word order transformation rules

4.2. Adding Vietnamese word morphology information

4.2.1. Word boundary information

4.2.2. Part-of-speech information

4.3. Adding word morphology information to both English and Vietnamese

CHAPTER 5: EXPERIMENT AND EVALUATION

5.1. Corpus

5.2. Tools

5.3. Experiment

5.3.1. Integrating morphological information in English sentences

5.4. Summary of experimental results

CHAPTER 6: CONCLUSION

REFERENCES

APPENDIX

A. Contrastive English-Vietnamese morphology (declension)

B. Translation results of some models


LIST OF TABLES

Table 2.1. Representation of word alignment in tabular form

Table 5.1. Information about the corpus

Table 5.2. Translation results of systems integrating word morphology information into English sentences

Table 5.3. Translation results of the word order conversion systems

Table 5.4. Translation results of systems integrating word morphology information into Vietnamese sentences

Table 5.5. Number of word alignment links of the models

Table 5.6. Translation results of the translation system integrating Vietnamese parts of speech

Table 5.7. Translation results of systems integrating morphological information into English and Vietnamese sentences


LIST OF FIGURES

Figure 2.1. Statistical machine translation model

Figure 2.2. Representation of word alignment in link form

Figure 2.3. Illustration of the process of improving word alignment

Figure 2.4. Example of phrase-based statistical translation

Figure 2.5. Factored SMT translation model

Figure 4.1. General model of the thesis

Figure 4.2. Lexical language model

Figure 4.3. Language model with parts of speech

Figure 4.4. Factored SMT model integrating part of speech

Figure 4.5. Factored SMT model integrating base form and part of speech

Figure 4.5. Factored SMT model integrating word morphology information

CHAPTER 1: INTRODUCTION


1.1. Problem statement

Machine translation, also known as automatic translation, has long attracted attention and continues to do so today. Researchers exploit the computing power of computers to build applications that serve people in an era of rapid information technology development. When fast communication and quick access to information create many opportunities for success, automatic translation programs help people overcome language barriers, converting between languages quickly and with less effort. Machine translation is therefore an active field that attracts many research groups around the world. However, each language is itself complex and often ambiguous. Moreover, languages always differ from one another, from vocabulary to the structures used to form sentences. Building a machine translation system that can understand context, resolve ambiguity, and translate nearly as well as a human remains a major challenge.

For Vietnamese, many groups are currently developing translation systems using a variety of approaches:

- Research group of Associate Professor, Dr. Dinh Dien (University of Science - Ho Chi Minh City National University): The group's research project is based on learning conversion rules from bilingual corpora.

- Research group of Associate Professor, Dr. Phan Thi Tuoi (Ho Chi Minh City University of Technology): The group uses a probabilistic parsing method to translate English-Vietnamese and Vietnamese-English texts.

- Research group of Dr. Le Khanh Hung, Softex (Software Technology Department - Institute of Technology Application - Ministry of Science and Technology of Vietnam): the translation system has been put into practical use and commercialized (http://vdict.com). EVTRAN is an entirely rule-based machine translation system that uses hand-built rules to translate text from English to Vietnamese. Since 2006, EVTRAN 3.0 (called Ev-Shuttle) has been able to translate in both directions, English-Vietnamese and Vietnamese-English. Because the system is rule-based, the translation result depends largely on whether the input sentence matches the established rules.

- The ERIM project group of Danang University of Technology in collaboration with GETA - Grenoble University of Technology, tested English-Vietnamese and French-Vietnamese translation by Doan Nguyen Hai (http://www.latl.unige.ch/vietnamese/) at LATL.

- Google Translate (www.translate.google.com): Supports more than 50 languages, including Vietnamese. It uses a statistical machine translation method based on bilingual corpora. It translates quickly and has a user interaction feature for improving the quality of later translations.

- Machine translation on Xalo.vn (www.dich.xalo.vn): a one-way online translation service from English to Vietnamese, developed by Tinh Van Technology Joint Stock Company. It supports translation by subject field and allows users to edit and comment on translations to improve translation quality.

- Lac Viet (the company that developed and launched the Lac Viet dictionary, www.vietgle.vn/tratu/dich-tu-dong): supports only English-to-Vietnamese translation, with additional specialized translation (IT, mathematics, medicine, and accounting), and lets users help improve the translations.

Because they are built on different models, the systems produce different translation quality, depending on the input sentence format.

Rule-based systems translate quite effectively because they use linguistic knowledge such as syntactic and semantic information. However, it is difficult for computers to accurately parse sentences with complex semantics. In addition, it is difficult to build a set of syntactic rules and transformation rules that covers all cases, and doing so requires the implementer to have deep knowledge of the language.

In contrast, Statistical Machine Translation (SMT) is entirely based on statistical results from bilingual corpora. The intermediate results of
