Vietnamese linking grammar model

Natural language processing is one of the hard problems of information technology. Research on it began in the 1940s, right after the appearance of electronic computers. Although it started later, Vietnamese language processing has developed strongly in recent years, driven by the explosion of information on the Internet and the resulting demand for search, document translation, information dissemination, training, teleconferencing… The number of researchers in the field has increased rapidly, working in both major directions: speech processing and text processing. Given the scope of the topic, the thesis deals only with a number of related issues in the text processing branch.

Parsing is an important step toward solving many other problems, so Vietnamese parsers were built very early. The first were context-free grammar parsers using traditional methods: the CYK parser of Le Thanh Huong and colleagues [12], and the Earley parsers of Phan Thi Tuoi [27] and of Nguyen Gia Dinh and colleagues [5]. To handle ambiguity, Le Thanh Huong’s group used a lexicalized probabilistic context-free grammar [22] and a head-driven phrase structure grammar [15]. Many other grammar models have also been built for Vietnamese in order to expand the class of languages represented: the lexicalized tree-adjoining grammar built by Nguyen Thi Minh Huyen and colleagues [20], [101] can represent a class of context-sensitive languages, and the phrase structure and unification grammar used by Tran Ngoc Tuan’s group [26], [122], [123] can represent the largest class in the Chomsky hierarchy [63]: type-0 languages.

Automatic translation is a difficult field with great practical application. Vietnamese researchers have tested a number of automatic translation systems based on different approaches. One is VCLEVT of the University of Natural Sciences, Vietnam National University, Ho Chi Minh City, which uses the BTL approach: learning transfer rules from a bilingual corpus [3]. The first Vietnamese translation system to be commercialized was Nacentech’s EVtran – VEtran, which follows a rule-based approach [10]. Another translation system of good quality is Vietgle by Lac Viet, which specializes in English-Vietnamese translation. Other machine translation systems include the LVT system of Vietnam National University of Technology, Hanoi [93], and the Vietnamese-English statistical machine translation system using probabilistic parsing of Ho Chi Minh City University of Technology [124]. Google Translate, which takes a statistical approach over Google’s huge corpus, must also be mentioned. In general, automatic translation products work mainly in the English-Vietnamese direction. Vietnamese-English translation systems are still limited in both quantity and quality.

Regarding text mining on the Internet, Vietnamese researchers have worked on text representation, such as Ho Tu Bao [29], [33]; on web mining and the semantic web, such as Cao Hoang Tru [117] and Ho Tu Bao [63]; and on text summarization, such as Le Thanh Huong [66] and Ha Thanh Le’s group [15]… However, little of this research targets Vietnamese text; examples that do are the text summarization system of Ha Thanh Le’s group [15] and the Vietnamese website content extraction system of Do Phuc’s group [19].

Because of the way Vietnamese words are structured, word segmentation and part-of-speech tagging are mandatory preprocessing stages in Vietnamese language processing systems. The word segmentation tool vnTokenizer was developed by Nguyen Thi Minh Huyen, Le Hong Phuong and colleagues, using a finite automaton combined with regular expression analysis to identify word strings [102]. Ambiguity is resolved by a heuristic algorithm that prefers the segmentation containing the longest words. This method achieves high accuracy on the sample corpus (over 98.5%) [116]. The JVnSegmenter word segmenter of Phan Xuan Hieu’s group [121], which uses CRF and SVM, reached 94%. In addition, Le An Ha’s word segmenter [60] is based on maximizing probability (maximum likelihood). The part-of-speech tagging problem is often solved together with the word segmentation problem. Alongside JVnSegmenter, its authors built the tagger JVnTagger using CRF and maximum entropy [7]. vnTokenizer also comes with vnQTG [13]. Some studies by Vietnamese authors also focus on word sense disambiguation, such as Le Anh Cuong [45], [46] and Dinh Dien [48].
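The longest-word preference described above can be illustrated with a greedy left-to-right longest-match segmenter. This is only a minimal sketch under stated assumptions: the lexicon is a hypothetical toy, the function name and the four-syllable limit are invented for the example, and it is not the actual vnTokenizer algorithm, which combines an automaton with further disambiguation.

```python
# Toy lexicon (hypothetical); a real system would load a full Vietnamese dictionary.
LEXICON = {"học", "sinh", "học sinh", "sinh học", "giỏi"}
MAX_WORD_SYLLABLES = 4  # assumed upper bound on syllables per word


def segment_longest_match(sentence, lexicon=LEXICON, max_len=MAX_WORD_SYLLABLES):
    """Greedy left-to-right segmentation that always takes the longest known word."""
    syllables = sentence.split()
    words = []
    i = 0
    while i < len(syllables):
        # Try the longest candidate first and shrink until a lexicon entry matches.
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = " ".join(syllables[i:i + n])
            # A single syllable is always accepted as a fallback (unknown word).
            if n == 1 or candidate in lexicon:
                words.append(candidate)
                i += n
                break
    return words


print(segment_longest_match("học sinh giỏi"))
print(segment_longest_match("học sinh học sinh học"))
```

On the second input the greedy strategy groups the syllables as "học sinh", "học sinh", "học", whereas the intended reading is "học sinh / học / sinh học" (students study biology). This kind of overlap ambiguity is why a longest-match heuristic alone is not enough and statistical segmenters are also used.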

Corpora are a very important resource in Vietnamese language processing. The state-level projects KC.01-03 and KC.01.01/06-10 collected a Vietnamese corpus from electronic articles. So far, a corpus of 1 million syllables has been segmented into words, 10,000 sentences have been part-of-speech tagged, and a Vietnamese treebank of 10,000 parsed sentences has been built. These are major contributions that significantly facilitate research on automatic Vietnamese language processing.

Regarding bilingual materials: the body of bilingual books and newspapers is considerable. However, it is hard to use for automatic processing, because preprocessing operations such as sentence-level and word-level alignment have not been performed. Among electronic English-Vietnamese bilingual corpora (with one-to-one translation and linguistic annotation), Cao Hoang Tru’s EVC corpus was the first officially published in Vietnam [24], [25], while the bilingual corpus of Dinh Dien’s group was first published abroad [47]. Dinh Dien has also made a detailed study of building and exploiting an annotated English-Vietnamese bilingual corpus [48]. Other results on building corpora for text processing come from the groups of Nguyen Thi Minh Huyen [36], [37] and Phan Huy Khanh [73]. The project KC.01.01/06-10 collected 100,000 English-Vietnamese bilingual sentences with sentence-level alignment, of which 20,000 are in the field of informatics and 80,000 are in the economic and social fields. Vietnamese-English bilingual corpora are still poor, and no significant sample corpus exists.

Some electronic dictionaries have been built, mainly for lookup on computers, but most of them have not been used in automatic processing. The most significant is the Vietnamese dictionary of the project KC.01.01/06-10 [16], built on the LMF model with three packages: morphology, syntax, and semantics. The dictionary presents fairly comprehensive lexical and syntactic information. Some bilingual dictionaries are available free of charge, such as the English-Vietnamese dictionary of the project KC.01.01/06-10 with nearly 60,000 entries and the Vietnamese-English dictionary from the same project with more than 11,000 entries. Ho Ngoc Duc’s English-Vietnamese dictionary has 110,000 entries, and his Vietnamese-English dictionary has 23,000 entries.


The above is part of the picture of research on automatic text processing for Vietnamese, which has developed significantly in recent times. Compared with English, other European languages, or Chinese, Japanese and Korean, the resources for processing Vietnamese are still poor. Despite the current dominance of machine learning and statistical methods, few studies dispense with syntactic representation models entirely. The syntactic structure of the source text as well as the target text is used in the translation systems of Dinh Dien’s group [3], the group at Ho Chi Minh City University of Technology [124], and the research group at JAIST [115]. Combining statistical learning methods with syntactic representation gives products of much better quality, for example in machine translation [115]. Thus, syntax representation remains a very important issue in Vietnamese language processing.

The context-free grammar model is the most popular model for representing Vietnamese syntax, with parsing by the well-known CYK and Earley methods [12], [27], [5]. This model is also used in some machine translation systems [124].

Dividing words into classes without regard to their lexical features, as a classical context-free grammar does, can cause the parser to accept many sentences that are never used in practice, e.g. the Vietnamese sentence glossed “I bought two paddy”. This sentence does not exist in Vietnamese because the word for “paddy” never directly follows a numeral. This phenomenon is also very common in other languages. The lexicalization of grammars is therefore of interest to many researchers. Many lexicalized grammar models have been built for natural languages, such as lexicalized context-free grammar, lexical functional grammar, head-driven phrase structure grammar, lexicalized tree-adjoining grammar, combinatory categorial grammar, link grammar… The lexicalization trend has also reached Vietnamese grammars. Probabilistic lexicalized context-free grammar models [22] and lexicalized tree-adjoining grammars [20] have been developed for Vietnamese. However, only a few grammars, such as combinatory categorial grammar and link grammar, are completely lexicalized, that is, each lexical item has its own rules [112]. A fully lexicalized model makes it possible to specify many syntactic and lexical exceptions of Vietnamese.
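The over-acceptance described above can be reproduced with a small CYK recognizer. The sketch below is illustrative only: the grammar, the category names and the English gloss tokens (including the countable noun "chickens") are assumptions made for the example, not rules taken from any of the cited Vietnamese grammars.

```python
from itertools import product

# Toy grammar in Chomsky normal form (hypothetical).
# Binary rules: (left child, right child) -> possible parents.
BINARY_RULES = {
    ("NP", "VP"): {"S"},
    ("V", "NP"): {"VP"},
    ("Num", "N"): {"NP"},
}
# Lexical rules assign each word a class; "paddy" and "chickens" are both plain N.
LEXICAL_RULES = {
    "I": {"NP"},
    "bought": {"V"},
    "two": {"Num"},
    "paddy": {"N"},
    "chickens": {"N"},
}


def cyk_accepts(words, start="S"):
    """Return True if the toy grammar derives the word sequence (CYK recognition)."""
    n = len(words)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive words[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, word in enumerate(words):
        table[i][i] = set(LEXICAL_RULES.get(word, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):  # split point between the two children
                for left, right in product(table[i][k], table[k + 1][j]):
                    table[i][j] |= BINARY_RULES.get((left, right), set())
    return start in table[0][n - 1]


print(cyk_accepts("I bought two chickens".split()))
print(cyk_accepts("I bought two paddy".split()))
```

Both calls succeed, because the rule NP → Num N only sees the class N and cannot tell a countable noun from a mass noun. Blocking the second sentence in a class-based grammar means splitting N into subclasses, which enlarges the nonterminal set; a lexicalized grammar instead attaches the restriction to the word itself.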

The large size of the nonterminal symbol set makes sentence analysis with context-free grammars complicated. As a result, using parse trees for other purposes such as machine translation or language generation requires many processing steps across the hierarchical levels of the tree. Moreover, finding the relationship between two words in a sentence under the context-free model means covering a large distance in the tree, sometimes tracing the connections all the way to the root node, which takes considerable time. In Vietnamese, the relationship between words is in many cases extremely important because it can give information about the number of nouns, tenses, verb forms, or many other types of relationships such as property relations, material relations…

The dependency approach is now the dominant trend in syntax representation. The first advantage of dependency grammar is that it has no nonterminal symbol set. A dependency tree shows the direct relationships between words in a sentence and is much simpler than a phrase structure tree. With labeled dependencies, the dependency model encodes the predicate-modifier structure directly. It is therefore possible to translate (understand) each segment of a sentence separately.
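The contrast between the two representations can be made concrete by counting how many tree edges separate two words in a phrase structure tree, compared with reading a relation off a dependency tree. This is a rough sketch: the sentence is an English gloss, and the tree shapes, relation labels and function names are assumptions chosen for illustration.

```python
# Phrase structure tree as nested tuples (label, child, ...); leaves are words.
PS_TREE = ("S",
           ("NP", "I"),
           ("VP", ("V", "bought"),
                  ("NP", ("Num", "two"), ("N", "books"))))

# Dependency tree: each word maps to (head word, relation label); labels are hypothetical.
DEPENDENCIES = {
    "bought": (None, "root"),
    "I": ("bought", "subject"),
    "books": ("bought", "object"),
    "two": ("books", "numeral"),
}


def leaf_path(tree, word, path=()):
    """Return the child-index path from the root to the leaf `word`, or None."""
    if isinstance(tree, str):
        return path if tree == word else None
    for index, child in enumerate(tree[1:]):
        found = leaf_path(child, word, path + (index,))
        if found is not None:
            return found
    return None


def tree_distance(tree, a, b):
    """Number of edges on the path between leaves a and b via their common ancestor."""
    path_a, path_b = leaf_path(tree, a), leaf_path(tree, b)
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return len(path_a) + len(path_b) - 2 * common


def direct_relation(deps, a, b):
    """Return the label of the arc linking a and b, or None if they are not linked."""
    if deps[a][0] == b:
        return deps[a][1]
    if deps[b][0] == a:
        return deps[b][1]
    return None


print(tree_distance(PS_TREE, "bought", "books"))
print(direct_relation(DEPENDENCIES, "bought", "books"))
```

In this toy tree the verb and its object are five edges apart in the phrase structure representation, while the dependency representation links them with a single labeled arc.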

In the non-projective dependency grammar model, the dependency structure is independent of word order, which suits languages with free word order very well. The dependency grammar model also works well for languages with fairly strict word order. Parsers built on the dependency model have therefore been developed for most of the widely used languages in the world, starting with the English parsers of Collins [44] and the Stanford dependency parser. Dependency parsers have also been built for other languages: French by Candito [39], [40], Russian by Boguslavsky [98], Chinese by Lai Bong Yeung Tom and Changning Huang [118], Japanese by Matsumoto and colleagues [99], [125], and Korean by So Young Kwon [78]. Many Southeast Asian languages also have dependency parsers, such as Indonesian with the parser of Kamayani and Purwarianti [72], Thai with Tongchim’s parser [119], and Tagalog (Philippines) with the parser of Manguilimotan and Matsumoto [85]. The dependency grammar model is also very useful for applications such as text summarization [91], [108], information extraction [42], machine translation [49], [55]…
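Whether a given dependency tree is projective can be checked directly from its head indices: a tree is projective when, for every arc, all words lying between the head and the dependent are descendants of that head. The following sketch assumes a well-formed tree (single root, no cycles); the head encoding and both example analyses are illustrative assumptions rather than output of any cited parser.

```python
def is_projective(heads):
    """heads[i] is the 1-based position of the head of word i+1; 0 marks the root.

    Assumes the heads describe a well-formed tree (no cycles).
    """
    def is_descendant(node, ancestor):
        # Climb from node toward the root and look for the ancestor on the way.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return False

    for dependent in range(1, len(heads) + 1):
        head = heads[dependent - 1]
        if head == 0:
            continue
        low, high = sorted((head, dependent))
        # Every word strictly inside the arc must be dominated by the head.
        for between in range(low + 1, high):
            if not is_descendant(between, head):
                return False
    return True


# "I bought two books": I->bought, bought=root, two->books, books->bought
print(is_projective([2, 0, 4, 2]))

# "A hearing is scheduled on the issue today", under one common analysis in which
# "on" attaches to "hearing", so that arc crosses over the root "is".
print(is_projective([2, 3, 0, 3, 2, 7, 5, 4]))
```

The first tree is projective and the second is not. A parser restricted to projective trees cannot produce the second analysis, which is why non-projective models are attractive for languages with freer word order.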

