Approaches to the Word Separation Problem

Natural Language Processing Foundation


2.2.1. Semantic search


Semantic search describes search engines that can understand natural-language queries and go beyond that by understanding the searcher's context at the moment the query is typed. A simple example: for the query “jaguar fuel consumption”, there is a high chance that “jaguar” refers to the car rather than the animal.

Key elements of semantic search:


- Understand natural language


- Context of the query stream


- User context


- Entity recognition



Figure 11: Four important factors in Semantic search


A better understanding of user intent maximizes the chances that users have the best possible search experience. That is how the world's best-known search engines, such as Google, Bing and Baidu, bring you the information closest to what you need.

2.2.2. TF-IDF


TF-IDF stands for term frequency – inverse document frequency, which is an index to evaluate the importance of a word to a text or a document set. TF-IDF is often used in information retrieval and data mining problems [2] [3] [4].

2.2.2.1. TF – term frequency


This is the frequency of a word appearing in a text, calculated using the formula:



tf(t, d) = f(t, d) / max{f(w, d) : w ∈ d}

where:


- f(t,d): number of times word t appears in text d


- max {𝑓(𝑤, 𝑑) ∶ 𝑤 ∈ 𝑑} : the maximum number of occurrences of any word in the text

- the value of tf(t,d) will be in the range [0, 1]
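As a minimal sketch (my own illustration, not part of the cited works), the normalized term frequency above can be computed as:

```python
from collections import Counter

def tf(term, doc_tokens):
    """Term frequency of `term` in a tokenized document, normalized by
    the count of the most frequent word, so the value lies in [0, 1]."""
    counts = Counter(doc_tokens)
    if not counts:
        return 0.0
    return counts[term] / max(counts.values())

doc = "the cat sat on the mat".split()
print(tf("the", doc))  # 1.0 ("the" is the most frequent word)
print(tf("cat", doc))  # 0.5
```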


2.2.2.2. IDF – inverse document frequency


This is the inverse frequency of a word across a corpus; its purpose is to reduce the weight of common words. Each word has a single IDF value per corpus, calculated using the formula:


idf(t, D) = log( |D| / |{d ∈ D : t ∈ d}| )


where:


- |D|: total number of documents in set D


- |{d ∈ D : t ∈ d}|: the number of documents containing the word t, i.e. documents d with tf(t, d) ≠ 0. If the word does not appear in any document, the denominator becomes 0 and the division is undefined, so in practice the denominator is replaced by 1 + |{d ∈ D : t ∈ d}|.
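The smoothed IDF described above can be sketched over a toy corpus (mine, for illustration only):

```python
import math

def idf(term, corpus):
    """Inverse document frequency over a corpus of tokenized documents,
    using the smoothed denominator 1 + |{d in D : t in d}| mentioned
    above so that unseen terms do not cause division by zero."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (1 + df))

corpus = ["the jaguar is a big cat".split(),
          "jaguar fuel consumption".split(),
          "the cat sat on the mat".split()]
print(idf("fuel", corpus))    # log(3/2): appears in only one document
print(idf("jaguar", corpus))  # log(3/3) = 0.0 with the smoothing
```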

2.2.2.3. TF-IDF value


tfidf(t, d, D) = tf(t, d) × idf(t, D)


Words with high TF-IDF values appear frequently in the given document and rarely in other documents. This filters out common words and keeps high-value words (the keywords of that document).
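Putting the two parts together (again an illustrative sketch with an invented toy corpus):

```python
import math
from collections import Counter

def tf(term, doc):
    # frequency normalized by the most frequent word in the document
    counts = Counter(doc)
    return counts[term] / max(counts.values()) if counts else 0.0

def idf(term, corpus):
    # smoothed inverse document frequency
    df = sum(1 for d in corpus if term in d)
    return math.log(len(corpus) / (1 + df))

def tfidf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = ["the jaguar is a big cat".split(),
          "the jaguar has poor fuel consumption".split(),
          "the cat sat on the mat".split()]
doc = corpus[1]
# "fuel" is rare across the corpus while "the" is everywhere,
# so "fuel" scores higher as a keyword of this document
print(tfidf("fuel", doc, corpus) > tfidf("the", doc, corpus))  # True
```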

2.2.3. Word segmentation


Word segmentation is the problem of splitting an input string of characters into independent words.


In English and some other languages that use the Latin alphabet, spaces separate words well. However, not all languages have such separators. In Thai and Lao, phrases and sentences are separated but words are not. In Chinese and Japanese, there are sentence separators but no word separators. In Vietnamese, spaces separate syllables rather than words. The word segmentation problem is therefore quite difficult.


For example:


| Language   | Input text                                     | Result                   |
|------------|------------------------------------------------|--------------------------|
| English    | I am looking for my pen.                       | I am looking_for my_pen  |
| Vietnamese | Học sinh học sinh học (“Students study biology”) | Học_sinh học sinh_học    |
| Japanese   | I will continue to work                        | Next                     |




Table 3: Word Segmentation in Different Languages

There are many approaches to solve this problem [5].


Figure 12: Approaches to the word segmentation problem


These methods are classified into three main groups:


- Dict-based: build a dictionary and split the input text into words found in that dictionary. The two most effective approaches in this group are Maximum Matching and Longest Matching.

- Statistical: relies on a very large labeled dataset. Popular methods include the N-gram Language Model [6], Hidden Markov Models (HMM) [7], Conditional Random Fields (CRFs) [8] and Maximum Entropy (ME) [9].

- Hybrid: combines different methods to exploit the advantages of each and limit their disadvantages. Many hybrid models have been published and applied to many different languages, combining dictionary-based techniques (Maximum Matching, Longest Matching), statistics-based methods (N-gram, CRFs, ME) and machine learning algorithms (Support Vector Machines - SVMs, Genetic Algorithms - GA) [10] [11] [12].
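To illustrate the dictionary-based group, a greedy longest-matching segmenter can be sketched as follows (the dictionary and `max_len` below are made-up toy values):

```python
def max_match(syllables, dictionary, max_len=4):
    """Greedy longest-matching segmentation: at each position, take the
    longest dictionary entry that matches; otherwise emit one syllable."""
    result, i = [], 0
    while i < len(syllables):
        for j in range(min(max_len, len(syllables) - i), 0, -1):
            candidate = "_".join(syllables[i:i + j])
            if j == 1 or candidate in dictionary:
                result.append(candidate)
                i += j
                break
    return result

# hypothetical toy dictionary of multi-syllable words
dictionary = {"looking_for", "my_pen"}
print(max_match("i am looking for my pen".split(), dictionary))
# ['i', 'am', 'looking_for', 'my_pen']
```

A known weakness of the greedy strategy is that an early long match can block a better overall segmentation, which is one reason statistical and hybrid methods exist.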

2.2.4. Part of speech tagging (POStag)


POSTag, also known as grammatical tagging, is the process of marking each word in a text with its corresponding part of speech, based on both the word's definition and its context: its relationship to surrounding and related words in phrases, sentences and paragraphs. For example, some word classes in English are nouns, prepositions, pronouns, conjunctions, verbs, adjectives and adverbs. One mandatory preprocessing step for POSTag is word segmentation.

The difficulty in POS tagging is handling ambiguity: choosing the right tag for the context. For example, the Vietnamese word “đá” is a noun (“stone”) in the sentence “This horse is made of stone”, but a verb (“kick”) in “The children are playing soccer”.

Some examples of POSTag:


- Input text: Học_sinh học sinh_học


- Result after labeling word types: Học_sinh/N học/V sinh_học/N (where /N is noun, /V is verb)
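A toy lookup tagger for the Vietnamese example “Học_sinh học sinh_học” (“Students study biology”) might look like this; the lexicon is hypothetical, and a real tagger (e.g. HMM- or CRF-based) would resolve ambiguity from context rather than a fixed table:

```python
# /N = noun, /V = verb; unknown words get a placeholder /X tag
lexicon = {"Học_sinh": "N", "học": "V", "sinh_học": "N"}

def tag(tokens, lexicon):
    """Attach a part-of-speech label to each already-segmented token."""
    return " ".join(f"{t}/{lexicon.get(t, 'X')}" for t in tokens)

print(tag(["Học_sinh", "học", "sinh_học"], lexicon))
# Học_sinh/N học/V sinh_học/N
```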


Support tools


2.3.1. VnCoreNLP


VnCoreNLP [13] is a Vietnamese language labeling toolkit, providing natural language processing tools such as: Word Segmentation, POS tagging, Named Entity Recognition and Dependency Parsing.

Features of VnCoreNLP:


- Accuracy: VnCoreNLP processes Vietnamese with high accuracy. On standard datasets, its results are higher than those of all previously published tools.

- Processes large datasets very quickly.

- Easy to deploy.



Figure 13: VnCoreNLP processing flow


2.3.2. Word2vec


Word2vec is a natural language processing technique. This algorithm uses a neural network model to learn associations from a large data set. After training a large enough set, this model can detect synonyms or can be applied to the problem of suggesting words for a word or a piece of text.

This technique assigns a vector value (a list of specific numbers) to each individual word. These vectors are calculated so that: the more similar two words are semantically, the higher the cosine similarity between the two vectors representing them.
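Cosine similarity between two such vectors can be computed as follows (the three-dimensional embeddings below are invented for illustration; real Word2vec vectors typically have 100-300 dimensions):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# made-up embeddings: "cat" and "kitten" point in similar directions
cat = [1.0, 0.9, 0.1]
kitten = [0.9, 1.0, 0.2]
car = [0.1, 0.0, 1.0]
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, car))  # True
```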

Word2vec has two models, skip-grams and CBOW:


- Skip-gram is a model that predicts the words surrounding a given word. For example, applying a window size of 3 to the sentence “I love you so much” yields training samples such as ((I, you), love), ((love, so), you) and ((you, much), so). Given the input word “love”, the model predicts the surrounding words “I” and “you”.

- CBOW (continuous bag of words) is the opposite of skip-gram: the input is the set of surrounding context words, and the model predicts the center word they relate to.

In our experiments, CBOW trains faster while skip-gram achieves higher accuracy; we apply only one of the two models when training the dataset.
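The (center, context) training pairs that skip-gram consumes can be generated as in this sketch (one word on each side of the center, i.e. a window size of 3 counting the center itself):

```python
def training_pairs(tokens, window=1):
    """Generate skip-gram (center, context) pairs for a symmetric window."""
    pairs = []
    for i, center in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(training_pairs("I love you so much".split()))
# the center "love" is paired with its neighbours "I" and "you"
```

Reversing each pair (context as input, center as target) gives the samples CBOW trains on.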

2.3.3. Elasticsearch


As introduced, Elasticsearch is an open source distributed search engine for all types of data including text, numeric, geospatial, structured and unstructured. Elasticsearch provides RESTful APIs to perform tasks on separate servers so it can be easily integrated with any system.

Chapter 3


Knowledge search system on Wikihow domain


Calculate similarity between two sentences


In this thesis, I propose a method to calculate the similarity between two sentences based on the Jaccard Similarity index.

The Jaccard index, also known as the similarity coefficient, can be used to calculate the similarity between finite sets of samples (in which the elements are not duplicated) and is defined as the size of the intersection divided by the size of the union of the sample sets.

The mathematical expression of the index is represented as follows:



J(A, B) = |A ∩ B| / |A ∪ B| = |A ∩ B| / (|A| + |B| − |A ∩ B|)


The value lies in the range [0, 1] and equals 1 when the two sets are identical.


This index can be applied to two sentences by treating each sentence as a set of its distinct words. However, simply checking whether each word of one set appears in the other to count the intersection does not handle synonyms. My point is that, to compute sentence similarity with Jaccard Similarity, the intersection must account not only for words that appear in both sentences character-for-character, but also for semantic similarity. I therefore propose a formula to calculate the similarity of two character strings X and Y as follows:
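As a baseline, the plain word-overlap Jaccard similarity (before any semantic extension) can be sketched as:

```python
def jaccard(sentence_a, sentence_b):
    """Jaccard similarity over the sets of distinct words of two sentences.
    Purely character-based matching: synonyms are NOT matched, which is
    exactly the limitation discussed above."""
    a, b = set(sentence_a.split()), set(sentence_b.split())
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

print(jaccard("I love my pen", "I lost my pen"))  # 3/5 = 0.6
```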
