Research Methods and Research Subjects

Data from this industry will continue to increase rapidly and is easily collected due to its availability on the Web. Data mining applications in the retail industry aim to build models to help identify customer purchasing trends, helping businesses improve the quality of products and services to increase customer satisfaction and retain customers well. Below are some applications of data mining in the retail industry:

Data mining on customer data warehouse.

Multidimensional analysis on customer data warehouse on sales, customers, products, time and region.

Analyze the effectiveness of sales and marketing campaigns.

Customer relationship management.

Introduce and advise suitable products to customers.

Telecommunications industry

The telecommunication industry is one of the emerging industries, providing many services such as mobile phones, internet, video transmission, etc. Due to the strong development of computer technology and computer networks, telecommunications is developing at a very fast pace. This is the reason why data mining becomes very important in this field.

Data mining in the telecommunication industry helps in identifying telecommunication patterns, detecting telecommunication frauds, better utilization of resources and improving the quality of telecommunication services. Some of the applications of data mining in this industry are as follows:

Multidimensional telecommunications data analysis.

Building fraud detection models.

Detecting anomalies in telecommunications transactions.

Analysis of customer telecommunications service usage behavior.

Using visualization tools in telecommunications data analysis.

Biodata Analysis

Biological data mining is a very important part of the field of Bioinformatics. Following are some applications of data mining in biology:

Indexing, searching for similarity, anomalies in Gen database.

Building models to mine genetic networks and the structure of genes and proteins.

Building visualization tools in genetic data analysis.

Detect illegal intrusion

Intrusion is an action that threatens the integrity, confidentiality, and availability of network resources. In the world of connectivity, security has become a major issue for the survival of the system. With the development of the internet and the availability of tools and techniques to assist in intrusion and cyber attacks, the requirement to control illegal access is very important to ensure the stability of the system.

Here are some applications of data mining that can be applied to intrusion detection:

- Develop data mining algorithms for intrusion detection.

- Association, correlation and difference analysis for intrusion detection.

- Analyze data streams to detect anomalies.

3. Mining the array of frequent closed itemsets

Finding association rules from millions of transactions in large databases is currently a difficult task in the field of data mining. Frequent itemsets and closed itemsets are the key to mining association rules. Association rules can be efficiently mined from the Closure of Frequent Itemsets.

Non-redundancy, precision, time and memory usage are the factors that need to be considered while developing an algorithm to find useful association rules. Building a closed itemset is to build a parent-child relationship (directly) between the frequent closed itemsets. Thus, it saves time when browsing the set to generate rules.

4. Scientific and practical significance of the topic

- The BVCL algorithm is more innovative than the CHARML algorithm as well as the sequential browsing algorithms of the item set in the following main points:

o The DSBV structure stores parent set information in bit form, so transmitting information to sets in the array when calling recursively is fast.

o The way the subsume list and non-subsume list are organized makes the algorithm skip quite a few steps when recursing with the sets in the subsume list.

- The new algorithm improves the efficiency of mining this array of frequent closed itemsets, contributing to solving the problem of association rule mining with a wide range of applications, including customer shopping behavior analysis, web retrieval sequences, scientific experiments, disease treatment, natural disaster prevention, and protein formation, etc.

5. Research methods and research objects

Research method :

- Document research method : based on published documents of researchers on algorithms for mining frequent itemsets, closed itemsets, and Scatter mining: Apriori-Gen, FP-tree, Charm, CharmL, MG-Charm, DCI-Closed, LCM, DBV-Miner, GENCLOSE, NAFCP. Analyze how to use DBV, DSBV data structures, how to organize the database (horizontal or vertical), how to generate new candidate patterns, Scatter mining techniques... and the development trends of algorithms.

- Experimental method : Implement and experiment the methods proposed in the thesis to determine the correctness, feasibility and development compared with the published methods of domestic and foreign authors related to the thesis.

- Statistical and data analysis methods : Collect and synthesize data during the experimental process to analyze and evaluate, thereby recognizing, detecting, and selecting advantages to promote, finding ways to overcome limitations, and at the same time combining the collected related information into

a complete logic to propose a new algorithm that simultaneously mines frequent closed itemsets and their generators faster and is more memory efficient (with experimental comparison).

 Research object : BVCL algorithm exploits the array of closed frequent itemsets, dynamic bit vector structure - DBV, dynamic bit vector superset structure DSBV used to store and transmit information of supersets, mapping techniques, indexes to increase the search speed of the algorithm.

6. Difficulties and Challenges

Given a database D and a user-defined minSup threshold , finding all frequent closed itemsets is a significant time challenge due to the exponential complexity of mining.

The length of the bit vector increases with the number of transactions in the database. This consumes a lot of memory. Dynamic bit vector structures save a lot of memory for sparse databases, however, for dense databases, the savings are insignificant. On the datasets used in this thesis, the memory savings are more than 50% compared to static bit vector structures.

7. Objectives and scope of the thesis

The ultimate goal of frequent itemset mining is to mine association rules and apply the results in practice.

In the traditional way of mining association rules, finding all association rules from the database that satisfy minSup and minConf is disadvantageous when the number of frequent itemsets is large. Therefore, there is a need for a suitable method to mine with a smaller number of rules but still ensure full integration of all rules of the traditional mining method. One of those approaches is to mine the most essential rules, only keeping the rules with the minimum left side and the maximum right side (according to the parent-child relationship).

The objective of the thesis is to focus on researching, analyzing the advantages and limitations of closed itemset mining algorithms, closed itemsets on Arrays. With the desire to reduce the time for rule mining, using parent-child relationships on Arrays to reduce the cost of considering parent-child relationships and thus reduce the time for rule mining.

Thereby, it can be seen that the BVCL algorithm with DBV, DSBV data structures is effective and powerful, suitable for research and application. It is proposed to integrate and improve the original algorithm to simultaneously exploit Closed Frequent Sets and Generators, which is necessary for future association rule mining work. Finally, compare the experimental results of the improved BVCL algorithm with the original algorithm and CharmL, MGCharm algorithms on synthetic and real databases.

8. Contribution of the thesis

The thesis has improved the author's original algorithm by combining and optimizing the simultaneous exploitation of the generators into the original algorithm. The improved algorithm is not faster than the original algorithm in terms of mining the closed frequent itemset array. However, the exploitation of the generators is very important in the process of mining association rules. If the original algorithm separately mines the closed frequent itemset array and the generators, the algorithm that the thesis improves is much faster and more effective. The comparison issue will be mentioned separately in the comparison and evaluation section.

Chapter 2: THEORETICAL BASIS

1. Problem overview

D: are transactions in the Database.

Given a set of transactions T = { t 1 , t 2... t n }. Each transaction t i (1 ≤ i ≤ n) is identified by a key called Tid.

Tidset(X): A set of transactions containing set X.

Itemset(Y): Represents a set of Items that appear in the transactions of Tidset (Y).

𝑖=0

𝑇𝑖𝑑𝑠𝑒𝑡 ( 𝑋 ) = ⋂ 𝑘 𝑡𝑖𝑑𝑠𝑒𝑡(𝑥 𝑖 ) , 𝑣 ì 𝑖 {𝑥1, 𝑥2, . . . , 𝑥𝑘} ⊆ 𝐼 , I is Itemset

𝐼𝑡𝑒𝑚𝑠𝑒𝑡 ( 𝑌 ) = ⋂ 𝑝 𝐼𝑡𝑒𝑚𝑠𝑒𝑡(𝑦 ), 𝑣 ì 𝑖 𝑌 = {𝑦1, 𝑦2, . . . , 𝑦𝑝} ⊆ 𝑇 , T is Tidset

𝑗=0 𝑖

And: I = { i 1 , i 2 … i m }

The number of transactions in tidset (X) corresponding to the support of X is represented by sup (X).

An Itemset X is called frequent if Sup(X) ≥ Minsup.

The support of the association rule X =>Y is the frequency of transactions containing all items in both sets X and Y. For example , the support of rule X =>Y is 5% which means that 5% of transactions X and Y are purchased together.

Minsup: is the minimum support that must be determined before generating an association rule.

An itemset X is called a closed itemset if no itemset covering it has the same frequency. In other words. There does not exist an itemset X', X sub X', such that sup (X) = sup (X').

Consider X and Y as two closed frequent itemsets.

- Y is called a frequent closed superset (FCS) of X if X ⊂ Y.

- Y is a minimal closed superset of X if there does not exist a closed itemset Z such that X ⊂ Z ⊂ Y.

- The set G is called the generator of the closed set X, if and only if G ⊆ X

and Sup(G) = Sup(X).

If Y is the minimal cover of a closed set X, then node Y is considered a child of node X, and there is an edge from X  Y.

An association rule (AR) is represented as X ⇒ ( Y − X ): X, Y are 2 frequent itemsets and X ⊂ Y.

Mining Frequent closed itemset lattice (FCIL) is to arrange the Nodes of frequent closed itemsets, connect the Nodes in pairs in the Lattice and create a Parent-Child relationship. We also simultaneously mine the generators corresponding to the frequent closed itemsets in the Lattice.

2. Approach to mining frequent closed itemsets

A large number of research works focus on developing algorithms to mine frequent closed itemsets. Most of these techniques can be classified into four categories depending on the particular strategy.

- First is the testing and development strategy according to APRIORI-GEN [19]. The procedure for testing the closure of a set of items.

- The second is to use the FP-tree data structure [20].

- The third is to use hybrid search techniques. CHARM [9] is a typical example of this approach, which exploits not only item-space but also tid-space.

- The fourth is DCI-Closed [21] and LCM [22], which are different from the above three. Both of these approaches overcome the testing cost problem to eliminate the generation of duplicates of closed itemsets. They traverse the search space depth-first and generate new frequent closed itemsets by expanding previously discovered closed itemsets. Furthermore, they do not require memory storage for previously discovered closed itemsets. LCM [22] applies the co-occurrence and hybrid version of diffset techniques [9] to compute frequentness and closeness based on the characteristics of the dataset.

FCI mining algorithms can be divided into two basic categories depending on the database representation, horizontal (tid × itemset) and vertical (item × tidset). It is observed from the existing literature [23] that algorithms using vertical database representation outperform those using horizontal representation. Table 2.2 is an encoded representation of Table 2.1 where the Items are represented by letters, and are represented horizontally. ECLAT [10]. Zaki introduced vertical database.

Table 2.1: Book store transaction database.

TID

Itemset
t 1	Algorithms, Logic Design, Data Mining, Programming in C, Networking
t 2	Logic Design, Compiler, Programming in C, Microprocessor
t 3	Algorithms, Logic Design, Data Mining, Networking, Programming in C
t 4	Algorithms, Logic Design, Compiler, Programming in C, Microprocessor, Networking
t 5	Algorithms, Logic Design, Compiler, Data Mining, Programming in C, Microprocessor, Networking
t 6	Compiler, Data Mining, Microprocessor, Networking, Human Computer Interaction, Logic Design