Learn active learning methods and apply them to spam filtering problems

HANOI NATIONAL UNIVERSITY

UNIVERSITY OF TECHNOLOGY


NGUYEN THI HONG HAU

A STUDY OF ACTIVE LEARNING METHODS AND

THEIR APPLICATION TO THE SPAM FILTERING PROBLEM

Field: Information Technology

Major: Information Systems

Code: 60 48 05

MASTER'S THESIS

SUPERVISOR: DR. NGUYEN TRI THANH

Hanoi - 2011

TABLE OF CONTENTS

LIST OF FIGURES

LIST OF TABLES

CHAPTER I: INTRODUCTION

1.1 Topic introduction

1.1.1 Reasons for choosing the topic

1.1.2 Objectives of the topic

1.1.3 Project implementation stages

1.2 Structure of the thesis

CHAPTER II - OVERVIEW OF ACTIVE LEARNING

2.1 Introduction to active learning

2.2 Active learning method

2.3 Active learning scenarios

2.3.1 Stream-based sampling

2.3.2 Pool-based sampling

2.4 Query strategies in active learning

2.4.1 Uncertainty sampling

2.4.2 Query by committee

2.5 Comparison of active learning and passive learning

2.6 Application domains of active learning

2.7 Conclusion

CHAPTER III - SOME ACTIVE LEARNING ALGORITHMS

3.1 Perceptron-based active learning

3.1.1 Introduction

3.1.2 Perceptron algorithm

3.1.3 Improvements to the perceptron update step

3.1.4 Active modified Perceptron

3.2 Active learning with SVM

3.2.1 Introduction

3.2.2 Support vector machine

3.2.3 Version space

3.2.4 Active learning with SVM

3.3 Conclusion

CHAPTER IV - APPLYING ACTIVE LEARNING TO THE SPAM FILTERING PROBLEM

4.1 Introduction

4.2 Active learning in the spam filtering problem

4.3 Testing and results

4.3.1 Installing the test program

4.3.2 Data collection and representation

4.3.3 Building a data representation and preprocessing program

4.3.4 Test results

4.4 Conclusion

CONCLUSION

REFERENCES

LIST OF FIGURES

Figure 2.1 General diagram of a passive learner

Figure 2.2 General diagram of an active learner

Figure 2.3 Overall diagram of active learning

Figure 3.1 Standard perceptron algorithm

Figure 3.2 Improved version of the standard perceptron algorithm

Figure 3.3 The active learning rule is to query labels for points x in L

Figure 3.4 The active version of the modified Perceptron

Figure 3.5 (a) Simple linear support vector machine. (b) Support vector machine and transductive support vector machine

Figure 3.6 Support vector machine using a degree-5 polynomial kernel function

Figure 3.7 (a) Duality in version space. (b) An SVM classifier on a version space

Figure 3.8 (a) Simple margin for query b. (b) Simple margin for query a

Figure 3.9 (a) MaxMin margin for query b. (b) MaxRatio margin for query e

Figure 4.1 Spam filter using active learning

Figure 4.2 Perceptron/SVM active spam filter

Figure 4.3 Main interface of the program

Figure 4.4 Interface for selecting the data folder

Figure 4.5 Notification of a successful data cleaning process

Figure 4.6 Processing result notification interface

Figure 4.7 Perceptron algorithm results

Figure 4.8 Configuration file structure of the ActiveExperiment program

Figure 4.9 Results of running the SIMPLE algorithm

Figure 4.10 Results of running the SELF_CONF algorithm

Figure 4.11 Results of running the KFF algorithm

Figure 4.12 Results of running the BALANCE_EE algorithm

LIST OF TABLES

Table 4.1 Example content of four emails

Table 4.2 Dictionary and indices for the data in Table 4.1

Table 4.3 Vector representation of the data in Table 4.1

Table 4.4 Results of running 20 queries of the algorithms

CHAPTER I: INTRODUCTION

1.1 Topic introduction

1.1.1 Reasons for choosing the topic

Email has become a powerful tool for exchanging information among organizations, businesses, and individuals. It lets people stay connected to their work and personal lives anywhere and at any time. However, email is also exploited to distribute spam, spread computer viruses, and commit online fraud, causing great damage to users.

Spam is bulk email that the recipient does not expect or want to see, or whose content is irrelevant to the recipient; it is most often used to send advertising. Because it costs relatively little compared with other advertising methods, spam now accounts for a large and growing share of all email sent over the Internet. The growth of spam not only annoys recipients and wastes their time, but also consumes Internet bandwidth and slows down mail servers, causing great economic damage.

Spam is one of the biggest challenges that customers and service providers face today. It has become a professional channel for advertising, spreading viruses, and stealing information, using many highly sophisticated tricks. Users have to spend a lot of time deleting these "uninvited" emails. If they are not careful, their machines can be infected with viruses, trojans, or spyware. More seriously, they can lose information such as credit card and bank account details through emails that trick them into believing the sender is legitimate (phishing).

Many different approaches have been studied and used to eliminate or reduce the impact of spam. Anti-spam solutions are diverse, ranging from legal efforts, such as laws against the dissemination of spam, to technical solutions that detect and block spam at different stages of message creation and delivery.

Spammers continually improve their methods, so to remain effective, anti-spam measures must "learn" how spam patterns change over time. Spam must also be blocked as quickly as possible so that it does not affect other systems and work.

Given these characteristics of email systems, namely user interaction and the constant variation of spam, the thesis studies active learning and assesses its suitability for the spam filtering problem. The topic "Research on active learning methods and applications for spam filtering problems" was undertaken to propose a method for building a spam filter that can "learn" how spam changes and that uses interaction with users, in the form of queries asking them to classify emails, to filter spam effectively and accurately.

Within this scope, the thesis studies several approaches to spam filtering based on active learning methods (active filters). The work includes experiments on real data to clarify the filtering ability of active filters and to compare the effectiveness of the methods used in them.

1.1.2 Objectives of the topic

To eliminate spam, email service providers have integrated many spam filtering programs into their services. These programs are mainly based on machine learning methods, implemented through a learner. In practice, however, email is an online service: messages arrive and change over time, and mailbox users interact with the system. The thesis therefore focuses on studying active learners and applying them to the spam filtering problem.

The thesis combines theoretical research with experimental application. Its goal is to study active learning methods, find solutions to the spam filtering problem, and choose a suitable model that meets the following criteria:

- Filter spam quickly and detect it accurately.

- Take advantage of interaction with mail service users: the classifications they provide increase both the amount and the quality of labeled mail.

- Adapt to variations in spam, so that increasingly sophisticated spam can still be filtered out.

Just as hackers constantly look for ways to defeat anti-virus programs, spammers constantly look for ways to evade spam filters, so spam is continually modified and improved. Applying active learning methods to spam filtering therefore adds to the set of solutions for the broader problem of identifying objects that change over time.

Active spam filtering reduces the cost and time of data collection because it is built on interaction between the learning engine and the user, who identifies each queried message as spam or normal mail.
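The interaction between the learning engine and the user can be illustrated with a small query loop. This is a minimal sketch under stated assumptions, not the implementation used in the thesis: it assumes pool-based sampling, a perceptron learner, and an uncertainty rule that queries the message whose score is closest to zero. The names `active_filter` and `ask_user`, and the three toy features, are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def score(w: Sequence[float], x: Vector) -> float:
    """Signed score of message x under weights w (positive means spam)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def predict(w: Sequence[float], x: Vector) -> int:
    """Return +1 for spam and -1 for normal mail."""
    return 1 if score(w, x) > 0 else -1


def perceptron_update(w: List[float], x: Vector, y: int) -> List[float]:
    """Standard perceptron step: change w only when x is misclassified."""
    if y * score(w, x) <= 0:
        return [wi + y * xi for wi, xi in zip(w, x)]
    return w


def active_filter(
    pool: Sequence[Vector],
    ask_user: Callable[[int], int],
    n_queries: int,
) -> Tuple[List[float], List[int]]:
    """Train a filter by asking the user to label only selected messages."""
    w = [0.0] * len(pool[0])
    unlabeled = list(range(len(pool)))
    queried: List[int] = []
    for _ in range(min(n_queries, len(unlabeled))):
        # Uncertainty rule (assumed): pick the message closest to the boundary.
        i = min(unlabeled, key=lambda j: abs(score(w, pool[j])))
        unlabeled.remove(i)
        y = ask_user(i)  # the user answers +1 (spam) or -1 (normal mail)
        w = perceptron_update(w, pool[i], y)
        queried.append(i)
    return w, queried


# Toy data (hypothetical): features are [contains "free", contains "meeting", bias].
pool = [[1, 0, 1], [0, 1, 1], [1, 0, 1], [0, 1, 1]]
labels = [1, -1, 1, -1]  # stands in for the user's answers

w, queried = active_filter(pool, lambda i: labels[i], n_queries=2)
```

In this toy run the learner asks the user about only two of the four messages, and the resulting weights classify all four correctly, which is the kind of labeling saving that active filtering aims for.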

With these objectives, the thesis focuses on active learning methods and on applying active learners to the spam filtering problem. To test and evaluate the results, the thesis uses experimental programs in which the learners under study are already implemented, collects real data, and builds a program that converts the data into a form suitable for training the learners to detect spam accurately and effectively.

1.1.3 Project implementation stages

The research for the thesis was carried out in the following stages:
