Using decision trees to classify imbalanced data - 9


Table 4-17: Contraceptive Method Choice Dataset Results


Classifier    TPR      FPR      AUC
SC4.5         0.225    0.076    0.574
CSC4.5        0.333    0.092    0.621
AUC4.5        0.661    0.430    0.616

(AUC4.5 values are the TPR mean, FPR mean and AUC mean over the 10 test runs.)


The AUC4.5 algorithm gives a slightly lower AUC mean (0.616) than the CSC4.5 algorithm (0.621), though higher than SC4.5 (0.574). With a much higher TPR mean (0.661), however, the AUC4.5 algorithm classifies the minority class more accurately than both SC4.5 and CSC4.5, although the absolute results are not high.

In addition, the attributes of the Contraceptive Method Choice dataset are continuous-valued, which strongly influences the classification process. The test results have a standard deviation of 0.02028.

Tic-Tac-Toe Endgame: Discrete attributes = 9, minority class ratio = 34.62%.


Table 4-18: Results of 10 tests on the Tic-Tac-Toe Endgame dataset


Test    TPR      FPR      AUC
1       0.745    0.105    0.820
2       0.807    0.104    0.851
3       0.708    0.070    0.819
4       0.779    0.098    0.840
5       0.748    0.101    0.823
6       0.776    0.105    0.835
7       0.794    0.112    0.841
8       0.753    0.097    0.828
9       0.785    0.126    0.829
10      0.764    0.151    0.807
Mean    0.766    0.107    0.829

Variance = 0.00017; standard deviation = 0.01285

Source: author's research
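The mean, variance, and standard deviation reported in Table 4-18 can be reproduced from the ten AUC values. A minimal sketch (assuming, as the numbers confirm, that the variance column is the sample variance of the AUC column, with n - 1 in the denominator):

```python
# Ten AUC values from Table 4-18, one per test run.
auc_runs = [0.820, 0.851, 0.819, 0.840, 0.823, 0.835,
            0.841, 0.828, 0.829, 0.807]

n = len(auc_runs)
auc_mean = sum(auc_runs) / n
# Sample variance (n - 1 denominator) reproduces the table's 0.00017.
variance = sum((x - auc_mean) ** 2 for x in auc_runs) / (n - 1)
std_dev = variance ** 0.5

print(round(auc_mean, 3), round(variance, 5), round(std_dev, 5))
# 0.829 0.00017 0.01285
```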


Table 4-19: Tic-Tac-Toe Endgame dataset results


Classifier    TPR      FPR      AUC
SC4.5         0.631    0.062    0.784
CSC4.5        0.640    0.062    0.789
AUC4.5        0.766    0.107    0.829

(AUC4.5 values are the TPR mean, FPR mean and AUC mean over the 10 test runs.)

The AUC4.5 algorithm gives better classification results in TPR and AUC on this imbalanced dataset, at the cost of a somewhat higher FPR. Again, this confirms that datasets with discrete-valued attributes give better classification results than datasets with continuous-valued attributes. The standard deviation (0.01285) is quite small.
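For reference, the three metrics compared in these tables follow directly from the confusion matrix, with the minority class as the positive class. A minimal sketch with illustrative counts (not taken from the experiments); the AUC here is the single-operating-point approximation (1 + TPR - FPR)/2:

```python
# TPR, FPR, and a single-point AUC from confusion-matrix counts.
# The counts passed below are illustrative, not thesis results.
def classification_metrics(tp, fn, fp, tn):
    tpr = tp / (tp + fn)         # minority-class recall
    fpr = fp / (fp + tn)         # majority examples misclassified as minority
    auc = (1 + tpr - fpr) / 2    # AUC of a single-point ROC curve
    return tpr, fpr, auc

tpr, fpr, auc = classification_metrics(tp=76, fn=24, fp=11, tn=89)
print(round(tpr, 2), round(fpr, 2), round(auc, 3))  # 0.76 0.11 0.825
```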

4.5 Evaluation of experimental results


From the experimental results, analyzed on eight datasets, each tested 10 times on the D Test set, and taking the average results for the TPR mean, FPR mean and AUC mean indices (Table V) and the variance and standard deviation (Table IV), we make the following comments:

+ The imbalance ratio between classes does not greatly affect the classification results of the proposed algorithm AUC4.5.

+ For data sets with discrete-valued attributes:


- The algorithm gives good classification results for the minority class on imbalanced datasets.


- All of these datasets give good classification results, superior to the two algorithms SC4.5 and CSC4.5. In particular, the Car Evaluation and Mushroom datasets are classified with 100% accuracy.

- The standard deviation of the Car Evaluation and Mushroom sets is zero, and the deviation of the Nursery and Tic-Tac-Toe Endgame sets is small, demonstrating the stability of the algorithm on data with discrete-valued attributes.

+ For data sets with attributes having continuous values:


- Only the Ecoli dataset has higher classification results than both the SC4.5 and CSC4.5 algorithms. However, the standard deviation of the Ecoli dataset is quite high, second only to the Wine Quality – Red dataset, indicating that this type of data needs to be examined carefully.


- The remaining three datasets, Wine Quality – Red, Wine Quality – White and Contraceptive Method Choice, have a higher TPR mean than the SC4.5 and CSC4.5 algorithms. Setting aside the higher FPR mean (majority-class examples misclassified into the minority class), the AUC mean remains high, so the AUC4.5 algorithm has achieved its goal of improving classification accuracy for the minority class in imbalanced datasets.


- The standard deviations of the four continuous-valued datasets are the highest among the eight datasets, at 0.02028, 0.02631, 0.03022 and 0.09520 respectively. This shows that stability and the data distribution of continuous-valued datasets are issues to consider.


CHAPTER 5. CONCLUSION AND DEVELOPMENT DIRECTION


In this thesis, the AUC4.5 algorithm is developed as an improvement of the C4.5 algorithm, using the AUC value instead of gain entropy in the tree splitting and pruning criteria to improve classification performance on imbalanced data, specifically on the minority class, and is suited to binary imbalanced classification. Experimental results on eight real imbalanced datasets from the UCI machine learning repository [28] show that the improved AUC4.5 algorithm achieves better classification performance than the SC4.5 and CSC4.5 algorithms. This confirms the value of using the AUC directly during training on datasets whose characteristics affect the classification process. Notably, the improved method does not sacrifice the FPR value to raise the TPR value in pursuit of the highest AUC.
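The splitting criterion described above can be sketched as follows: each candidate binary split is scored by the AUC of the simple two-leaf classifier it induces, and the best-scoring attribute and threshold are chosen. This is an illustrative reconstruction under simplifying assumptions (binary splits only, single-point AUC), not the thesis implementation:

```python
# Score one binary split by the AUC of the two-leaf classifier it induces.
def split_auc(left_labels, right_labels, minority=1):
    # The leaf with the higher minority fraction predicts the minority class.
    def rate(labels):
        return labels.count(minority) / len(labels) if labels else 0.0
    pos_leaf, neg_leaf = left_labels, right_labels
    if rate(right_labels) > rate(left_labels):
        pos_leaf, neg_leaf = right_labels, left_labels
    tp = pos_leaf.count(minority)
    fp = len(pos_leaf) - tp
    fn = neg_leaf.count(minority)
    tn = len(neg_leaf) - fn
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return (1 + tpr - fpr) / 2  # single-point AUC of this split

def best_split(rows, labels):
    # rows: list of feature vectors; try every attribute/value threshold.
    best = (0.0, None, None)  # (auc, attribute index, threshold)
    for a in range(len(rows[0])):
        for t in sorted(set(r[a] for r in rows)):
            left = [l for r, l in zip(rows, labels) if r[a] <= t]
            right = [l for r, l in zip(rows, labels) if r[a] > t]
            if not left or not right:
                continue
            score = split_auc(left, right)
            if score > best[0]:
                best = (score, a, t)
    return best

# A perfectly separable toy set: attribute 0 <= 2 isolates the minority class.
print(best_split([[1], [2], [3], [4]], [1, 1, 0, 0]))  # (1.0, 0, 2)
```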

The proposed method does not need to assign different costs, such as misclassification costs in cost-sensitive learning, so training time is shorter while classification performance is better.

The method improves the correct classification rate for the minority class in imbalanced datasets. However, continuous-valued data remains an issue that needs to be considered and preprocessed before classification with the AUC4.5 algorithm.
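One common preprocessing direction for this issue is to discretize continuous attributes before training. A minimal sketch using unsupervised equal-width binning (an assumed choice for illustration; the thesis does not prescribe a particular method):

```python
# Equal-width discretization: map each continuous value to one of n_bins
# integer bin labels, so the attribute becomes discrete-valued.
def equal_width_bins(values, n_bins=4):
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant attribute
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

print(equal_width_bins([1.0, 2.0, 6.0, 9.0], n_bins=2))  # [0, 0, 1, 1]
```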

With the results achieved by the algorithm, applying it to medical diagnostic systems would improve diagnostic efficiency, and applying it to intrusion and attack detection would improve the efficiency of system monitoring. However, no method is currently optimal for all real datasets, and this is accepted in the data mining field. Based on this research and its results, we see many issues that merit further research and development, contributing to the field of imbalanced data classification in particular and data mining in general.


REFERENCES


[1] J. R. Quinlan, “Induction of Decision Trees,” Mach. Learn., vol. 1, no. 1, pp. 81–106, 1986.

[2] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Elsevier/Morgan Kaufmann, 2012.

[3] I. H. Witten, E. Frank, and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, Third Edition. 2011.

[4] V. Ganganwar, “An overview of classification algorithms for imbalanced datasets,” Int. J. Emerg. Technol. Adv. Eng., vol. 2, no. 4, pp. 42–47, 2012.


[5] Y. Yang and G. Ma, “Ensemble-based active learning for class imbalance problem,” J. Biomed. Sci. Eng., vol. 3, no. 10, pp. 1022–1029, Oct. 2010.

[6] B. Zadrozny, J. Langford, and N. Abe, “Cost-sensitive learning by cost-proportionate example weighting,” in Third IEEE Int. Conf. on Data Mining, 2003, pp. 435–442.

[7] Y. Tang, S. Krasser, D. Alperovitch, and P. Judge, “Spam Sender Detection with Classification Modeling on Highly Imbalanced Mail Server Behavior Data,” in Proc. of Int. Conf. on Artificial Intelligence and Pattern Recognition, 2008, pp. 174–180.

[8] V. Engen, “Machine learning for network based intrusion detection,” Bournemouth University, 2010.

[9] X.-Y. Liu, J. Wu, and Z.-H. Zhou, “Exploratory Under-Sampling for Class-Imbalance Learning,” in Sixth Int. Conf. on Data Mining (ICDM '06), 2006, pp. 965–969.

[10] S.-J. Yen and Y.-S. Lee, “Cluster-based under-sampling approaches for imbalanced data distributions,” Expert Syst. Appl., vol. 36, no. 3, pp. 5718–5727, Apr. 2009.

[11] N. M. Phuong, T. T. Anh Tuyet, N. T. Hong, and D. X. Tho, “Random Border Undersampling: A new algorithm to reduce random elements on the border in imbalanced data,” in FAIR - Basic and Applied Research in Information Technology, 2015.


[12] N. Japkowicz, “Learning from Imbalanced Data Sets: A Comparison of Various Strategies,” AAAI Workshop on Learning from Imbalanced Data Sets, vol. 68, pp. 10–15, 2000.

[13] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” J. Artif. Intell. Res., vol. 16, pp. 321–357, 2002.

[14] H. Han, W.-Y. Wang, and B.-H. Mao, “Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning,” Springer, Berlin, Heidelberg, 2005, pp. 878–887.

[15] G. Weiss, K. McCarthy, and B. Zabar, “Cost-sensitive learning vs. sampling: Which is best for handling unbalanced classes with unequal error costs?,” DMIN, pp. 1–7, 2007.

[16] C. Drummond and R. C. Holte, “Exploiting the Cost (In)sensitivity of Decision Tree Splitting Criteria,” Int. Conf. Mach. Learn., vol. 1, no. 1, pp. 239–246, 2000.

[17] W. Fan, S. Stolfo, J. Zhang, and P. Chan, “AdaCost: Misclassification Cost-Sensitive Boosting,” in Proc. 16th Int. Conf. Mach. Learn. (ICML '99), 1999, pp. 97–105.

[18] Y. Sun, M. S. Kamel, A. K. C. Wong, and Y. Wang, “Cost-sensitive boosting for classification of imbalanced data,” Pattern Recognit., vol. 40, no. 12, pp. 3358–3378, 2007.

[19] H. Guo and H. L. Viktor, “Learning from Imbalanced Data Sets with Boosting and Data Generation: The DataBoost-IM Approach,” ACM SIGKDD Explor. Newsl., special issue on learning from imbalanced datasets, vol. 6, no. 1, pp. 30–39, 2004.

[20] M. A. Maloof, “Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown,” Analysis, vol. 21, no. II, pp. 1263–1284, 2003.

[21] J. R. Quinlan, C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.

[22] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognit. Lett., vol. 27, no. 8, pp. 861–874, 2006.

[23] M. R. Tolun and S. M. Abu-Soud, “An Inductive Learning Algorithm for Production Rule Discovery,” 1999.

[24] P. T. Huan and L. H. Bac, “Mining frequent itemsets from transaction data with multiple minimum frequent thresholds on multi-core processors,” Can Tho Univ. J. Sci., vol. CN, p. 155, Oct. 2017.

[25] A. Tran, T. Truong, and L. H. Bac, “Efficiently mining association rules based on maximum single constraints,” Vietnam J. Comput. Sci., vol. 4, no. 4, pp. 261–277, Nov. 2017.

[26] D. Nguyen, B. Vo, and L. H. Bac, “CCAR: An efficient method for mining class association rules with itemset constraints,” Eng. Appl. Artif. Intell., vol. 37, pp. 115–124, Jan. 2015.

[27] M. R. Tolun, H. Sever, M. Uludag, and S. M. Abu-Soud, “ILA-2: An Inductive Learning Algorithm For Knowledge Discovery,” Cybern. Syst., vol. 30, no. 7, pp. 609–628, Oct. 1999.

[28] C. L. Blake and C. J. Merz, “UCI Repository of machine learning databases,” Univ. of California, http://archive.ics.uci.edu/ml/, 1998.

[29] J.-S. Lee, J. Lee, and B. Gu, “AUC-based C4.5 decision tree algorithm for imbalanced data classification,” 2016.
