The accuracy of the trade-off curve in the bump hunting


Hideo Hirose, Takahiro Yukizane


"7th Hawaii International Conference on Statistics, Mathematics and Related Fields", January 17 (Thursday) to January 19 (Saturday), 2008 at the Waikiki Beach Marriott Resort & Spa in Honolulu


In difficult classification problems of the z-dimensional points into two groups having 0-1 responses due to the messy data structure, it is more favorable to search for the denser regions for the response 1 assigned points than to find the boundaries to separate the two groups. To such problems often seen in customer databases, we have developed a bump hunting method using probabilistic and statistical methods as shown in the previous study. By specifying a pureness rate in advance, a maximum capture rate will be obtained. In finding the maximum capture rate, we have used the decision tree method combined with the genetic algorithm.
Then, a trade-off curve between the pureness rate and the capture rate can be constructed.
However, such a trade-off curve could be optimistic if the training data set alone is used; we might obtain an illusional curve when the number of samples is small and the number of explanation variables is large to some extent.
Therefore, we should be careful in assessing the accuracy of the trade-off curve.
Using the accuracy evaluation procedures such as the cross validation or the bootstrapped hold-out method combined with the training and test data sets, we have shown that the actually applicable trade-off curve can be obtained.
We have also shown that an attainable upper bound trade-off curve can be estimated by using the extreme-value statistics because the genetic algorithm provides many local maxima of the capture rates with different initial values. However, this curve may also be optimistic because we use the training data set alone.
In this paper, we propose a new genetic algorithm to tackle this problem. In the procedure of the new genetic algorithm, the evaluation to each evolution procedure is applied to the test data results rather than to the training data results. Due to this expansion for the genetic algorithm, we can expect the accuracy for the actually applicable upper bound for the trade-off curve by applying the extreme-value statistics by the new accuracy evaluation procedures such as the cross validation or the bootstrapped hold-out.


Key Words
data mining, customer database, decision tree, genetic algorithm, extreme-value statistics, trade-off curve, accuracy assessment, bootstrapped hold-out, new genetic algorithm.



Times Cited in Web of Science:

Cited in Books: