Title
Accuracy assessment for the trade-off curve and its upper bound curve in the bump hunting using the new tree genetic algorithm

Authors

H. Hirose, T. Yukizane and F. Zaman


Source

"7th World Congress in Probability and Statistics", Abstract 159, July 14-19, 2008 at National University of Singapore, Singapore.


Abstract
Suppose that we are interested in classifying $n$ points in a $z$-dimensional space into two groups having response 1 and response 0 as the target variable. In some real data cases in customer classification, it is difficult to discriminate the favorable customers showing response 1 from others because many response 1 points and 0 points are closely located. In such a case, to find the denser regions to the favorable customers is considered to be an alternative. Such regions are called the bumps, and finding them is called the bump hunting.
By pre-specifying a pureness rate $p$ in advance a maximum capture rate $c$ could be obtained; the pureness rate is the ratio of the number of response 1 points to the total number of points in the target region; the capture rate is the ratio of the number of response 1 points to the total number of points in the total regions. Then a trade-off curve between $p$ and $c$ can be constructed. Thus, the bump hunting is the same as the trade-off curve constructing.
In order to make future actions easier, we adopt simpler boundary shapes for the bumps such as the union of $z$-dimensional boxes located parallel to some explanation variable axes; this means that we adopt the binary decision tree. Since the conventional binary decision tree will not provide the maximum capture rates because of its local optimizer property, some probabilistic methods would be required. Here, we use the genetic algorithm (GA) specified to the tree structure to accomplish this; we call this the tree GA.
The tree GA has a tendency to provide many local maxima of the capture rates unlike the ordinary GA. According to this property, we can estimate the upper bound curve for the trade-off curve by using the extreme-value statistics. However, these curves could be optimistic if they are constructed using the training data alone. We should be careful in assessing the accuracy of these curves. By applying the test data, the accuracy of the trade-off curve itself can easily be assessed. However, the property of the local maxima would not be preserved. In this paper, we have developed a new tree GA to preserve the property of the local maxima of the capture rates by assessing the test data results in each evolution procedure. Then, the accuracy of the trade-off curve and its upper bound curve are assessed.

Key Words
Data Mining, Genetic Algorithm, Bump Hunting, Extreme-Value Statistics, Accuracy Assessment.

Citation

@

Times Cited in Web of Science:

Cited in Books: