In difficult classification problems in which z-dimensional
points with 0-1 responses are to be divided into two groups, the data
structure is often too messy for a clean separation, and it is more
favorable to search for the denser regions of response 1 points than
to find boundaries that separate the two groups. For such problems,
which are often seen in customer databases, we have developed a bump
hunting method using probabilistic and statistical methods, as shown
in the previous study. By specifying a pureness rate in advance, a
maximum capture rate can be obtained.
To find the maximum capture rate, we have used the decision
tree method combined with the genetic algorithm. A tradeoff curve
between the pureness rate and the capture rate can then be constructed.
However, such a tradeoff curve could be optimistic if the training
data set alone is used. Therefore, we should be careful in assessing
the accuracy of the tradeoff curve. Using accuracy evaluation
procedures such as cross-validation or the bootstrapped hold-out
method, applied to separate training and test data sets, we have shown
that a tradeoff curve applicable in practice can be obtained. We
have also shown that an attainable upper-bound tradeoff curve
can be estimated by using extreme-value statistics, because
the genetic algorithm provides many local maxima of the capture
rate from different initial values. We have constructed three
kinds of tradeoff curves: the first is the curve obtained by using
the training data; the second is the return capture rate curve
obtained by using extreme-value statistics; the last is the
curve obtained by using the test data. These three are indispensable,
like the Trinity, for comprehending the whole picture of the tradeoff
curve between the pureness rate and the capture rate. This paper
deals with the behavior of the tradeoff curve from a statistical
viewpoint.
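
As a minimal illustrative sketch (not the authors' implementation), the
two quantities traded off above can be computed for a single candidate
region. It assumes that the pureness rate is the fraction of response 1
points among all points inside the region, and that the capture rate is
the fraction of all response 1 points that fall inside the region. The
function name, the axis-aligned box region, and the toy data are
hypothetical and chosen only for illustration.

```python
import numpy as np

def pureness_and_capture(X, y, lower, upper):
    """Return (pureness rate, capture rate) for an axis-aligned box.

    Assumed definitions (illustrative only):
      pureness = (# response 1 points in box) / (# points in box)
      capture  = (# response 1 points in box) / (# response 1 points overall)
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=int)
    # A point is inside the box if every coordinate lies within the bounds.
    inside = np.all((X >= lower) & (X <= upper), axis=1)
    n_inside = int(inside.sum())
    n_ones = int((y == 1).sum())
    ones_inside = int((inside & (y == 1)).sum())
    # Guard against empty regions or data sets with no response 1 points.
    pureness = ones_inside / n_inside if n_inside else 0.0
    capture = ones_inside / n_ones if n_ones else 0.0
    return pureness, capture

# Toy data: five 2-dimensional points with 0-1 responses (hypothetical).
X = np.array([[0.1, 0.2], [0.4, 0.5], [0.6, 0.4], [0.9, 0.9], [0.5, 0.6]])
y = np.array([0, 1, 1, 1, 0])
p, c = pureness_and_capture(X, y, lower=[0.3, 0.3], upper=[0.7, 0.7])
print(p, c)
```

Evaluating many candidate regions and keeping, for each required
pureness rate, the largest capture rate found would trace out one
point of a training-data tradeoff curve; the search procedure itself
is not shown here.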
