In classification problems that are difficult because of a messy data
structure, where z-dimensional points are to be divided into two groups
by their 0/1 responses, it is more favorable to search for regions dense
in response-1 points than to find boundaries that separate
the two groups. For such problems, which are often seen in customer
databases, we developed a bump hunting method using probabilistic
and statistical methods in a previous study. Once a pureness rate is
specified in advance, the maximum capture rate can be obtained.
To find the maximum capture rate, we used the decision
tree method combined with a genetic algorithm. A trade-off
curve between the pureness rate and the capture rate can then be constructed.
However, such a trade-off curve may be optimistic if the training
data set alone is used, so we should be careful in assessing
the accuracy of the trade-off curve. Using accuracy evaluation
procedures such as cross-validation or the bootstrapped hold-out
method, combined with training and test data sets, we have shown
that an actually applicable trade-off curve can be obtained. We
have also shown that an attainable upper-bound trade-off curve
can be estimated using extreme-value statistics, because
the genetic algorithm yields many local maxima of the capture
rate from different initial values. We have constructed three
kinds of trade-off curves: the first is the curve obtained by using
the training data; the second is the return capture rate curve
obtained by using extreme-value statistics; the last is the
curve obtained by using the test data. These three are indispensable,
like the Trinity, for comprehending the whole picture of the trade-off
curve between the pureness rate and the capture rate. This paper
deals with the behavior of the trade-off curve from a statistical
viewpoint.
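
As an illustration of the two quantities traded off above, the following is a minimal sketch and not the procedure used in this study. It assumes that the pureness rate of a region is the fraction of response-1 points among all points falling in the region, and that the capture rate is the fraction of all response-1 points that fall in the region. The axis-aligned box region, the function names, and the toy data are hypothetical and serve only as an example.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
Box = Sequence[Tuple[float, float]]  # one (low, high) interval per dimension


def in_box(x: Point, box: Box) -> bool:
    """Return True if point x lies inside the axis-aligned box (assumed region shape)."""
    return all(lo <= xi <= hi for xi, (lo, hi) in zip(x, box))


def pureness_and_capture(
    points: List[Point], responses: List[int], box: Box
) -> Tuple[float, float]:
    """Compute (pureness rate, capture rate) of a box under the assumed definitions.

    pureness = (# response-1 points in box) / (# points in box)
    capture  = (# response-1 points in box) / (# response-1 points overall)
    """
    inside = [y for x, y in zip(points, responses) if in_box(x, box)]
    ones_inside = sum(inside)
    total_ones = sum(responses)
    pureness = ones_inside / len(inside) if inside else 0.0
    capture = ones_inside / total_ones if total_ones else 0.0
    return pureness, capture


# Hypothetical toy data: seven 2-dimensional points with 0/1 responses.
points = [(0.1, 0.2), (0.4, 0.5), (0.45, 0.55), (0.9, 0.9),
          (0.5, 0.4), (0.8, 0.1), (0.7, 0.8)]
responses = [0, 1, 1, 1, 0, 0, 1]
box = [(0.3, 0.6), (0.3, 0.6)]  # candidate bump region

pureness, capture = pureness_and_capture(points, responses, box)
print(f"pureness rate = {pureness:.3f}, capture rate = {capture:.3f}")
```

Shrinking the box toward the densest response-1 cluster tends to raise the pureness rate while lowering the capture rate, which is the trade-off the curves describe.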