In difficult classification problems of the zdimensional points
into two groups having 01 responses due to the messy data structure,
it is more favorable to search for the denser regions for the response
1 assigned points than to find the boundaries to separate the two
groups. To such problems often seen in customer databases, we have
developed a bump hunting method using probabilistic and statistical
methods as shown in the previous study. By specifying a pureness
rate in advance, a maximum capture rate will be obtained. In finding
the maximum capture rate, we have used the decision tree method
combined with the genetic algorithm.
Then, a tradeoff curve between the pureness rate and the capture
rate can be constructed.
However, such a tradeoff curve could be optimistic if the training
data set alone is used; we might obtain an illusional curve when
the number of samples is small and the number of explanation variables
is large to some extent.
Therefore, we should be careful in assessing the accuracy of the
tradeoff curve.
Using the accuracy evaluation procedures such as the cross validation
or the bootstrapped holdout method combined with the training
and test data sets, we have shown that the actually applicable
tradeoff curve can be obtained.
We have also shown that an attainable upper bound tradeoff curve
can be estimated by using the extremevalue statistics because
the genetic algorithm provides many local maxima of the capture
rates with different initial values. However, this curve may also
be optimistic because we use the training data set alone.
In this paper, we propose a new genetic algorithm to tackle this
problem. In the procedure of the new genetic algorithm, the evaluation
to each evolution procedure is applied to the test data results
rather than to the training data results. Due to this expansion
for the genetic algorithm, we can expect the accuracy for the actually
applicable upper bound for the tradeoff curve by applying the
extremevalue statistics by the new accuracy evaluation procedures
such as the cross validation or the bootstrapped holdout.

data
mining, customer database, decision tree, genetic algorithm,
extremevalue statistics, tradeoff curve, accuracy assessment,
bootstrapped holdout, new genetic algorithm.

