In difficult classification problems of the z-dimensional points
into two groups having 0-1 responses due to the messy data structure,
it is more favorable to search for the denser regions for the response
1 assigned points than to find the boundaries to separate the two
groups. To such problems often seen in customer databases, we have
developed a bump hunting method using probabilistic and statistical
methods as shown in the previous study. By specifying a pureness
rate in advance, a maximum capture rate will be obtained. In finding
the maximum capture rate, we have used the decision tree method
combined with the genetic algorithm.
Then, a trade-off curve between the pureness rate and the capture
rate can be constructed.
However, such a trade-off curve could be optimistic if the training
data set alone is used; we might obtain an illusional curve when
the number of samples is small and the number of explanation variables
is large to some extent.
Therefore, we should be careful in assessing the accuracy of the
trade-off curve.
Using the accuracy evaluation procedures such as the cross validation
or the bootstrapped hold-out method combined with the training
and test data sets, we have shown that the actually applicable
trade-off curve can be obtained.
We have also shown that an attainable upper bound trade-off curve
can be estimated by using the extreme-value statistics because
the genetic algorithm provides many local maxima of the capture
rates with different initial values. However, this curve may also
be optimistic because we use the training data set alone.
In this paper, we propose a new genetic algorithm to tackle this
problem. In the procedure of the new genetic algorithm, the evaluation
to each evolution procedure is applied to the test data results
rather than to the training data results. Due to this expansion
for the genetic algorithm, we can expect the accuracy for the actually
applicable upper bound for the trade-off curve by applying the
extreme-value statistics by the new accuracy evaluation procedures
such as the cross validation or the bootstrapped hold-out.
|
data
mining, customer database, decision tree, genetic algorithm,
extreme-value statistics, trade-off curve, accuracy assessment,
bootstrapped hold-out, new genetic algorithm.
|
|