In a series of previous studies (Hirose 2005a, Hirose 2005b,
Hirose et al. 2006a, Yikizane et al. 2006), we have developed
a bump hunting method using probabilistic and statistical methods,
where we are interested in finding regions that are denser in
response 1 among regions in which responses 0 and 1 are mixed, under
the condition that the pureness rate is pre-specified; the pureness
rate is the ratio of the number of response 1 points to the number
of response 0 and 1 points
in the target region. A global maximum capture rate is estimated
by using the locally obtained maximum numbers of captured response 1
points, where the capture rate is the ratio of the number of response 1
points to the number of response 0 and 1 points in the defined region.
The accuracy
of the estimated maximum capture rate was assessed by the
simple bootstrap method without a correction formula. We did not
seriously consider the bias and the variance of the predicted
estimate.
However, we are now aware that the value of the predicted estimate must be treated
very carefully. In the previous studies, we treated all the data
as training (learning) data. As is well known in the machine learning
community, the model that best fits the training data is not necessarily
accurate on test data. A typical study assessing the misclassification
error in classification problems is reported by Kohavi (1995), and a more general
explanation is found in the literature (Hastie et al. 2001), where the test
sample method, cross-validation, and the bootstrap method are often used
to assess the accuracy of the predicted estimate and the model. It is advisable
to apply this kind of assessment to our problem as well. However, difficulties
arise in the bump hunting method that we have proposed, because we are interested
not only in estimating the maximum capture rate but also in finding the corresponding
(binary decision) rule that captures the response 1 points. Since the rule
should also capture a large number of response 1 points in
future data, the combination of the maximum capture rate and the corresponding
rule becomes important. In the example in the previous study, however, we
found that the rule optimized to capture the maximum number of response 1 points in
the training data did not necessarily capture a large number of response 1 points
in the test data.
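As a minimal illustration of this gap, the following Python sketch searches for a one-dimensional interval rule that captures the largest number of response 1 points in the training data subject to a pre-specified pureness rate, and then applies the same rule to held-out data. It is not the procedure used in the previous studies: the synthetic data, the interval-shaped rule, the exhaustive search, and all names (make_data, best_interval, MIN_PURENESS) are assumptions made only for illustration.

```python
import random


def region_stats(points, lo, hi):
    """Return (number of response 1 points, number of all points) with lo <= x <= hi."""
    inside = [y for x, y in points if lo <= x <= hi]
    return sum(inside), len(inside)


def pureness_rate(points, lo, hi):
    """Ratio of response 1 points to all points in the region (0.0 if the region is empty)."""
    ones, total = region_stats(points, lo, hi)
    return ones / total if total else 0.0


def best_interval(points, min_pureness):
    """Exhaustive search over intervals whose endpoints are observed x values.

    Returns (captured response 1 count, lo, hi) for the interval that captures
    the most response 1 points while keeping the pureness rate >= min_pureness,
    or None if no interval qualifies.
    """
    xs = sorted({x for x, _ in points})
    best = None
    for i, lo in enumerate(xs):
        for hi in xs[i:]:
            ones, total = region_stats(points, lo, hi)
            if total and ones / total >= min_pureness:
                if best is None or ones > best[0]:
                    best = (ones, lo, hi)
    return best


def make_data(rng, n):
    """Hypothetical data: response 1 is more likely when x lies in [0.4, 0.6]."""
    data = []
    for _ in range(n):
        x = rng.random()
        p = 0.7 if 0.4 <= x <= 0.6 else 0.2
        data.append((x, 1 if rng.random() < p else 0))
    return data


rng = random.Random(0)          # fixed seed so the run is reproducible
train = make_data(rng, 150)
test = make_data(rng, 150)
MIN_PURENESS = 0.6              # pre-specified pureness rate (assumed value)

ones_train, lo, hi = best_interval(train, MIN_PURENESS)
ones_test, total_test = region_stats(test, lo, hi)

print("rule: %.3f <= x <= %.3f" % (lo, hi))
print("training: captured", ones_train, "pureness", round(pureness_rate(train, lo, hi), 3))
print("test:     captured", ones_test, "pureness", round(pureness_rate(test, lo, hi), 3))
```

Because the interval endpoints are tuned to the training points, the training pureness rate meets the threshold by construction, whereas nothing guarantees that the same rule meets it, or captures a comparable number of response 1 points, on the held-out data.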
In this paper, we first show how badly the optimality collapses on the test data.
We then discuss how to circumvent this difficulty.
Keywords: bump hunting, genetic algorithm, decision tree, bias, variance,
cross-validation, bootstrap, cross-validated-bootstrap, bootstrapped
hold-out, prediction error, with relaxation, blood-type human-characteristics
classification.