Assessment of the prediction accuracy in the bump hunting procedure


H. Hirose, S. Ohi, and T. Yukizane


"6th Hawaii International Conference on Statistics, Mathematics and Related Fields", January 17-19, 2007 at the Waikiki Beach Marriott Resort & Spa in Honolulu


In a series of previous studies (Hirose 2005a, Hirose 2005b, Hirose et al. 2006a, Yukizane et al. 2006), we developed a bump hunting method using probabilistic and statistical methods. The aim is to find regions where response 1 is denser, among regions in which responses 0 and 1 are mixed, under the condition that the pureness rate is pre-specified; the pureness rate is the ratio of the number of response 1 points to the number of response 0 and 1 points in the target region. A global maximum capture rate is estimated from locally obtained maxima of the captured response 1 points, where the capture rate is the ratio of the number of response 1 points in the defined region to the number of response 1 points in the whole data. The accuracy of the estimated maximum capture rate was assessed by the simple bootstrap method without a correction formula. We had not seriously considered the bias and the variance of the predicted estimate.
However, we are now aware that the value of the predicted estimate should be treated very carefully. In the previous studies, we used all the data as training (learning) data. As is well known in machine learning, the model that fits the training data best is not necessarily accurate on the test data. A typical study assessing the misclassification error in classification problems is reported by Kohavi (1995), and a more general explanation is found in the literature (Hastie et al. 2001), where the test sample method, cross-validation, and the bootstrap method are often used to assess the accuracy of the predicted estimate and the model. This kind of assessment should be applied to our problem too. However, the bump hunting method that we have proposed poses difficulties, because we are interested not only in estimating the maximum capture rate but also in finding the corresponding (binary decision) rule that captures the response 1 points. Since the rule should also capture a large number of response 1 points in future data, the combination of the maximum capture rate and the corresponding rule becomes important. In the example in the previous study, however, we found that the rule optimized to capture the maximum number of response 1 points in the training data did not necessarily capture a large number of response 1 points in the test data.
In this paper, we first show how badly the optimality collapses on the test data. We then discuss how to circumvent this difficulty.
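
As a rough illustration of the gap between training and test performance, the following Python sketch tunes a rule on training data and then re-evaluates the same rule on held-out data. It is not the authors' procedure: a random search over axis-aligned boxes stands in for the actual rule search, the capture rate is taken to be the share of all response 1 points that fall inside the box, and the synthetic data, the pureness threshold of 0.6, and all function names are hypothetical choices made for this example.

```python
import random

MIN_PURENESS = 0.6  # assumed pre-specified pureness rate (hypothetical value)


def make_data(n, rng):
    """Synthetic points in the unit square; response 1 is denser in a central box."""
    data = []
    for _ in range(n):
        x, y = rng.random(), rng.random()
        in_bump = 0.3 <= x <= 0.7 and 0.3 <= y <= 0.7
        p1 = 0.7 if in_bump else 0.2
        data.append((x, y, 1 if rng.random() < p1 else 0))
    return data


def rates(data, box):
    """Return (pureness rate, capture rate) of an axis-aligned box rule."""
    x0, x1, y0, y1 = box
    inside = [r for (x, y, r) in data if x0 <= x <= x1 and y0 <= y <= y1]
    ones_inside = sum(inside)
    ones_total = sum(r for (_, _, r) in data)
    # Pureness: response 1 points in the box / all points in the box.
    pureness = ones_inside / len(inside) if inside else 0.0
    # Capture (assumed definition): response 1 points in the box / all response 1 points.
    capture = ones_inside / ones_total if ones_total else 0.0
    return pureness, capture


def search_box(train, min_pureness, n_candidates, rng):
    """Random search (a stand-in for the real rule search): keep the box with
    the largest training capture rate among those meeting the pureness bound."""
    best_box, best_capture = None, -1.0
    for _ in range(n_candidates):
        x0, x1 = sorted((rng.random(), rng.random()))
        y0, y1 = sorted((rng.random(), rng.random()))
        box = (x0, x1, y0, y1)
        pureness, capture = rates(train, box)
        if pureness >= min_pureness and capture > best_capture:
            best_box, best_capture = box, capture
    return best_box


rng = random.Random(0)
train = make_data(500, rng)
test = make_data(500, rng)

# The rule is optimized on the training data only.
best_box = search_box(train, MIN_PURENESS, 2000, rng)

# The same rule is then applied unchanged to the held-out test data.
train_pureness, train_capture = rates(train, best_box)
test_pureness, test_capture = rates(test, best_box)

print(f"train: pureness={train_pureness:.3f} capture={train_capture:.3f}")
print(f"test:  pureness={test_pureness:.3f} capture={test_capture:.3f}")
```

Because the box is selected to look best on the training sample, its training rates tend to be optimistic. The test rates give a more honest estimate and may fall below the pre-specified pureness rate.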


Key Words
bump hunting, genetic algorithm, decision tree, bias, variance, cross-validation, bootstrap, cross-validated-bootstrap, bootstrapped hold-out, prediction error, with relaxation, blood-type human-characteristics classification.


