Suppose that we want to classify points into two groups with 0-1
responses in a $z$-dimensional space that corresponds to $z$
explanatory variables. Many classification tools that use the
misclassification rate to evaluate how well the discrimination is
achieved have been proposed, and they have played a key role in
statistical learning and data mining. In the real world, however, we
often encounter cases in which the misclassification rate remains
considerably high no matter how carefully we search for the
discriminating boundaries, particularly when we want to capture the
points to some extent with rather simple rules. In such messy cases,
the optimization criterion in classification becomes obtaining as many
points of response 1, the response we are interested in, as possible
at a specified pureness rate. A region that achieves this is called a
bump or a hotspot. Our primary objective is to find the bumps.
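As a rough illustration of the two quantities involved, the following minimal sketch computes a pureness rate and a capture rate for a candidate region. The definitions are assumptions, since the abstract does not state them: the pureness rate is taken to be the share of response 1 points among the points inside the region, and the capture rate the share of all response 1 points that fall inside it. The function name `pureness_and_capture` and the toy data are hypothetical.

```python
def pureness_and_capture(responses, in_region):
    """Return (pureness rate, capture rate) of a candidate region.

    responses : list of 0-1 responses, one per point
    in_region : list of booleans, True where the point lies in the region

    Assumed definitions (not stated in the abstract):
      pureness = (# response 1 inside) / (# points inside)
      capture  = (# response 1 inside) / (# response 1 overall)
    """
    ones_inside = sum(1 for r, inside in zip(responses, in_region)
                      if inside and r == 1)
    n_inside = sum(1 for inside in in_region if inside)
    n_ones = sum(1 for r in responses if r == 1)
    pureness = ones_inside / n_inside if n_inside else 0.0
    capture = ones_inside / n_ones if n_ones else 0.0
    return pureness, capture


# Toy data: one explanatory variable, candidate region x < 0.5.
x = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y = [1, 1, 0, 1, 0, 0, 1, 0]
pureness, capture = pureness_and_capture(y, [v < 0.5 for v in x])
```

On this toy data the region holds three of the four response 1 points and one response 0 point, so both rates equal 0.75. Under these assumed definitions, bump hunting looks for a region whose capture rate is as large as possible while the pureness rate stays at or above the specified level.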
To use the boundary information of the bumps with ease, it is
beneficial for the bumps to have simple boundary shapes, such as a
$z$-dimensional box with sides parallel to some of the explanatory
variable axes. The problem, however, is how to find such a boundary
for a bump. Decision tree algorithms find a splitting point on an
explanatory variable so as to construct purer sub-regions, using the
entropy or the Gini index criterion, step by step from the top node to
the bottom leaves. If we regard producing purer regions as equivalent
to excluding the other regions, this splitting criterion may also be
applicable to producing a denser region for response 1 that still
includes response 0 points to some extent. This paper deals with this
point.
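The splitting step described above can be sketched as follows. This is the standard CART-style search for a single split on one explanatory variable, shown only as an illustration and not necessarily the exact procedure used in the paper. The Gini index of a node with response 1 proportion $p$ is taken as $2p(1-p)$, and a split is scored by the size-weighted average over the two child nodes. The function names `gini` and `best_split` and the toy data are hypothetical.

```python
def gini(labels):
    """Gini index 2p(1-p) of a list of 0-1 labels (0.0 for an empty list)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(x, y):
    """Return (threshold, weighted Gini) of the best split 'x <= threshold'."""
    pairs = sorted(zip(x, y))
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best[1]:
            # Place the threshold midway between neighbouring values.
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best = (threshold, score)
    return best


# Toy data: response 1 is dense for small x and absent for large x.
x = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.70, 0.80, 0.90, 0.95]
y = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
threshold, score = best_split(x, y)
```

On this toy data the search selects the threshold 0.5. The left region then contains four response 1 points and one response 0 point, so it is denser in response 1 without being perfectly pure, which is the kind of region the text has in mind.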
Mimicking a real data case from the customer database of a
correspondence course in Japan, we investigate fundamental data cases
in which the marginal distributions for response 1 have rather simple
shapes, such as monotonic or unimodal, and those for response 0 are
almost uniform. The unimodal and monotonic cases are studied by using
normal distributions located at the center and at the corner of the
uniform distributions, respectively. Under such restricted conditions,
we have found that the Gini index can detect the boundaries of the
bumps much as the human eye does.
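A data setting of this kind could be simulated along the following lines for a single explanatory variable. Everything specific here is an assumption made for illustration: the unit interval as the support, the sample sizes, the standard deviations, the redrawing of normal values that fall outside the interval, and the function name `simulate_marginal`. None of these values come from the paper.

```python
import random


def simulate_marginal(n_ones, n_zeros, center, sd, seed=0):
    """Draw one explanatory variable on [0, 1] for both responses.

    Response 0 is uniform on [0, 1]. Response 1 is normal around `center`
    and is redrawn until it falls inside [0, 1]. All parameter values are
    illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n_zeros):
        x.append(rng.uniform(0.0, 1.0))
        y.append(0)
    while len(x) < n_zeros + n_ones:
        draw = rng.gauss(center, sd)
        if 0.0 <= draw <= 1.0:
            x.append(draw)
            y.append(1)
    return x, y


# Unimodal case: normal placed at the center of the uniform support.
x_uni, y_uni = simulate_marginal(200, 800, center=0.5, sd=0.1)
# Monotonic case: normal placed at the corner (edge) of the support.
x_mono, y_mono = simulate_marginal(200, 800, center=0.0, sd=0.2)
```

Repeating the draw independently for each of the $z$ variables would give a $z$-dimensional version of the same setting, again only as a sketch.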
Keywords: Bump Hunting, Capture Rate, Pureness Rate, Boundary
Detection, Decision Tree.