Suppose
that we are interested in classifying $n$ points in a $z$-dimensional
space into two groups having response 1 and response 0 as the target
variable. In some real data cases in customer classification, it
is difficult to discriminate the favorable customers showing response
1 from others because many response 1 points and 0 points are closely
located. In such a case, to find the denser regions to the favorable
customers is considered to be an alternative. Such regions are
called the bumps, and finding them is called the bump hunting.
By pre-specifying a pureness rate $p$ in advance a maximum capture rate $c$ could
be obtained; the pureness rate is the ratio of the number of response 1 points
to the total number of points in the target region; the capture rate is the ratio
of the number of response 1 points to the total number of points in the total
regions. Then a trade-off curve between $p$ and $c$ can be constructed. Thus,
the bump hunting is the same as the trade-off curve constructing.
In order to make future actions easier, we adopt simpler boundary shapes for
the bumps such as the union of $z$-dimensional boxes located parallel to some
explanation variable axes; this means that we adopt the binary decision tree.
Since the conventional binary decision tree will not provide the maximum capture
rates because of its local optimizer property, some probabilistic methods would
be required. Here, we use the genetic algorithm (GA) specified to the tree structure
to accomplish this; we call this the tree GA.
The tree GA has a tendency to provide many local maxima of the capture rates
unlike the ordinary GA. According to this property, we can estimate the upper
bound curve for the trade-off curve by using the extreme-value statistics. However,
these curves could be optimistic if they are constructed using the training data
alone. We should be careful in assessing the accuracy of these curves. By applying
the test data, the accuracy of the trade-off curve itself can easily be assessed.
However, the property of the local maxima would not be preserved. In this paper,
we have developed a new tree GA to preserve the property of the local maxima
of the capture rates by assessing the test data results in each evolution procedure.
Then, the accuracy of the trade-off curve and its upper bound curve are assessed. |
|
|
|
|
Data
Mining, Genetic Algorithm, Bump Hunting, Extreme-Value Statistics,
Accuracy
Assessment.
|
|
|