Suppose that we are interested in classifying $n$ points in a
$z$-dimensional space into two groups having response 1 and response
0 as the target variable. In some real data cases in customer classification,
it is difficult to discriminate the favorable customers showing
response 1 from others because many response 1 points and 0 points
are closely located. In such a case, to find the denser regions
to the favorable customers is considered to be an alternative.
Such regions are called the bumps, and finding them is called the
bump hunting.
By pre-specifying a pureness rate $p$ in advance a maximum capture rate $c$ could
be obtained; the pureness rate is the ratio of the number of response 1 points
to the total number of points in the target region; the capture rate is the ratio
of the number of response 1 points to the total number of points in the total
regions. Then a trade-off curve between $p$ and $c$ can be constructed. Thus,
to find the bump regions is equivalent to construct the trade-off curve.
In order to make future actions easier, we adopt simpler boundary shapes for
the bumps such as the union of $z$-dimensional boxes located parallel to some
explanation variable axes; this means that we adopt the binary decision tree.
Since the conventional binary decision tree will not provide the maximum capture
rates because of its local optimizer property, some probabilistic methods would
be required. Here, we use the genetic algorithm (GA) specified to the tree structure
to accomplish this; we call this the tree GA. Using the property that the tree
GA has a tendency to provide many local maxima of the capture rates, we can estimate
the upper bound curve for the trade-off curve by using the extreme-value statistics.
However, the bump regions obtained by using the tree GA are conservative comparing
to the optimal regions because the tree GA finds the shrunk regions. Therefore,
we should know how accurate the trade-off curve using the tree GA. In this paper,
we have assessed the accuracy for the trade-off curve in typical fundamental
cases that may be observed in real customer data cases. We have found that the
proposed tree GA can construct the trade-off curve which is close to the optimal
one.
|