In difficult classification problems, where z-dimensional points
must be assigned to two groups with 0-1 responses and the data
structure is messy, we look for the denser regions of favorable
customers (response 1) instead of the boundaries that separate
the two groups. Such regions are called bumps, and finding their
boundaries is called bump hunting. The main objective of this paper
is to find the largest bump region subject to a specified ratio of
the number of response-1 points to the total number of points. We
may then obtain a trade-off curve between the number of response-1
points and the specified ratio. The decision tree method with the
Gini index provides simply shaped boundaries for the bumps if the
marginal density for response 1 has a rather simple or monotonic
shape. Because searching for the optimal tree is computationally
expensive owing to the NP-hardness of the problem, random search
methods such as a genetic algorithm adapted to the tree are useful.
Because this search, unlike ordinary genetic algorithm searches,
ends in many local maxima, extreme-value statistics are useful for
estimating the global optimum of the number of captured points; this
also guarantees the accuracy of the semi-optimal solution with simple
descriptive rules.
This combination of genetic algorithm search and extreme-value
statistics is new. We apply the method to an artificial messy data
set that mimics a real customer database and obtain a successful
result. The reliability of the solution is also discussed.
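
To make the objective concrete, the following is a minimal sketch, not the method of the paper: it replaces the genetic algorithm tree search with an exhaustive search over intervals on a single feature, and it assumes that the specified ratio is measured inside the candidate region. The names `box_stats`, `best_interval_1d`, and `min_ratio`, as well as the toy data, are hypothetical.

```python
from typing import List, Optional, Sequence, Tuple

Interval = Tuple[float, float]


def box_stats(points: Sequence[Sequence[float]],
              labels: Sequence[int],
              box: Sequence[Interval]) -> Tuple[int, float]:
    """Return (number of response-1 points, ratio of response 1) inside an
    axis-aligned box given as one (low, high) interval per dimension."""
    inside = [y for x, y in zip(points, labels)
              if all(lo <= v <= hi for v, (lo, hi) in zip(x, box))]
    ones = sum(inside)
    ratio = ones / len(inside) if inside else 0.0
    return ones, ratio


def best_interval_1d(values: Sequence[float],
                     labels: Sequence[int],
                     min_ratio: float) -> Tuple[int, Optional[Interval]]:
    """Brute-force stand-in for the tree search (assumption): among all
    intervals [a, b] whose response-1 ratio is at least min_ratio, return
    the one that captures the most response-1 points."""
    cuts = sorted(set(values))
    points = [(v,) for v in values]
    best: Tuple[int, Optional[Interval]] = (0, None)
    for i, a in enumerate(cuts):
        for b in cuts[i:]:
            ones, ratio = box_stats(points, labels, [(a, b)])
            if ratio >= min_ratio and ones > best[0]:
                best = (ones, (a, b))
    return best


# Toy one-dimensional data (hypothetical): response 1 clusters in two places.
values = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
labels = [0, 1, 1, 0, 1, 1, 0, 0]

# Trade-off curve: captured response-1 points for each specified ratio.
curve = [(r, best_interval_1d(values, labels, r)[0]) for r in (0.5, 0.75, 1.0)]
print(curve)
```

On this toy data, raising the required ratio from 0.75 to 1.0 shrinks the best region and lowers the number of captured response-1 points, which is the trade-off described above. An exhaustive search like this is only feasible for tiny problems, which is why a random search is used in higher dimensions.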
data mining, data science, bump hunting, genetic algorithm,
extreme-value statistics, trade-off curve, decision tree, bootstrap