Title
Boundary Detection for Bumps using the Gini's Index in Messy Classification Problems

Authors

H. Hirose, T. Yukizane, E. Miyano


Source

The 3rd International Conference on Cybernetics and Information Technologies, Systems and Applications: CITSA 2006


Abstract
Suppose that we want to classify points with 0-1 responses into two
groups in a $z$-dimensional space, where the $z$ dimensions correspond to
$z$ explanatory variables. Many classification tools that use the
misclassification rate to evaluate how well the discrimination is
achieved have been proposed, and they have played a key role in
statistical learning and data mining. In the real world, however, we
often encounter cases in which the misclassification rate remains
considerably high no matter how carefully we search for the
discriminating boundaries, particularly when we want to capture the
points with relatively simple rules. In such a messy case, the
optimization criterion in classification shifts to capturing as many
points of response 1, the response of interest, as possible at a
specified pureness rate. A region that achieves this is called a bump or
a hotspot. Our primary objective is to find the bumps.
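
As a minimal sketch of the two quantities involved, the Python snippet
below computes a pureness rate and a capture rate for a candidate box.
The definitions are assumptions, since the abstract does not state them:
the pureness rate is taken as the share of response 1 points among all
points inside the box, and the capture rate as the share of all response
1 points that fall inside the box. The function names and the toy data
are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
Box = Sequence[Tuple[float, float]]  # one (low, high) interval per explanatory variable


def in_box(x: Point, box: Box) -> bool:
    """Return True if every coordinate of x lies inside its [low, high] interval."""
    return all(lo <= xi <= hi for xi, (lo, hi) in zip(x, box))


def pureness_and_capture(
    points: List[Point], labels: List[int], box: Box
) -> Tuple[float, float]:
    """Assumed definitions (not taken from the paper):
    pureness = (# response 1 inside box) / (# points inside box)
    capture  = (# response 1 inside box) / (# response 1 overall)
    """
    inside = [y for x, y in zip(points, labels) if in_box(x, box)]
    ones_inside = sum(inside)
    total_ones = sum(labels)
    pureness = ones_inside / len(inside) if inside else 0.0
    capture = ones_inside / total_ones if total_ones else 0.0
    return pureness, capture


# Toy data: six points in two dimensions, with a candidate box in the middle.
points = [(0.1, 0.1), (0.4, 0.5), (0.5, 0.6), (0.6, 0.4), (0.9, 0.9), (0.5, 0.5)]
labels = [0, 1, 1, 0, 1, 1]
box = [(0.3, 0.7), (0.3, 0.7)]
pureness, capture = pureness_and_capture(points, labels, box)
```

Under these assumed definitions, enlarging the box tends to raise the
capture rate and lower the pureness rate, which is the trade-off that a
specified pureness rate is meant to control.
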
To make the boundary information of the bumps easy to use, it is
beneficial for the bumps to have simple boundary shapes, such as a
$z$-dimensional box whose sides are parallel to the explanatory variable
axes. The problem, however, is how to find such a boundary for a bump.
Decision tree algorithms find the splitting point for an explanatory
variable so as to construct purer sub-regions, using the entropy or the
Gini index criterion, step by step from the top node to the bottom
leaves. Since producing purer regions is equivalent to excluding such
regions from the rest, this splitting criterion may also be applicable
to producing a denser region for response 1 that still includes response
0 points to some extent. This paper examines this possibility.
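
The splitting step described above can be sketched in Python as follows.
This is an illustrative reading of a standard Gini-based split search on
a single explanatory variable, not the authors' implementation; the
names `gini` and `best_split` and the toy data are hypothetical.

```python
from typing import List, Optional, Tuple


def gini(labels: List[int]) -> float:
    """Gini impurity of a node with 0-1 labels: 2 * p * (1 - p)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2.0 * p * (1.0 - p)


def best_split(xs: List[float], ys: List[int]) -> Tuple[Optional[float], float]:
    """Scan every midpoint between consecutive distinct x values and return
    the threshold that minimizes the size-weighted Gini impurity of the
    two resulting sub-regions."""
    pairs = sorted(zip(xs, ys))
    n = len(pairs)
    best_t, best_score = None, float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold can separate identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_t = (pairs[i - 1][0] + pairs[i][0]) / 2.0
            best_score = score
    return best_t, best_score


# Toy data: response 1 is concentrated on the right-hand side of the axis.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
ys = [0, 0, 0, 1, 1, 1, 0, 1]
threshold, score = best_split(xs, ys)
```

On this toy data the chosen threshold separates a pure response 0
region on the left from a denser, but not pure, response 1 region on the
right, which is the kind of region the paragraph above has in mind.
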
Mimicking real data from a customer database of a correspondence
course in Japan, we investigate fundamental data cases in which the
marginal distributions for response 1 have rather simple shapes, such as
monotonic or unimodal, and those for response 0 are almost uniform. The
unimodal and monotonic cases are studied by using normal distributions
located at the center and at a corner of the uniform distribution,
respectively. Under such restricted conditions, we have found that the
Gini index can detect the boundaries of the bumps much as the human eye
does.
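
The two simulated settings could be generated along the following lines.
This is only a sketch under stated assumptions: the sample sizes, the
standard deviation, the two-dimensional unit square, the truncation of
the normal distribution to that square, and the name `simulate` are all
illustrative choices that do not come from the paper.

```python
import random
from typing import List, Tuple


def sample_truncated_normal(rng: random.Random, mean: float, sd: float,
                            low: float = 0.0, high: float = 1.0) -> float:
    """Draw from N(mean, sd^2), rejecting values outside [low, high]."""
    while True:
        v = rng.gauss(mean, sd)
        if low <= v <= high:
            return v


def simulate(case: str, n1: int = 300, n0: int = 700, sd: float = 0.15,
             dim: int = 2, seed: int = 0) -> Tuple[List[List[float]], List[int]]:
    """Response 0 is uniform on the unit cube. Response 1 is normal,
    centered in the middle of the cube ("unimodal") or at the corner at
    the origin ("monotonic"), and truncated to the cube (an assumption)."""
    if case not in ("unimodal", "monotonic"):
        raise ValueError("case must be 'unimodal' or 'monotonic'")
    rng = random.Random(seed)
    center = 0.5 if case == "unimodal" else 0.0
    ones = [[sample_truncated_normal(rng, center, sd) for _ in range(dim)]
            for _ in range(n1)]
    zeros = [[rng.random() for _ in range(dim)] for _ in range(n0)]
    return ones + zeros, [1] * n1 + [0] * n0


uni_points, uni_labels = simulate("unimodal")
mono_points, mono_labels = simulate("monotonic")
```

Truncating a normal distribution centered at the corner leaves only its
decreasing tail inside the cube, which gives the monotonic marginal
shape, while centering it in the middle gives the unimodal shape.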

Key Words
Bump Hunting, Capture Rate, Pureness Rate, Boundary Detection, Decision Tree.

Citation

 

Times Cited in Web of Science: 2

Times Cited in Google Scholar:

Cited in Books:

WoS: INFORMATION-AN INTERNATIONAL INTERDISCIPLINARY JOURNAL Volume: 14 Issue: 10 Pages: 3409-3424 Published: OCT 2011; THIRD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING: WKDD 2010, PROCEEDINGS Pages: 597-600

Others: