Informative Sampling For Large Unbalanced Data Sets

Title: Informative Sampling For Large Unbalanced Data Sets
Publication Type: Conference Paper and Presentation
Year of Publication: 2008
Authors: Lu, Z, Rughani, AI, Tranmer, BI, Bongard, J
Conference Name: 4th Workshop on Medical Applications of Genetic and Evolutionary Computation at GECCO 2008
Date Published: 2008

Selective sampling is a form of active learning that can
reduce the cost of training by drawing only informative data
points into the training set. This selected training set is
expected to contain more information for modeling than a
randomly sampled one, thus making modeling faster and more
accurate. We introduce a novel approach to selective
sampling, which is derived from the Estimation-Exploration
Algorithm (EEA). The EEA is a coevolutionary algorithm that
uses model disagreement to determine the significance of a
training datum, and evolves a set of models only on the
selected data. The algorithm in this paper trains a
population of Artificial Neural Networks (ANNs) on the
training set, and uses their disagreement to seek new data
for the training set. A medical data set called the National
Trauma Data Bank (NTDB) is used to test the algorithm.
Experiments show that the algorithm outperforms the
equivalent algorithm both when it uses randomly selected
data and when it samples evenly from each class. Finally,
the selected training data reveals which features most
affect outcome, allowing for both improved modeling and
understanding of the processes that gave rise to the data.
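
The disagreement-driven selection step described in the abstract can be illustrated with a minimal sketch. This is not the paper's EEA implementation: the abstract does not say how disagreement is measured, so the sketch assumes the variance of the models' predicted probabilities, and it uses hand-written threshold functions as stand-ins for trained ANNs. The names `Model`, `disagreement`, and `select_most_informative` are hypothetical.

```python
from statistics import pvariance
from typing import Callable, List, Sequence

# Hypothetical model type: maps a feature vector to P(class = 1).
Model = Callable[[Sequence[float]], float]


def disagreement(models: List[Model], x: Sequence[float]) -> float:
    """Variance of the models' predictions for one candidate datum.

    Assumption: variance is used here as the disagreement measure;
    the abstract does not specify one.
    """
    return pvariance([m(x) for m in models])


def select_most_informative(models: List[Model],
                            pool: List[Sequence[float]]) -> int:
    """Return the index of the pool datum on which the models disagree most."""
    return max(range(len(pool)), key=lambda i: disagreement(models, pool[i]))


# Toy population: three threshold functions standing in for trained ANNs.
models: List[Model] = [
    lambda x: 1.0 if x[0] > 0.3 else 0.0,
    lambda x: 1.0 if x[0] > 0.5 else 0.0,
    lambda x: 1.0 if x[0] > 0.7 else 0.0,
]

# Unlabeled candidates; only the middle one splits the population.
pool = [[0.1], [0.6], [0.9]]
chosen = select_most_informative(models, pool)
```

In a full loop, the chosen datum would be moved into the training set and the models retrained before the next selection round. Data on which all models already agree score zero and are never preferred, which is one way such a scheme can avoid oversampling the majority class of an unbalanced data set.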
