Informative Sampling For Large Unbalanced Data Sets

Title	Informative Sampling For Large Unbalanced Data Sets
Publication Type	Conference Paper and Presentation
Year of Publication	2008
Authors	Lu, Z, Rughani, AI, Tranmer, BI, Bongard, J
Conference Name	4th Workshop on Medical Applications of Genetic and Evolutionary Computation at GECCO 2008
Date Published	2008
Abstract	Selective sampling is a form of active learning which can reduce the cost of training by only drawing informative data points into the training set. This selected training set is expected to contain more information for modeling compared to random sampling, thus making modeling faster and more accurate. We introduce a novel approach to selective sampling, which is derived from the Estimation-Exploration Algorithm (EEA). The EEA is a coevolutionary algorithm that uses model disagreement to determine the significance of a training datum, and evolves a set of models only on the selected data. The algorithm in this paper trains a population of Artificial Neural Networks (ANN) on the training set, and uses their disagreement to seek new data for the training set. A medical data set called the National Trauma Data Bank (NTDB) is used to test the algorithm. Experiments show that the algorithm outperforms the equivalent algorithm using randomly-selected data and sampling evenly from each class. Finally, the selected training data reveals which features most affect outcome, allowing for both improved modeling and understanding of the processes that gave rise to the data.
URL	http://www.cs.uvm.edu/~jbongard/papers/2008_GECCO_Lu.pdf
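
The disagreement-driven selection described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: small logistic models trained on bootstrap resamples stand in for the evolved population of ANNs, disagreement is measured as the variance of predicted probabilities, and every name here (`train_logistic`, `committee_disagreement`, `select_informative`) is hypothetical.

```python
import numpy as np


def train_logistic(X, y, rng, epochs=200, lr=0.5):
    """Fit one small logistic model by gradient descent from a random start.

    Assumption: a logistic model stands in for an ANN to keep the sketch short.
    """
    w = rng.normal(scale=0.1, size=X.shape[1] + 1)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - y) / len(y)
    return w


def predict_proba(w, X):
    """Predicted probability of the positive class for each row of X."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return 1.0 / (1.0 + np.exp(-Xb @ w))


def committee_disagreement(committee, X_pool):
    """Variance of the predicted probabilities across models, per pool point."""
    probs = np.stack([predict_proba(w, X_pool) for w in committee])
    return probs.var(axis=0)


def select_informative(X_train, y_train, X_pool, n_models=5, n_pick=1, seed=0):
    """Train a committee on bootstrap resamples and rank pool points.

    Returns the indices of the n_pick pool points the models disagree on most,
    together with the disagreement score of every pool point.
    """
    rng = np.random.default_rng(seed)
    committee = []
    for _ in range(n_models):
        idx = rng.integers(0, len(y_train), size=len(y_train))  # bootstrap
        committee.append(train_logistic(X_train[idx], y_train[idx], rng))
    scores = committee_disagreement(committee, X_pool)
    picked = np.argsort(scores)[::-1][:n_pick]
    return picked, scores
```

In an active-learning loop, the picked points would be labeled, moved from the pool into the training set, and the committee retrained. Random sampling, the baseline named in the abstract, would replace the `argsort` step with a uniform draw from the pool.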

Status:

Published

Attributable Grant:

CSYS

Grant Year:

Year1
