posted on 2006-04-21, 00:00authored byAlec Pawling
Data stream summaries capture the essential characteristics of thestream in a managable amount of space.In this thesis, we evaluatethree data stream summarization methods for applications in machinelearning. One summary is based on equal width binning and is verysimple to implement. The other two are based on approximate quantiles,one of which allows us to make information entropy computations within specified errorbounds with high probability. Using popular techniques frommachine learning---information gain, gain ratio, andNaive-Bayes classification---we evaluate thesesummaries for accuracy and memory utilization against offlinecomputations that assume all data is available for repeatedaccess. Our results indicate that while summarization based on equalwidth binning performs very well on data streams with a stationarydistribution,performance degrades on certain non-stationary distributions. In constrast,summarizations based on approximate quantiles are consistently close to the offline values for a variety of stationary and non-stationarydistributions.
History
Date Modified
2017-06-02
Research Director(s)
Amitabh Chaudhary
Committee Members
Greg Madey
Nitesh Chawla
Degree
Master of Science in Computer Science and Engineering