Data stream summaries capture the essential characteristics of the stream in a manageable amount of space. In this thesis, we evaluate three data stream summarization methods for applications in machine learning. One summary is based on equal width binning and is very simple to implement. The other two are based on approximate quantiles, one of which allows us to make information entropy computations within specified error bounds with high probability. Using popular techniques from machine learning—information gain, gain ratio, and Naive Bayes classification—we evaluate these summaries for accuracy and memory utilization against offline computations that assume all data is available for repeated access. Our results indicate that while summarization based on equal width binning performs very well on data streams with a stationary distribution, its performance degrades on certain non-stationary distributions. In contrast, summarizations based on approximate quantiles are consistently close to the offline values for a variety of stationary and non-stationary distributions.
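To make the equal-width-binning approach concrete, the following is a minimal illustrative sketch (not the thesis's implementation): a stream is summarized as per-bin counts over a fixed range, and the Shannon entropy of the binned distribution is computed from those counts alone. All function names and the bin-clamping choice here are assumptions for illustration.

```python
import math

def equal_width_bins(low, high, k):
    """Create a summary of k equal-width bin counters over [low, high)."""
    return {"low": low, "width": (high - low) / k, "counts": [0] * k}

def update(summary, x):
    """Add one stream value to its bin (clamping out-of-range values
    to the edge bins, an illustrative choice)."""
    i = int((x - summary["low"]) / summary["width"])
    i = max(0, min(len(summary["counts"]) - 1, i))
    summary["counts"][i] += 1

def entropy(summary):
    """Shannon entropy (in bits) of the binned distribution."""
    n = sum(summary["counts"])
    return -sum((c / n) * math.log2(c / n)
                for c in summary["counts"] if c > 0)

# Summarize a small stream with 4 bins over [0, 8)
s = equal_width_bins(0.0, 8.0, 4)
for x in [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]:
    update(s, x)
print(s["counts"])   # each bin receives two values: [2, 2, 2, 2]
print(entropy(s))    # uniform over 4 bins -> 2.0 bits
```

The summary's memory footprint is fixed by the bin count k regardless of stream length, which is what makes the method attractive; its sensitivity to non-stationary distributions stems from the bin boundaries being fixed up front.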
| Contributor | Greg Madey, Committee Member |
| Contributor | Amitabh Chaudhary, Committee Chair |
| Contributor | Nitesh Chawla, Committee Member |
| Degree Level | Master's Thesis |
| Degree Discipline | Computer Science and Engineering |
| Departments and Units | |