Data Stream Summarization for Machine Learning

Master's Thesis

Abstract

Data stream summaries capture the essential characteristics of a stream in a manageable amount of space. In this thesis, we evaluate three data stream summarization methods for applications in machine learning. One summary is based on equal-width binning and is very simple to implement. The other two are based on approximate quantiles, one of which allows us to compute information entropy within specified error bounds with high probability. Using popular techniques from machine learning (information gain, gain ratio, and Naive-Bayes classification), we evaluate these summaries for accuracy and memory utilization against offline computations that assume all data is available for repeated access. Our results indicate that while summarization based on equal-width binning performs very well on data streams with a stationary distribution, its performance degrades on certain non-stationary distributions. In contrast, summarizations based on approximate quantiles remain consistently close to the offline values across a variety of stationary and non-stationary distributions.
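To make the first summarization method concrete, the following is a minimal sketch of an equal-width binning stream summary. The class name, interface, and the assumption of a known value range `[lo, hi)` are illustrative choices, not the thesis's actual implementation; the key point is that the summary uses O(k) memory regardless of stream length, which is also why it can struggle when a non-stationary stream drifts outside the chosen range.

```python
class EqualWidthHistogram:
    """Summarize a numeric stream with k equal-width bins over [lo, hi).

    Memory is O(k) no matter how many items arrive; the bin frequencies
    can then stand in for the full data in entropy-style computations.
    """

    def __init__(self, lo, hi, k):
        self.lo, self.hi, self.k = lo, hi, k
        self.width = (hi - lo) / k
        self.counts = [0] * k  # one counter per bin

    def add(self, x):
        # Map the value to its bin; clamp out-of-range values into the
        # edge bins (one simple policy when the range is fixed up front).
        i = int((x - self.lo) / self.width)
        i = min(max(i, 0), self.k - 1)
        self.counts[i] += 1

    def frequencies(self):
        # Relative frequencies, usable as probability estimates.
        n = sum(self.counts)
        return [c / n for c in self.counts] if n else list(self.counts)
```

For example, streaming the values `0.5, 1.5, 9.9, 12.0` into `EqualWidthHistogram(0.0, 10.0, 5)` places the first two in bin 0 and the last two (12.0 clamped) in bin 4, yielding frequencies `[0.5, 0, 0, 0, 0.5]`.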

Attributes

URN
  • etd-04212006-102708

Author Alec Pawling
Advisor Amitabh Chaudhary
Contributor Greg Madey, Committee Member
Contributor Amitabh Chaudhary, Committee Chair
Contributor Nitesh Chawla, Committee Member
Degree Level Master's Thesis
Degree Discipline Computer Science and Engineering
Degree Name MSCSE
Defense Date
  • 2006-03-20

Submission Date 2006-04-21
Country
  • United States of America

Subject
  • feature selection

  • Naive-Bayes

  • Data streams

  • machine learning

Publisher
  • University of Notre Dame

Language
  • English

Record Visibility and Access Public
Content License
  • All rights reserved

Departments and Units

Files
