University of Notre Dame
Browse

File(s) under permanent embargo

Data Stream Summarization for Machine Learning

thesis
posted on 2006-04-21, 00:00 authored by Alec Pawling
Data stream summaries capture the essential characteristics of thestream in a managable amount of space.In this thesis, we evaluatethree data stream summarization methods for applications in machinelearning. One summary is based on equal width binning and is verysimple to implement. The other two are based on approximate quantiles,one of which allows us to make information entropy computations within specified errorbounds with high probability. Using popular techniques frommachine learning---information gain, gain ratio, andNaive-Bayes classification---we evaluate thesesummaries for accuracy and memory utilization against offlinecomputations that assume all data is available for repeatedaccess. Our results indicate that while summarization based on equalwidth binning performs very well on data streams with a stationarydistribution,performance degrades on certain non-stationary distributions. In constrast,summarizations based on approximate quantiles are consistently close to the offline values for a variety of stationary and non-stationarydistributions.

History

Date Modified

2017-06-02

Research Director(s)

Amitabh Chaudhary

Committee Members

Greg Madey Nitesh Chawla

Degree

  • Master of Science in Computer Science and Engineering

Degree Level

  • Master's Thesis

Language

  • English

Alternate Identifier

etd-04212006-102708

Publisher

University of Notre Dame

Program Name

  • Computer Science and Engineering

Usage metrics

    Masters Theses

    Categories

    No categories selected

    Exports

    RefWorks
    BibTeX
    Ref. manager
    Endnote
    DataCite
    NLM
    DC