Operationalizing Classification in Applied Machine Learning
The increasing diversity of data sources has propelled machine learning into an equally diverse set of application domains. Across these applications, a key task is classification. While contemporary approaches achieve impressive predictive performance on pre-structured datasets, surprisingly little work addresses how raw data should be structured to best serve the underlying domain problem. The state of the art in domain-driven data mining, Actionable Knowledge Discovery, merely acts as a wrapper that transforms domain data into feature matrices and class labels. To address these gaps in existing frameworks, we propose the Operationalized Data Science Paradigm (ODSP). This paradigm provides a formalized framework for structuring data and pipelines, time-censoring, Net Present Value considerations, interpretability, and regulatory compliance, all using domain-driven insights. We demonstrate the role of domain-driven problem and pipeline design, in the form of deployed solutions, across the diverse domains of cost-sensitive classification, online video content, Massive Open Online Courses (MOOCs), and auto insurance. For each of these use cases, we provide a comparative ablation analysis to highlight the role of ODSP in ensuring operational viability. As a result, we show how the domain influences which questions we ask of the data and how we should interpret the answers.
History
Alt Title
Maturing Classification from Prototypes to Production in Applied Machine Learning
Date Created
2017-11-27
Date Modified
2018-04-18
Defense Date
2017-05-05
Research Director(s)
Nitesh Chawla
Committee Members
Sidney D'Mello, Tim Weninger, Reid Johnson
Degree
- Doctor of Philosophy
Degree Level
- Doctoral Dissertation
Program Name
- Computer Science and Engineering