Boosted Decision Trees for Multivariate, Hierarchically Clustered, and Longitudinal Data

Miller, Patrick J.

doi:10.7274/qz20sq89x8q

MillerPJ032017D.pdf (5.83 MB)

thesis

Posted on 2017-03-28 by Patrick J. Miller

The problem of finding structure in big data sets is becoming increasingly relevant to psychologists as it becomes easier and cheaper to collect data on human behavior. This dissertation focuses on identifying important structural features, such as main effects, nonlinear effects, and interactions, in big data sets when the number of predictors is large. In general, this goal can be referred to as exploratory regression analysis. Exploratory regression analysis is beneficial because its results suggest testable hypotheses, limit the number of plausible models, and help avoid errors in model specification. It is usually carried out with basic data visualization techniques, with simple statistical models, or by fitting a number of parametric models and selecting the best among them. However, these procedures can require strong assumptions and may not be feasible when the number of predictors is large.

Gradient tree boosting (Friedman, 2001) is a promising alternative for exploratory regression analysis because it builds an interpretable model that approximates nonlinear effects and interactions among predictors without a priori specification. However, it is not clear how to build and interpret gradient tree boosting models in the context of multivariate, longitudinal, and hierarchically clustered data commonly found in psychological research.
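
As a rough illustration of the general technique only, the sketch below fits gradient tree boosting with one-split trees (stumps) under squared-error loss to a toy data set with a nonlinear effect. It is not the dissertation's procedure: the function names, the toy data, and the settings (200 trees, learning rate 0.1) are assumptions chosen for the example, and the multivariate and mixed effects extensions are not shown.

```python
# Generic gradient tree boosting with stumps and squared-error loss.
# Illustrative sketch; all names and settings here are hypothetical.

def fit_stump(X, residuals):
    """Find the single split (feature, threshold) that minimizes squared error."""
    best, best_sse = None, float("inf")
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        # Candidate thresholds: midpoints between consecutive distinct values.
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((r - lm) ** 2 for r in left)
                   + sum((r - rm) ** 2 for r in right))
            if sse < best_sse:
                best_sse, best = sse, (j, t, lm, rm)
    return best  # None if no feature has two distinct values


def predict_stump(stump, row):
    j, t, lm, rm = stump
    return lm if row[j] <= t else rm


def boost(X, y, n_trees=200, learning_rate=0.1):
    """Start from the mean of y, then repeatedly fit a stump to the residuals."""
    intercept = sum(y) / len(y)
    pred = [intercept] * len(y)
    stumps = []
    for _ in range(n_trees):
        # For squared-error loss, the negative gradient is the residual.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        stumps.append(stump)
        pred = [pi + learning_rate * predict_stump(stump, row)
                for pi, row in zip(pred, X)]
    return intercept, stumps, learning_rate


def predict(model, row):
    intercept, stumps, lr = model
    return intercept + lr * sum(predict_stump(s, row) for s in stumps)


def mse(model, X, y):
    return sum((yi - predict(model, row)) ** 2 for row, yi in zip(X, y)) / len(y)


# Toy data (assumed): the outcome depends quadratically on the first predictor;
# the second predictor is irrelevant.
X = [[i / 10 - 2, (i * 7) % 5] for i in range(41)]
y = [row[0] ** 2 for row in X]

baseline = boost(X, y, n_trees=0)  # intercept-only model
model = boost(X, y)                # boosted stumps
print(mse(baseline, X, y), mse(model, X, y))
```

No quadratic term is specified anywhere in the code; the curvature is approximated by the sum of many small step functions. Counting how often each predictor is chosen for a split gives a simple, informal indication of which predictors the model relies on.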

This dissertation develops two procedures for estimating gradient tree boosting models for multivariate, longitudinal, and hierarchically clustered data. Multivariate tree boosting selects predictors that explain covariance in multiple outcomes. Mixed effects tree boosting takes hierarchically clustered data into account by treating a grouping variable as random. Longitudinal data can be modeled in boosted decision trees by including time as a candidate for splitting in mixed effects tree boosting. These procedures are illustrated by application to real data. Simulations demonstrate that the methods balance true and false positive rates when selecting variables and achieve low prediction error at sample and effect sizes commonly observed in psychology.

History

Date Created

2017-03-28

Date Modified

2018-10-30

Defense Date

2017-03-24

Research Director(s)

Gitta Lubke

Degree

Doctor of Philosophy

Degree Level

Doctoral Dissertation

Program Name

Psychology

Keywords

nonparametric regression, nonlinear, Boosted decision trees

Licence

Public Domain Mark 1.0 (No Copyright)
