AEFP 45th Annual Conference

Toward a Meaningful Impact through Research, Policy & Practice

March 19-21, 2020

Beyond Parametric Models: Data Mining Approaches for Predicting the Impacts of Financial Aid on Degree Attainment

Presenter: 
Kubra Say, SUNY at Buffalo, kubrasay@buffalo.edu

Rising demand for higher education suggests that holding a college diploma has positive long-term outcomes. Given the economic and social returns of higher education, the US government has expanded existing funding sources and subsidies to make education affordable for its citizens (Hillman, 2015). However, not every student who benefits from the higher education system and its resources attains a degree. Low college completion may have damaging long-term effects on the economy, society, and individuals. Data mining techniques might help higher education institutions and policymakers address this problem by informing them in advance about students' academic standing and unmet needs. With that information, they could identify at-risk students, design appropriate policies, and apply early interventions to increase degree attainment (Raju, 2012).
There is a substantial body of research on persistence and degree completion in the higher education literature. However, few studies apply data mining techniques to these outcomes, and no studies were found that use these methods to identify the economic factors, in particular, that influence students' persistence and degree attainment. Data mining is still new in the education literature but is growing in popularity. Findings from existing studies suggest that predictions from these newer (non-parametric) techniques are more accurate than those from traditional (parametric) methods such as regression analysis and logit modeling. This study is therefore expected to contribute to the literature on degree attainment and data mining. To that end, it will answer the following questions:
1. Which predictive analysis technique (logistic regression, GAM, MARS, decision tree, or random forest) predicts students' degree completion most accurately?
2. Which characteristics are most likely to predict degree attainment in 4-year institutions?
3. Which type of financial aid increases the likelihood of degree completion in those institutions?
Data for this study will be drawn from the Education Longitudinal Study of 2002 (ELS 2002-12). The variables selected for analysis will cover students' demographics and their socio-economic and family background (gender, race/ethnicity, parents' education, socio-economic status, family structure, number of siblings, etc.), pre-college characteristics (high school GPA, sector of high school, SAT/ACT scores), college characteristics (sector of college, college GPA, major, and so on), and, most importantly, economic factors (cumulative loan amount, cumulative amount of Pell Grant aid received, the sources used to support individuals' post-secondary education, etc.).
Along with the traditional logistic regression model, four non-parametric techniques will be applied: the Generalized Additive Model (GAM), Multivariate Adaptive Regression Splines (MARS), decision trees, and random forest. The models' prediction accuracy will be compared, and the research questions will then be answered using the most accurate model.
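The comparison step described above can be illustrated with a minimal sketch. It assumes each model has already been fitted and has produced 0/1 completion predictions for the same held-out students; the function names (`accuracy`, `rank_models`), the model labels, and the toy data are all hypothetical and are not taken from the study or from ELS.

```python
def accuracy(y_true, y_pred):
    """Share of held-out students whose completion status is predicted correctly."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def rank_models(y_true, predictions):
    """Return (model name, accuracy) pairs, most accurate first."""
    scores = {name: accuracy(y_true, y_pred) for name, y_pred in predictions.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy held-out labels (1 = completed a degree, 0 = did not). Illustrative only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]

# Hypothetical predictions from three already-fitted models.
predictions = {
    "logistic_regression": [1, 0, 1, 0, 0, 1, 1, 0, 1, 0],
    "decision_tree":       [1, 1, 1, 1, 0, 0, 0, 1, 1, 0],
    "random_forest":       [1, 0, 1, 1, 0, 1, 1, 0, 1, 1],
}

ranking = rank_models(y_true, predictions)
best_model, best_accuracy = ranking[0]
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Scoring every model on the same held-out students keeps the comparison like-for-like; in practice the accuracies would more likely come from cross-validation than from a single split.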
References
Hillman, N. W. (2015). Borrowing and repaying student loans. Journal of Student Financial Aid, 45(3), 5.
Raju, D. A. (2012). Predicting student graduation in higher education using data mining models: A comparison (Doctoral dissertation, University of Alabama Libraries).

Comments

It'd be great to use your best model (or all of them) to show the fraction of students a college could identify as at-risk after the first semester. In other words, how well does each model do at helping colleges target potential interventions? It'd be nice to know Type I and Type II error rates for these predictions as well. If I'm an administrator, how much do these models help me improve the targeting of resources? That'd be the best kind of headline number to summarize the importance of this study.
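As a rough sketch of the quantities requested in this comment, the snippet below computes the fraction of students flagged, the Type I error rate (completers wrongly flagged), and the Type II error rate (non-completers missed). It treats "at-risk" as the positive class; the function names and the toy data are hypothetical and do not come from the poster.

```python
def ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def targeting_summary(at_risk, flagged):
    """Summarize how well at-risk flags line up with actual non-completion.

    at_risk: 1 if the student did not complete a degree, else 0.
    flagged: 1 if the model flagged the student as at-risk, else 0.
    """
    pairs = list(zip(at_risk, flagged))
    tp = sum(1 for a, f in pairs if a == 1 and f == 1)  # at-risk and flagged
    fp = sum(1 for a, f in pairs if a == 0 and f == 1)  # completer, wrongly flagged
    fn = sum(1 for a, f in pairs if a == 1 and f == 0)  # at-risk, missed
    tn = sum(1 for a, f in pairs if a == 0 and f == 0)  # completer, not flagged
    return {
        "fraction_flagged": ratio(tp + fp, len(pairs)),
        "type_i_rate": ratio(fp, fp + tn),   # false positive rate
        "type_ii_rate": ratio(fn, fn + tp),  # false negative rate
        "precision": ratio(tp, tp + fp),     # share of flagged students truly at-risk
    }


# Toy data for ten students. Illustrative only.
at_risk = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
flagged = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]

summary = targeting_summary(at_risk, flagged)
print(summary)
```

For an administrator, precision is probably the closest thing to a headline number here: it is the share of flagged students who would in fact go on not to complete.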

Hi Kubra, I enjoyed your poster. Additional thoughts about the relative importance of model choice and possible reasons MARS is superior would be great, but not always easy to communicate in a poster. The stepwise improvements you show on the right are helpful! Jennifer

Thank you for sharing your poster. I really appreciate seeing predictive models being developed and used. One note of caution, however. I would suggest avoiding the use of causal language. For example, “In four-year institutions, Work-Study program has positive impact on degree completion” would be better and more accurate as “In four-year institutions, Work-Study program participation is a positive predictor of degree completion”. Additionally, the 7 models don’t appear to be appreciably different from one another in predicting degree completion – ranging from 0.6947 to 0.7314 for 2-years and 0.7159 to 0.7539 for 4-years. Would it be possible to evaluate the null hypothesis that these prediction accuracies are the same – perhaps by bootstrapping your sample and deriving joint distributions of these prediction accuracies? marklong@uw.edu
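One simple way to approach the bootstrap suggested here is a paired percentile bootstrap on the accuracy difference between two models. The sketch below makes several assumptions: it resamples held-out predictions rather than refitting the models, it compares only two models instead of the joint distribution over all seven, and the function name, defaults, and toy data are hypothetical.

```python
import random


def bootstrap_accuracy_difference(y_true, pred_a, pred_b, n_resamples=2000, seed=0):
    """Paired percentile bootstrap for accuracy(model A) - accuracy(model B).

    Returns the observed difference and an approximate 95% interval.
    """
    rng = random.Random(seed)
    n = len(y_true)
    # 1 where a model predicts the student correctly, 0 otherwise.
    correct_a = [int(t == p) for t, p in zip(y_true, pred_a)]
    correct_b = [int(t == p) for t, p in zip(y_true, pred_b)]
    observed = (sum(correct_a) - sum(correct_b)) / n

    diffs = []
    for _ in range(n_resamples):
        # Resample students with replacement, keeping both models' results paired.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    return observed, (lower, upper)


# Toy held-out labels and predictions. Illustrative only.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
pred_a = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
pred_b = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

observed, (lower, upper) = bootstrap_accuracy_difference(y_true, pred_a, pred_b)
print(observed, lower, upper)
```

If the interval contains zero, the data do not rule out equal accuracy for the two models. Pairing matters because both models are scored on the same students, so their errors are correlated.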
