TQ6: Ridge Regression
This exercise investigates ridge regression for predicting prize winnings ($1000s) from a variety of performance and success statistics for LPGA golfers in 2009. The attached table (see the Excel file) contains the data. The matrix X contains 11 predictor variables:
1. Average drive (yards)
2. Percent of fairways hit
3. Percent of greens reached in regulation
4. Average putts per round
5. Percent of sand saves (2 shots to hole)
6. Tournaments played in
7. Green in regulation putts per hole
8. Completed tournaments
9. Average percentile in tournaments (high is good)
10. Rounds completed
11. Average strokes per round
The column vector y contains the output variable, prize winnings ($1000s). This assignment will look at the predictive ability of ridge regression and compare it to the methods we've investigated previously in TQ3, TQ4, and TQ5.
1. Briefly describe ridge regression, including its benefits for ill-posed problems; be sure to discuss the condition number.
2. Demonstrate the value of ridge regression on an extremely ill-conditioned simulated data set (such as the one shown below).
3. Divide the data into training, test, and validation sets. You must use the same training, test, and validation splits used in TQ3, TQ4, and TQ5.
4. Determine the appropriate regularization coefficient through two methods: the L-curve method and cross-validation. Compare the results of the two methods and explain any differences.
5. Select the best ridge regression model from the two you trained in the prior step. Explain why it is the best model in terms of both accuracy and stability.
6. Compare the validation performance of your ridge regression model to the PLS, PCR, and regression models. Comment on the results.
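As a starting point for tasks 1 and 2, the sketch below builds a hypothetical two-predictor simulated data set with nearly collinear columns (the column construction, noise level, and the regularization value lam = 1.0 are all illustrative assumptions, not the assigned data). It shows how adding lam*I to X'X drops the condition number by many orders of magnitude and stabilizes the coefficient estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-6 * rng.normal(size=n)      # nearly identical column -> ill-conditioned X
X = np.column_stack([x1, x2])
y = X @ np.array([1.0, 2.0]) + 0.1 * rng.normal(size=n)

lam = 1.0                                 # assumed regularization coefficient
XtX = X.T @ X
print("cond(X'X):        ", np.linalg.cond(XtX))                       # enormous
print("cond(X'X + lam*I):", np.linalg.cond(XtX + lam * np.eye(2)))     # modest

# Ordinary least squares (normal equations) vs. ridge regression
b_ols = np.linalg.solve(XtX, X.T @ y)                      # unstable: wild coefficients
b_ridge = np.linalg.solve(XtX + lam * np.eye(2), X.T @ y)  # stable: weight split evenly
print("OLS coefficients:  ", b_ols)
print("Ridge coefficients:", b_ridge)
```

Because the two columns are nearly identical, ridge shares the combined effect (roughly 3) evenly between them, while the OLS coefficients can be arbitrarily large with opposite signs.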
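For task 3, a reproducible 60/20/20 split can be made with a fixed seed so the same indices are reused across assignments (the sample size, proportions, and seed below are placeholders; the actual split indices from TQ3–TQ5 should be reused):

```python
import numpy as np

rng = np.random.default_rng(42)           # fixed seed makes the split reproducible
n = 150                                   # placeholder sample size
idx = rng.permutation(n)                  # shuffle row indices once
n_train, n_test = int(0.6 * n), int(0.2 * n)
train_idx = idx[:n_train]
test_idx = idx[n_train:n_train + n_test]
val_idx = idx[n_train + n_test:]          # remainder goes to validation
# Then, e.g.: X_train, y_train = X[train_idx], y[train_idx]
print(len(train_idx), len(test_idx), len(val_idx))
```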
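Task 4 can be sketched as follows on synthetic data (the lambda grid, fold count, and data generation are assumed choices). The L-curve traces residual norm against solution norm over a lambda grid, and the corner of that curve marks the trade-off point; 5-fold cross-validation instead picks the lambda minimizing mean held-out squared error:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 80, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.5 * rng.normal(size=n)
lams = np.logspace(-4, 2, 30)             # assumed regularization grid

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# L-curve: residual norm vs. solution norm for each lam (plot on log-log axes
# and pick the corner by eye or by maximum curvature)
res_norm, sol_norm = [], []
for lam in lams:
    b = ridge_fit(X, y, lam)
    res_norm.append(np.linalg.norm(X @ b - y))
    sol_norm.append(np.linalg.norm(b))

# 5-fold cross-validation: accumulate held-out error per lam, take the minimizer
folds = np.array_split(rng.permutation(n), 5)
cv_err = np.zeros(len(lams))
for hold in folds:
    tr = np.setdiff1d(np.arange(n), hold)
    for i, lam in enumerate(lams):
        b = ridge_fit(X[tr], y[tr], lam)
        cv_err[i] += np.mean((X[hold] @ b - y[hold]) ** 2)
lam_cv = lams[np.argmin(cv_err)]
print("CV-selected lambda:", lam_cv)
```

Note that the L-curve criterion balances fit against coefficient size, while cross-validation targets predictive error directly, so the two lambdas need not agree; explaining that discrepancy is the point of the comparison in task 4.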