# Regression model using MATLAB

The attached table (see the Excel file) contains performance and success statistics for LPGA golfers in 2009. The matrix X contains 11 predictor variables:

1. Average drive (yards)

2. Percent of fairways hit

3. Percent of greens reached in regulation

4. Average putts per round

5. Percent of sand saves (2 shots to hole)

6. Tournaments played in

7. Green in regulation putts per hole

8. Completed tournaments

9. Average percentile in tournaments (high is good)

10. Rounds completed

11. Average strokes per round

The column vector y contains the output variable, prize winnings ($1000s). Using X and y, complete the following tasks:

1. Divide the data into training, test, and validation data sets. Describe how you divided the data into these three sets and why that division is appropriate.

2. Develop a linear regression model to predict the prize winnings ($1000s) using the 11 other variables. Then, select a subset of these variables and develop a competing model. Use the correlation coefficients to explain why this representation works. Remember to pad your inputs with a column of ones to allow for a non-zero y-intercept.

3. Find the single best predictor of prize winnings ($1000s) and the pair of predictors that performs best.

4. Identify or derive any additional inputs that may be good predictors of prize winnings ($1000s) based on nonlinear relationships among the available predictors. Use linear-in-parameters regression with the nonlinear term(s) to evaluate any model improvement.

5. Compare the performance of all models using the root mean squared error (RMSE) on the test data set. Select the best model and explain why it is the best.

6. Find the validation error of your best model.
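
The tasks above can be sketched end to end in MATLAB. What follows is only a minimal illustration under stated assumptions, not a reference solution. The Excel data is not reproduced here, so `X` and `y` are synthetic stand-ins. The 60/20/20 random split, the helper names (`addOnes`, `fitOLS`, `rmse`, `subsetRmse`), and the squared term on predictor 9 are all hypothetical choices made for the example. The least-squares fit uses the backslash operator on the training rows, with a column of ones prepended for the intercept.

```matlab
% ASSUMPTION: synthetic stand-in data. Replace with X (n-by-11) and y (n-by-1)
% loaded from the Excel file that accompanies the assignment.
rng(0);                                   % fixed seed so the split is reproducible
n = 150;  p = 11;                         % hypothetical sample size, 11 predictors
X = randn(n, p);
y = 50 + X(:, [3 9]) * [20; 35] + 5 * randn(n, 1);

% Task 1: random 60/20/20 split (the proportions are an assumption).
idx    = randperm(n);
nTrain = round(0.6 * n);
nTest  = round(0.2 * n);
iTrain = idx(1:nTrain);
iTest  = idx(nTrain+1 : nTrain+nTest);
iVal   = idx(nTrain+nTest+1 : end);

% Helpers: pad with a column of ones, fit by least squares, compute RMSE.
addOnes = @(A) [ones(size(A, 1), 1), A];
fitOLS  = @(A, b) addOnes(A) \ b;
rmse    = @(yTrue, yHat) sqrt(mean((yTrue - yHat).^2));

% Task 2: full model on all 11 predictors, scored on the test set.
betaFull = fitOLS(X(iTrain, :), y(iTrain));
rmseFull = rmse(y(iTest), addOnes(X(iTest, :)) * betaFull);

% Correlation of each predictor with y (training rows only), to guide subset choice.
r = zeros(1, p);
for j = 1:p
    c    = corrcoef(X(iTrain, j), y(iTrain));
    r(j) = c(1, 2);
end

% Task 3: exhaustive search for the best single predictor and the best pair,
% each fitted on the training set and ranked by test RMSE.
subsetRmse = @(cols) rmse(y(iTest), ...
    addOnes(X(iTest, cols)) * fitOLS(X(iTrain, cols), y(iTrain)));
rmse1 = arrayfun(@(j) subsetRmse(j), 1:p);
[bestRmse1, best1] = min(rmse1);
pairs = nchoosek(1:p, 2);
rmse2 = arrayfun(@(k) subsetRmse(pairs(k, :)), 1:size(pairs, 1));
[bestRmse2, k2] = min(rmse2);
bestPair = pairs(k2, :);

% Task 4: add one derived nonlinear input (hypothetical choice: predictor 9 squared).
% The model stays linear in its parameters.
Xnl    = [X, X(:, 9).^2];
betaNl = fitOLS(Xnl(iTrain, :), y(iTrain));
rmseNl = rmse(y(iTest), addOnes(Xnl(iTest, :)) * betaNl);

% Task 5: compare candidate models by test RMSE and pick the smallest.
designs = {X, X(:, bestPair), Xnl};
names   = {'full', 'best pair', 'full + squared term'};
testErr = [rmseFull, bestRmse2, rmseNl];
[~, m]  = min(testErr);

% Task 6: validation RMSE of the selected model, on rows used for nothing else.
D        = designs{m};
betaBest = fitOLS(D(iTrain, :), y(iTrain));
rmseVal  = rmse(y(iVal), addOnes(D(iVal, :)) * betaBest);
fprintf('Selected model: %s, validation RMSE = %.2f\n', names{m}, rmseVal);
```

Fitting only on the training rows and ranking models only on the test rows keeps the validation rows untouched until the last step, so the validation RMSE is not biased by the model selection. With real data, the correlations in `r` can help explain which subset is worth trying before running the exhaustive search.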