#!/usr/bin/env python
# coding: utf-8

# # Regression on the [Boston housing dataset](https://www.openml.org/d/531)

# ## Case study with a real data set
#
# In the following, we study and compare the performance of different regression methods applied to a data set with 13 regressors and one target variable.
#
# For this data set, we cover the
# - visualization of the data,
# - preprocessing of the data,
# - differences between learning algorithms, in particular, between ridge regression and sparse regression (Lasso and OMP),
# - cross-validation of an algorithmic hyperparameter.

# ## Understanding the dataset
#
# First, we load the data. It is available in the data sets included in [scikit-learn](https://scikit-learn.org/stable/). We create a [pandas](https://pandas.pydata.org/docs/) DataFrame from the data, as this helps with the visualization below.

# In[ ]:

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston

# %% Load data set
boston = load_boston()
print(boston.data.shape)  # get (number of rows, number of columns or 'features')
print(boston.DESCR)       # get a description of the dataset

df_boston = pd.DataFrame(boston.data, columns=boston['feature_names'])
df_boston = df_boston.assign(MEDV=boston.target)

# Generate some descriptive statistics of the dataset:

# In[ ]:

df_boston.describe()

# ## Visualization
#
# Create a histogram of the target variable, the median house prices, using [matplotlib](https://matplotlib.org):

# In[ ]:

import matplotlib.pyplot as plt

# %% Plot histogram of median house prices in dataset
plt.hist(df_boston['MEDV'], bins=30)  # bin count chosen for illustration

# We now use the package [seaborn](https://seaborn.pydata.org) for the creation of pairwise scatter plots of all variables. While matplotlib is powerful, seaborn has a quite accessible API and is designed to create plots that are helpful for the statistical interpretation of data.

# In[ ]:

# %% Use alternative plotting package seaborn for figure with all pairwise scatter plots
import seaborn as sns

# In[ ]:

# Create scatter plots of all regressor variables with the target variable (median house price)
snsplot = sns.PairGrid(df_boston, y_vars='MEDV', x_vars=df_boston.columns, aspect=1, height=6, layout_pad=1)
snsplot.map(sns.scatterplot)

# ## Preprocessing
#
# The regressor variables cover varying scales: while the variable 'CRIM' (the per capita crime rate) ranges between 0.006320 and 88.97, the variable 'NOX' (nitric oxides concentration in parts per 10 million) ranges between 0.385 and 0.871. This has implications for the magnitudes of the coefficients in the linear models used by the learning algorithms we consider.
#
# In order to mitigate the issue of these varying scales, we preprocess the data by scaling all feature/regressor variables to lie between 0 and 1.
#
# **Exercise:** Repeat the experiments below _without_ this preprocessing, or with a different preprocessing. How does the performance change?

# In[ ]:

# %% Rescale regressor coordinates (improves regression performance by levelling the regression coefficient sizes)
# create numpy arrays representing features and target values
X = boston.data
y = boston.target

from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler

scaler = MinMaxScaler()             # define scaling object (maps each column to [0, 1])
X_scaled = scaler.fit_transform(X)  # apply scaling to X
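# As a quick sanity check (a minimal sketch, assuming the preprocessing cell above has been run), every column of X_scaled should now span the interval [0, 1]:

# In[ ]:

print(X_scaled.min(axis=0))  # column-wise minima: all 0
print(X_scaled.max(axis=0))  # column-wise maxima: all 1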
# ## Data split for experimental setup
#
# We split our data set into two parts, a training and a test set. The test set contains a share of 'test_share' of the total data points.

# In[ ]:

from sklearn.model_selection import train_test_split

test_share = 0.2
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, random_state=10, test_size=test_share)

# ## Run learning algorithms
#
# ### Ridge regression for fixed regularization
#
# We run ridge regression for a fixed regularization parameter on the training set. By defining two functions, one for evaluating the performance of a model and one for visualizing the correlation between the predicted and the true target variable, we not only achieve what we want for the ridge regression model, but can easily reuse the code for the other learning algorithms below.

# In[ ]:

# Compute prediction scores for training and test data for a model, and create a scatter plot of predicted vs. true median housing prices

def plot_correlation_true_predicted(y_train, y_test, model):
    # Scatter plot of predicted vs. true median housing prices for the model
    plt.figure(figsize=(14,14))
    plt.scatter(y_train, model.predicted_train)
    plt.scatter(y_test, model.predicted_test)
    ax = plt.gca()
    ax.set_ylim(0, 1.02*max(np.max(model.predicted_train), np.max(model.predicted_test)))
    ax.set_xlim(0, 1.02*max(np.max(y_train), np.max(y_test)))
    ax.legend(["training data", "test data"], loc=0)
    ax.plot(ax.get_xlim(), ax.get_ylim(), ls="--", c=".3")
    ax.set(ylabel='predicted median housing price (in $1000)', xlabel='true median housing price (in $1000)')
    ax.set(title="Correlation between true and predicted prices for model "+str(model))

def eval_prediction_train_test(model, X_train, X_test, y_train, y_test, plot=True):
    model.score_train = model.score(X_train, y_train)
    model.score_test = model.score(X_test, y_test)
    model.predicted_test = model.predict(X_test)
    model.predicted_train = model.predict(X_train)
    print("R^2 value of model", str(model), "on the train set: %f" % model.score_train)
    print("R^2 value of model", str(model), "on the test set: %f" % model.score_test)
    if plot:
        plot_correlation_true_predicted(y_train, y_test, model)
    return model

# We define a ridge regression object for a regularization parameter of $\alpha=1$ and train it on the training set. We then call the function _eval_prediction_train_test_ to obtain the $R^2$ values and the scatter plot of true vs. predicted prices.

# In[ ]:

# %% Set up ridge regression model
from sklearn.linear_model import Ridge

alpha_ridge = 1
ridge = Ridge(alpha=alpha_ridge).fit(X_train, y_train)  # fit ridge model to training data
ridge = eval_prediction_train_test(ridge, X_train, X_test, y_train, y_test)

# In this plot, the closeness of the data points to the main diagonal visualizes the accuracy of the prediction.
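# The score reported above is the coefficient of determination $R^2 = 1 - \sum_i (y_i - \hat{y}_i)^2 / \sum_i (y_i - \bar{y})^2$. As a minimal sketch (assuming the fitted `ridge` model from the previous cell), it can be reproduced by hand or via `sklearn.metrics.r2_score`:

# In[ ]:

from sklearn.metrics import r2_score

y_pred_test = ridge.predict(X_test)
ss_res = np.sum((y_test - y_pred_test)**2)      # residual sum of squares
ss_tot = np.sum((y_test - np.mean(y_test))**2)  # total sum of squares
print("R^2 computed by hand: %f" % (1 - ss_res/ss_tot))
print("R^2 via r2_score:     %f" % r2_score(y_test, y_pred_test))
print("R^2 via ridge.score:  %f" % ridge.score(X_test, y_test))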
# ### Ridge regression with cross-validated regularization parameter (hyperparameter optimization via grid search)
#
# Next, we optimize the regularization parameter alpha of ridge regression. We use a grid of logarithmically spaced values between $10^{-5}$ and $10^{9}$, and a 5-fold [cross validation](https://scikit-learn.org/stable/modules/cross_validation.html#cross-validation).

# In[ ]:

# %% Use cross validation to optimize regularization parameter alpha of ridge regression
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import Ridge

alphas = np.logspace(-5, 9, num=60)  # create vector of logarithmically spaced values between 10^(-5) and 10^(9)
parameters = {'alpha': alphas}

def plot_training_test(model, parameters, train_errors, test_errors):
    # Plot training and test errors for different model complexities
    plt.figure(figsize=(8,8))
    plt.plot(list(parameters.values())[0], train_errors)
    plt.plot(list(parameters.values())[0], test_errors)
    ax = plt.gca()
    ax.set_xscale('log')
    ax.set(xlabel='alpha', ylabel='R2 value', title=str(model)+' for different model complexities')
    ax.set_box_aspect(1)
    ax.legend(["training data", "test data"], loc=0)

def gridsearch_crossvalidation_model(model, X, y, parameters, cv=5, plot=True):
    gridsearch = GridSearchCV(model, param_grid=parameters, scoring='r2', return_train_score=True, cv=cv)
    gridsearch.fit(X, y)
    gridsearch.train_errors = gridsearch.cv_results_['mean_train_score']
    gridsearch.test_errors = gridsearch.cv_results_['mean_test_score']
    if plot:
        plot_training_test(model, parameters, gridsearch.train_errors, gridsearch.test_errors)
    return gridsearch

gridsearch_ridge = gridsearch_crossvalidation_model(Ridge(), X_scaled, y, parameters)
ridge_optimized = gridsearch_ridge.best_estimator_

# We plot the true prices vs. the prices predicted by the "optimized" ridge regression model.

# In[ ]:

ridge_optimized = eval_prediction_train_test(ridge_optimized, X_train, X_test, y_train, y_test)

# In the first plot, we see the $R^2$ value (which, in our case, can be thought of as a quantity negatively related to the mean squared error, i.e., the empirical risk). The regularization parameter $\alpha$ to be used in our predictive model is the one that maximizes the $R^2$ value on the test set (in this context, the term *validation set* is more appropriate).

# ### Lasso regression with cross-validated regularization parameter (hyperparameter optimization via grid search)
#
# We now repeat the procedure described above for ridge regression with [Lasso regression](https://web.stanford.edu/~hastie/StatLearnSparsity/) ("Least absolute shrinkage and selection operator"), which promotes _sparsity_ in the coefficients of the linear model. As the learned model tends to have a sparse coefficient vector, it leads to a better _interpretability_ of the model, since we can identify which features are used in the model and which are not.

# In[ ]:

# %% Repeat previous experiments with Lasso instead of ridge regression
from sklearn.linear_model import Lasso

alpha_lasso = 0.5
lasso = Lasso(alpha=alpha_lasso).fit(X_train, y_train)  # fit Lasso model to training data
lasso = eval_prediction_train_test(lasso, X_train, X_test, y_train, y_test)
print("Number of features used for model "+str(lasso)+": {}".format(np.sum(lasso.coef_ != 0)))

# Run cross validation to optimize regularization parameter alpha of Lasso
lasso_optimized = gridsearch_crossvalidation_model(Lasso(), X_scaled, y, parameters).best_estimator_
lasso_optimized = eval_prediction_train_test(lasso_optimized, X_train, X_test, y_train, y_test)
print("Number of features used for model "+str(lasso_optimized)+": {}".format(np.sum(lasso_optimized.coef_ != 0)))

# We ran Lasso first for $\alpha=0.5$, which leads to a model with few non-zero coefficients (how many?). However, we see that the resulting errors are larger, i.e., the $R^2$ values for the test and training set are smaller than for ridge regression.
#
# Optimizing $\alpha$ by cross validation leads to a model with more non-zero coefficients, but better predictive properties (also improving on ridge regression in this case).
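# To make the interpretability claim concrete, we can list which of the 13 regressors the cross-validated Lasso actually uses (a minimal sketch, assuming `lasso_optimized` from the cell above):

# In[ ]:

# Names of the regressors with non-zero Lasso coefficients
selected_features = np.array(boston['feature_names'])[lasso_optimized.coef_ != 0]
print("Features with non-zero Lasso coefficients:", selected_features)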
# ### Orthogonal matching pursuit
#
# As an alternative to Lasso regression, we use [Orthogonal Matching Pursuit (OMP)](https://scikit-learn.org/stable/modules/linear_model.html#omp), a greedy method for the sparse subset selection or sparse regression problem.
#
# This algorithm allows us to specify the number of features we want to select ('n_omp'). We choose $n_{\text{omp}}=8$.

# In[ ]:

# %% OMP
from sklearn.linear_model import OrthogonalMatchingPursuit

n_omp = 8
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_omp)  # define the OMP object
omp.fit(X_train, y_train)                               # fit to training data
omp = eval_prediction_train_test(omp, X_train, X_test, y_train, y_test)

# We summarize the prediction quality of the different learning algorithms:

# In[ ]:

# %% Summarize results
from sklearn.metrics import mean_squared_error

results = {'Ridge': [ridge_optimized.score_test, mean_squared_error(ridge_optimized.predict(X_test), y_test)],
           'Lasso': [lasso_optimized.score_test, mean_squared_error(lasso_optimized.predict(X_test), y_test)],
           'OMP': [omp.score_test, mean_squared_error(omp.predict(X_test), y_test)]}
df_results = pd.DataFrame(data=results, index=['R2 on test set', 'Test MSE'])
print(df_results)

# To explore the interpretability of our models, we plot the regression coefficients.

# In[ ]:

# %% Plot coefficients for different models
plt.figure(figsize=(14,14))
plt.plot(lasso_optimized.coef_, '^', label=str(lasso_optimized))
plt.plot(ridge_optimized.coef_, 'v', label=str(ridge_optimized))
plt.plot(omp.coef_, 'o', label="OMP")
plt.legend(ncol=2, loc=(0, 1.05))
#plt.ylim(-25, 25)
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
plt.xticks(ticks=np.arange(0,13), labels=df_boston.columns[0:13])
ax = plt.gca()
#ax.set_aspect(1)

# For OMP with our choice of the number of non-zero coefficients, the learned model suggests that the target variable (the median house price in a certain area) is strongly positively correlated with the variables 'RM' (average number of rooms per dwelling) and 'ZN' (proportion of residential land zoned for lots over 25,000 sq.ft), but negatively correlated with 'LSTAT' and 'DIS' (weighted distances to five Boston employment centers).
#
# This interpretability is less pronounced for ridge regression (and even for Lasso with the $\alpha$ parameter that was optimized for prediction, not for interpretability).

# ## Regression on more complex features
#
# The $R^2$ values above still suggest that there is room for improvement to achieve better predictions, if we choose a more flexible model / better learning algorithm.
#
# We want to combine the above ideas with the usage of **polynomial features** (as seen in the wave regression example).
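# To make concrete what polynomial features are, here is a tiny, self-contained illustration on a single made-up sample with two features (the values are arbitrary and only for illustration):

# In[ ]:

from sklearn.preprocessing import PolynomialFeatures

demo = np.array([[1.0, 2.0]])       # one sample with two features a=1, b=2
poly_demo = PolynomialFeatures(degree=3)
# Columns: 1, a, b, a^2, a*b, b^2, a^3, a^2*b, a*b^2, b^3
print(poly_demo.fit_transform(demo))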
# Let's first try polynomial features of degree $3$; later, we can play around with this value.

# In[ ]:

from sklearn.preprocessing import PolynomialFeatures

# Create regressor variables with polynomial features
degree_polynomial = 3
poly = PolynomialFeatures(degree=degree_polynomial)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

# Rescale variables
scaler = MinMaxScaler()
X_train_poly = scaler.fit_transform(X_train_poly)
X_test_poly = scaler.transform(X_test_poly)

# compare the shape of the original training set with the training set after preprocessing
print("X_train.shape: {}".format(X_train.shape))
print("X_train_poly.shape: {}".format(X_train_poly.shape))

# In[ ]:

# Run ridge regression for fixed regularization parameter alpha
alpha_ridge = 1
ridge = Ridge(alpha=alpha_ridge).fit(X_train_poly, y_train)
ridge = eval_prediction_train_test(ridge, X_train_poly, X_test_poly, y_train, y_test)

# In[ ]:

# Run ridge regression for cross-validated regularization parameter alpha
ridge_optimized = gridsearch_crossvalidation_model(Ridge(), X_train_poly, y_train, parameters).best_estimator_
ridge_optimized = eval_prediction_train_test(ridge_optimized, X_train_poly, X_test_poly, y_train, y_test)

# In[ ]:

# Run Lasso for cross-validated regularization parameter alpha.
# We suggest that you choose the options "warm_start=True" and "max_iter=1000" in Lasso to make the calculation feasible.
from sklearn.linear_model import Lasso

lasso_optimized = gridsearch_crossvalidation_model(Lasso(warm_start=True, max_iter=1000), X_train_poly, y_train, parameters).best_estimator_
lasso_optimized = eval_prediction_train_test(lasso_optimized, X_train_poly, X_test_poly, y_train, y_test)
print("Number of features used for model "+str(lasso_optimized)+": {}".format(np.sum(lasso_optimized.coef_ != 0)))

# For Lasso regression with cross-validated regularization parameter $\alpha$, we can now look at the _sparsity_, i.e., the number of non-zero entries in the coefficient vector:

# In[ ]:

# vars(lasso_optimized)

# In[ ]:

lasso_optimized.coef_

# We see that this coefficient vector has many zero entries and just a few non-zero ones. In particular, while the dimensionality of the vector is

# In[ ]:

np.shape(lasso_optimized.coef_)

# the number of non-zeros is

# In[ ]:

np.count_nonzero(lasso_optimized.coef_)

# On the extended feature set, we now also run orthogonal matching pursuit.

# In[ ]:

# Run OMP
from sklearn.linear_model import OrthogonalMatchingPursuit

n_omp = 40  # a relatively small number of target non-zero coefficients; the value is quite arbitrary and can also be optimized via cross-validation
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=n_omp).fit(X_train_poly, y_train)
omp = eval_prediction_train_test(omp, X_train_poly, X_test_poly, y_train, y_test)

# Summarizing the performance of the different regression methods using polynomial features, we obtain:

# In[ ]:

# Summarize results for models trained on the new features
from sklearn.metrics import mean_squared_error

results_poly = {'Ridge': [ridge_optimized.score_test, mean_squared_error(ridge_optimized.predict(X_test_poly), y_test)],
                'Lasso': [lasso_optimized.score_test, mean_squared_error(lasso_optimized.predict(X_test_poly), y_test)],
                'OMP': [omp.score_test, mean_squared_error(omp.predict(X_test_poly), y_test)]}
df_results_poly = pd.DataFrame(data=results_poly, index=['R2 on test set', 'Test MSE'])
print(df_results_poly)

# Comparing these numbers with the $R^2$ values above, we see that working with polynomial features clearly improves the performance!
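# As noted above, the target number of non-zero OMP coefficients is somewhat arbitrary and could itself be tuned by cross-validation. A minimal sketch of such a grid search (the candidate grid below is an illustrative choice, not part of the original setup):

# In[ ]:

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import OrthogonalMatchingPursuit

parameters_omp = {'n_nonzero_coefs': np.arange(5, 101, 5)}  # illustrative grid of sparsity levels
gridsearch_omp = GridSearchCV(OrthogonalMatchingPursuit(), param_grid=parameters_omp, scoring='r2', cv=5)
gridsearch_omp.fit(X_train_poly, y_train)
print("Cross-validated number of OMP coefficients:", gridsearch_omp.best_params_)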
# Finally, we create a plot that visualizes the differences between the coefficients returned by the methods.

# In[ ]:

# %% Plot coefficients for different models
plt.figure(figsize=(14,14))
plt.plot(lasso_optimized.coef_, '^', label=str(lasso_optimized))
plt.plot(ridge_optimized.coef_, 'v', label=str(ridge_optimized))
plt.plot(omp.coef_, 'o', label="OMP")
plt.legend(ncol=2, loc=(0, 1.05))
#plt.ylim(-25, 25)
plt.xlabel("Coefficient index")
plt.ylabel("Coefficient magnitude")
ax = plt.gca()
#ax.set_aspect(1)

# We observe that the coefficient vector returned by ridge regression has many small but non-zero coefficients, whereas Lasso and OMP have few non-zero, but larger coefficients.
# Furthermore, for the given choice of parameters, we see that OMP has even fewer non-zero coefficients than Lasso.
#
# Relating the coefficient indices to the polynomial features created from the original variables, we therefore see that the sparse regression methods (Lasso and OMP) give us further options to _interpret_ the model:
#
# Each coefficient index corresponds to a certain monomial created from the 13 variables

# In[ ]:

boston['feature_names']

# so that we can learn which (combinations of) features are able to create a good predictive model. The fewer non-zero coefficients we have, the "simpler" and "easier to interpret" the model becomes.
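# As a closing sketch, we can print exactly which monomials OMP selected (this assumes the `poly` object fitted above and a scikit-learn version that provides `get_feature_names_out`; older versions use `get_feature_names` instead):

# In[ ]:

# Map OMP's non-zero coefficients back to the names of the corresponding monomials
feature_names_poly = poly.get_feature_names_out(boston['feature_names'])
print("Monomials selected by OMP:")
print(feature_names_poly[omp.coef_ != 0])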