# Pre-Work for the Inmas Workshop on Data Science, January 2022.

In the following exercise, we recap some knowledge from the [Inmas Statistical Methods Workshop](https://inmas-training.github.io/fa21-modeling-workshop/syllabus.html) of November 2021 to prepare for the Workshop on Data Science and Machine Learning on Saturday, January 15 and Sunday, January 16.

### Scikit-Learn

During the workshop weekend, we will work a lot with [scikit-learn](https://scikit-learn.org/stable/), an open-source Python library that makes many standard machine learning models and methods readily accessible. 

As the [testimonials page](https://scikit-learn.org/stable/testimonials/testimonials.html#) shows, scikit-learn is also widely used in industry.

### Revisit Linear Regression using scikit-learn

In November, we familiarized ourselves with linear regression models and applied them, via the `statsmodels` [library](https://www.statsmodels.org/stable/index.html), to the "[Salaries](https://raw.githubusercontent.com/inmas-training/fa21-statistical-methods-workshop/main/data/Salaries.csv)" dataset, which contains data about the salaries of academic professionals.

The dataset contains 397 observations of the following 6 variables.

- rank: a factor 
  - AssocProf, AsstProf, Prof
- discipline: a factor
  - A (“theoretical” departments) or B (“applied” departments).
- yrs.since.phd: integer
  - years since PhD.
- yrs.service: integer
  - years of service.
- sex: a factor
  - Female, Male
- salary: number
  - nine-month salary, in dollars.

We recall that we previously fitted a multiple linear regression model to the dataset, using 'salary' as the response variable and the number of years since Ph.D. and the number of years of service ('yrs.since.phd' and 'yrs.service') as predictor variables.

Mathematically, this means that we compute the 'best' (in the $\ell_2$ sense) coefficients $\beta_0,\ldots,\beta_p$ in the model
$$
\begin{align}
y_i &= \beta_0 + X_{i,1}\beta_1 + \cdots + X_{i,p}\beta_{p}
\end{align}
$$
where $y_i$ is the salary of the $i$-th data sample and $X_{i,j}$ is the value of the $j$-th predictor variable for the $i$-th data sample.
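To make the $\ell_2$ criterion concrete, here is a minimal sketch that computes least-squares coefficients directly with NumPy. It uses synthetic data rather than the Salaries dataset; the sample size, the "true" coefficients, and the noise level are all made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic data (made up for illustration): n samples, p = 2 predictors.
n = 200
X = rng.normal(size=(n, 2))
true_beta = np.array([1.5, -2.0, 0.5])  # hypothetical [beta_0, beta_1, beta_2]
y = true_beta[0] + X @ true_beta[1:] + 0.1 * rng.normal(size=n)

# Prepend a column of ones so that beta_0 plays the role of the intercept.
A = np.column_stack([np.ones(n), X])

# Least squares: find the beta that minimizes ||y - A @ beta||_2.
beta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta_hat)
```

Because the noise is small, the entries of `beta_hat` should be close to the hypothetical `true_beta`. Libraries such as `statsmodels` and scikit-learn solve the same minimization problem, with more convenient interfaces.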


Below, we recall the code we used.

In [None]:
import pandas as pd
import numpy as np
import statsmodels 
import statsmodels.api as sm
import statsmodels.formula.api as smf
import matplotlib
import matplotlib.pyplot as plt

url = "https://raw.githubusercontent.com/inmas-training/fa21-statistical-methods-workshop/main/data/Salaries.csv"
Salaries = pd.read_csv(url)
display(Salaries)

In [None]:
results = smf.ols('salary ~ Q("yrs.since.phd") + Q("yrs.service")', data = Salaries).fit()
results.params

This gives us the fitted values of $\beta_0$ (the intercept) and of $\beta_1$ and $\beta_2$, the coefficients of the two predictor variables 'yrs.since.phd' and 'yrs.service', respectively.

## Exercise 1

**Use the toolkit [scikit-learn](https://scikit-learn.org/stable/) to perform the same task as above, i.e., to fit the same multiple linear regression model. Print the resulting two coefficients and the intercept.**

The class [sklearn.linear_model.LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html?highlight=linearregression#sklearn.linear_model.LinearRegression) and [this example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py) may be useful here.

In [None]:
from sklearn.linear_model import LinearRegression

### add your code below ###

Of course, we will learn how to use more powerful models than just linear regression, but the syntax and user interface will be similar.
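As a small illustration of that shared interface, the sketch below fits two other scikit-learn regressors on synthetic data using the same `fit`, `predict`, and `score` calls. The data, the choice of `Ridge` and `DecisionTreeRegressor`, and their hyperparameters are assumptions made for this example only; they are not taken from the workshop material.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Synthetic data (made up for illustration): two predictors, one response.
X = rng.uniform(0, 10, size=(100, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Two different estimators; hyperparameters chosen arbitrarily for the example.
models = {
    "ridge": Ridge(alpha=1.0),
    "tree": DecisionTreeRegressor(max_depth=3, random_state=0),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)                           # same call for every estimator
    predictions[name] = model.predict(X[:5])  # predictions for the first 5 samples
    print(name, model.score(X, y))            # R^2 on the training data
```

Only the line that constructs the estimator changes from model to model; the fitting and prediction code stays the same.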

## Exercise 2

We have learned to use the `seaborn` library for a variety of data visualizations. Let us review our knowledge.

In [None]:
import seaborn as sns

**Plot the salaries of assistant professors, associate professors, and professors in different colors versus the years since Ph.D., including their respective regression lines, with just one line of code.**

In [None]:
### add your code below ###