Please complete the following steps before the beginning of the workshop on Friday, March 19! The goal is to set up and familiarize yourself with a Python environment that is suitable for working with data sets and machine learning methods.

Complete these tasks before the workshop on Friday, March 19:

Download and install Anaconda (Individual Edition), a Python distribution that includes the packages we will use in this workshop. We suggest working with the integrated development environment (IDE) Spyder, which is already included in Anaconda.
Revisit what you learnt about linear regression in the Workshop on Statistical Learning in February. For the “Salaries” dataset that you got to know in Chapters 1 and 2 of the workshop, use R to create a scatter plot of the variables “yrs.since.phd” and “salary”, with the points colored by the variable “discipline”.
Find and plot the two linear regression lines for the variables “yrs.since.phd” and “salary”, one for each value of the variable “discipline”.

(The code for the last two tasks has been discussed in Section 2.5 of Dr. McNamara’s February workshop material.)

Try to reproduce the last two steps using Python. You can use this link to download a .csv file of the “Salaries” dataset, which can be imported with the package pandas, and you can use this file to get started.
The following functions might be helpful for this task:

sklearn.linear_model.LinearRegression.fit, cf. scikit-learn documentation.
pandas.DataFrame.plot.scatter, cf. pandas documentation.
matplotlib.pyplot.plot, cf. matplotlib documentation.
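
As a rough orientation, the three functions above could be combined as in the following sketch. It is only one possible approach, not the workshop's solution. The helper names `fit_lines_by_group` and `plot_scatter_with_lines` are made up for illustration, and the sketch assumes that the .csv file uses the same column names as the R dataset (“yrs.since.phd”, “salary”, “discipline”) and that “salary” is treated as the response variable.

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.linear_model import LinearRegression


def fit_lines_by_group(df, x="yrs.since.phd", y="salary", group="discipline"):
    """Fit one least-squares line y ~ x for each value of `group`.

    Returns a dict mapping each group value to (slope, intercept).
    Column names are assumptions about the .csv file.
    """
    lines = {}
    for value, part in df.groupby(group):
        model = LinearRegression()
        # scikit-learn expects a 2-D feature matrix, hence the double brackets.
        model.fit(part[[x]], part[y])
        lines[value] = (float(model.coef_[0]), float(model.intercept_))
    return lines


def plot_scatter_with_lines(df, x="yrs.since.phd", y="salary", group="discipline"):
    """Scatter plot colored by `group`, with one fitted line per group."""
    fig, ax = plt.subplots()
    lines = fit_lines_by_group(df, x, y, group)
    for i, (value, part) in enumerate(df.groupby(group)):
        color = f"C{i}"  # i-th color of matplotlib's default color cycle
        part.plot.scatter(x=x, y=y, color=color, label=str(value), ax=ax)
        slope, intercept = lines[value]
        xs = [part[x].min(), part[x].max()]
        # ax.plot is the Axes counterpart of matplotlib.pyplot.plot.
        ax.plot(xs, [slope * v + intercept for v in xs], color=color)
    return ax


# Example usage (assumes the downloaded file was saved as "Salaries.csv"):
# salaries = pd.read_csv("Salaries.csv")
# plot_scatter_with_lines(salaries)
# plt.show()
```

Fitting one model per group mirrors the idea of conditioning on “discipline”: each line is estimated only from the rows that belong to that group.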

(optional): Read the article 50 Years of Data Science by David Donoho. According to Donoho, what is a crucial (and sometimes under-appreciated) paradigm that led to the development of technologies like smartphone voice recognition or machine translation?

Preparation for the Workshop

Solution files for 3. and 4.: