# -*- coding: utf-8 -*-
"""workshop-classification.ipynb

Automatically generated by Colaboratory.

Original file is located at
    https://colab.research.google.com/drive/1VpmOXMxaE08ZjZp7hRukGaDjH9andk0X

In this workshop we will explore classification on two data sets: the famous MNIST dataset containing handwritten digits 0 through 9, and the Fashion MNIST dataset containing 10 different types of clothing.

Let's begin by importing our common libraries as well as the datasets. Notice that we set the numpy seed to 0 for reproducibility. It's a good idea to set random seeds to specific values while you are experimenting.
"""

import numpy as np;  np.random.seed(0)
import pandas as pd
import matplotlib.pyplot as plt

from tensorflow.keras.datasets import mnist, fashion_mnist

import warnings
warnings.filterwarnings("ignore")

"""This notebook works eqully well with the MNIST and Fashion MNIST datasets. You are encouraged to run your code on both, switching using the fashioon = True flag below.

Note that the load_data function returns a separate train and test set. I highly recommend only looking at training data. In real life this is the data that you actually have, and in many cases the test data doesn't even exist yet. 

A huge problem in machine learning is leaking testing data into training data. You'll find that models are REALLY good at predicting data that they've already seen. In our case this leakage into the model is unlikely to happen since we were given preseparated data (though how well do you trust the person who gave you it?), but I assure you it definitely happens when you are separating data yourself. Even if you don't leak data directly into your models you can still have indirect leakage because any knowledge you have about the test data from taking a peak could influence how you design your models.

I HIGHLY reccomend you scrutinize your data separation over and over again for any nontrivial data leaks in any project you undertake.
"""

fashion = True
if fashion: (X_train, Y_train), (X_test, Y_test) = fashion_mnist.load_data()
else: (X_train, Y_train), (X_test, Y_test) = mnist.load_data()

"""Taking a look at the labels Y_train you'll notice that they are digits from 0 to 9. This is great for the MNIST dataset, but less so for Fashion MNIST.

The Fashion MNIST website https://github.com/zalandoresearch/fashion-mnist provides a dictionary between labels and clothing types, which we will put here for ease of interpretability.


"""

if fashion : label_dict= {
0 : "tshirt",
1 : "pants",
2 : "sweater",
3 : "dress",
4 : "long sleeve",
5 : "sandal",
6 : "jacket",
7 : "sneaker",
8 : "bag",
9 : "shoe"}

else:
  # for regular MNIST the dictionary looks like i : "i"
  label_dict = dict(zip(range(10), map(str, range(10))))

"""Let's plot some images and see if we agree with the labels.

Here we are using the object oriented matplotlib interface. A comparison between the OO and traditional interfaces can be found at https://matplotlib.org/matplotblog/posts/pyplot-vs-object-oriented-interface/
"""
# %% 
n_row = 3
n_col = 5
fsize = (12,10)

fig, axs = plt.subplots(n_row, n_col, figsize=fsize)

for i, ax in zip(label_dict, axs.ravel()):

  # print the label and remove axis ticks
  # since these are images
  ax.set_title("{}: {}".format(i, label_dict[i]))
  ax.set_xticks([])
  ax.set_yticks([])

  # try to parse this. for each i we find a
  # random image in X_train which has label i
  ax.imshow(X_train[np.random.choice(np.argwhere(Y_train == i).flatten())], 
            cmap='gray' )

# make the rows closer together since we have 3 
# but only 2 rows have images
for ax in axs.ravel()[len(label_dict):]:
  ax.set_visible(False)
  
plt.show()

"""Let's get a sense for the size and shape of our data; how many images are in the training set and what are their dimensions?

Because this is a multilabel classification problem we also want to know how balanced the labels are. Our approach could be very different if there are an equal number of each class versus if we have 20000 shirts and only 4 pairs of pants (me IRL). The problem of unequal class sizes is often called 'class imbalance' and you can find many good articles and bad blog posts on how to handle it. 
"""

X_train.shape

"""A good way to look for label imbalence when we only have a small number of labels is using a bar graph. We can count the number of each label using np.unique with the parameter return_counts=True"""

labels_train, counts_train = np.unique(Y_train, return_counts=True)
plt.bar(labels_train, counts_train)
plt.show()

"""Because our data are images they are stored in 3D arrays. There are many ways to approach machine learning problems on images, and we will use a basic one here. We will think of each image as a string of pixel values with no spatial relation, allowing us to flatten it into a vector of length width*height. We can flatten our entire 2D array into a 3D array using the reshape method."""

im_dims = X_train.shape[1:]

X_train = X_train.reshape(-1, np.prod(im_dims))
X_test = X_test.reshape(-1, np.prod(im_dims))

"""Ok let's start learning! As we have mentioned, we don't want to look at the testing data, but we do need a way to evaluate our models.

One common method is to separate a chunk of our training data into "development" data and use this to evaluate our models trained on the remainder of the training data. This can be accomplished using the train_test_split function form Sci-Kit Learn's model selection module.
"""

# we will import everything we need from this module here,
# don't worry about the other functions yet
from sklearn.model_selection import train_test_split, KFold, cross_validate, GridSearchCV

x_train, x_dev, y_train, y_dev = train_test_split(X_train, Y_train, test_size=0.3)

"""Our training data has a certain distribution of labels (if it was sampled well) which models the natural distribution of whatever data we expect to find in the wild. This doesn't neccesarily mean that this distribution is uniform, but we should keep this in mind whenever we sample our data as it plays a big role in how our models generalize.

In general, models tend to work better on more abundant labels. This may or may not be desireable, and one should look into ways to deal with class imbalance for their specific case.

Let's say we are using the MNIST digit dataset. train_test_split provided us with a random sampling of evaluation data and training data, but suppose we were really unlucky and our training data consisted of only 2s and 3s, while our evaluation data was all 8s and 9s. In this case our models would be pretty good at recognizing 2s and 3s, but wouldn't score very well on the evaluation data.

Of course this is unlikely to happen for us, but in many cases of great class imbalance train_test_split may miss or underrepresent classes. Let's check to see if our train labels y_train and our evaluation labels y_dev have the same distribution.
"""

# bar graph labels distribution for train
labels_train, counts_train = np.unique(y_train, return_counts=True)
plt.bar(labels_train, counts_train)
plt.show()

# bar graph labels distribution for train
labels_dev, counts_dev = np.unique(y_dev, return_counts=True)
plt.bar(labels_dev, counts_dev)
plt.show()

# ratio of class labels in train and in dev
(counts_dev/counts_dev.sum()) / (counts_train/counts_train.sum())

"""The distributions should look pretty similar, and this is good enough for us. For the sake of demonstration, if one wanted to force the distributions to be exactly the same, train_test_split excepts a stratify parameter that allows it to stratify the split by a given label distribution such as the labels Y_train."""

x_train, x_dev, y_train, y_dev = train_test_split(X_train, Y_train, test_size=0.3, stratify=Y_train)

# distribution of stratified y_train
labels_train, counts_train = np.unique(y_train, return_counts=True)
plt.bar(labels_train, counts_train)
plt.show()

# distribution of stratified y_dev
labels_dev, counts_dev = np.unique(y_dev, return_counts=True)
plt.bar(labels_dev, counts_dev)
plt.show()

# ratio of class labels in train/dev very close to 1
(counts_dev/counts_dev.sum()) / (counts_train/counts_train.sum())

"""In order to reduce runtime of our models in this workshop, we will modify our datasets to toy versions by downsampling all of our images, removing every other row and column, and taking a subset of the Train and Test data. 

This kind of downsampling may be useful while you experiment, but of course you should make sure your results generalize to your real dataset.
"""

# delete every other row and column of images
X_train = X_train[:, np.vstack([2*i*28 + np.arange(0,28) for i in range(0, 14)]).ravel()]
X_train = X_train[:, [2*i for i in range(X_train.shape[-1]//2)]]

X_test = X_test[:, np.vstack([2*i*28 + np.arange(0,28) for i in range(0, 14)]).ravel()]
X_test = X_test[:, [2*i for i in range(X_test.shape[-1]//2)]]

# throw away 80% of our data
X_train, _, Y_train, _ = train_test_split(X_train, Y_train, train_size=0.2, stratify=Y_train, random_state=0)
X_test, _, Y_test, _ = train_test_split(X_test, Y_test, train_size=0.2, stratify=Y_test, random_state=0)

"""Let's take a quick look to make sure our downsampled toy problem still looks reasonable. Note the images are now 14x14"""

n_row = 3
n_col = 5
fsize = (12,10)

fig, axs = plt.subplots(n_row, n_col, figsize=fsize)

for i, ax in zip(label_dict, axs.ravel()):
  ax.set_title('%d: %s' % (i, label_dict[i]))
  ax.set_xticks([])
  ax.set_yticks([])
  ax.imshow(X_train[np.random.choice(np.argwhere(Y_train == i).flatten())].reshape(14,14), 
            cmap='gray' )

for ax in axs.ravel()[len(label_dict):]:
  ax.set_visible(False)
  
plt.show()

"""We are ready to import our models. In this workshop we will use multinomial Logisitic Regression and K Nearest Neighbors classifiers.

There are many metrics one can use on a classification problem, indeed in the real world you'll often find that choosing the correct metric is THE problem itself. For now we will stick to accuracy.

Why might you want a different metric? Suppose you have a binary classification problem with 90% of your labels 0 and 10% 1. A model which always predicts 0 will have a great score of 90%! In this case you might chose a different metric which takes this imbalance into account, for example the f1 score. Worse yet, maybe a 0 is "all normal" and 1 is "catastrophic failure", then you better be sure you get those 1 predictions correct...
"""

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, make_scorer

"""Let's stratified split our data, even though stratification isn't really neccesary due to our balanced classes."""

x_train, x_dev, y_train, y_dev = train_test_split(X_train, Y_train, test_size=0.3, stratify=Y_train)

"""Surprisingly, the easy part of data science is in running and fitting the models. We will fit a logisistic regression model and evalue it on train and dev data."""

# due to defaults in sklearn LogisticRegression
# we need to pass in penalty='none', multi_class='multinomial'
# to get the version described in class
lr = LogisticRegression(penalty='none', multi_class='multinomial')
lr.fit(x_train, y_train)
y_train_lr_pred = lr.predict(x_train)
y_dev_lr_pred = lr.predict(x_dev)

"""Now we can compute accuracies of our predictions on each dataset. You should find that the training accuracy is higher than the development accuracy.

The discrepency between these two scores is important to keep in mind. A much higher training score could imply overfitting of your model to the training data, suggesting a form of regularization. A much higher dev accuracy score might make you rethink if your evaluation dataset and trainign dataset were sampled well.
"""

acc_train_lr = accuracy_score(y_train, y_train_lr_pred)
acc_dev_lr = accuracy_score(y_dev, y_dev_lr_pred)

print("Logistic Regression: Train Accuracy: {:.4f}, Dev Accuracy: {:.4f}".format(acc_train_lr, acc_dev_lr))

"""If fitting models was the easy part of data science, one of the hardest parts is figuring out why models are giving you certain results. If your model didn't give you 100% accuracy then it is making mistakes. Are these mistakes random? Is there a pattern to them? Does the model make the same mistake everytime?

For example in the MNIST digit dataset you might find that logistic regression mixes up 3s and 8s a lot, or 1s and 7s. This is pretty reasonable since handwritten versions of these digits look similar. Is there anything you can do about this? Should you be worried?

We can use the sci-kit learn confusion_matrix to see exactly how our predictions are going wrong.
"""

confusion_frame = pd.DataFrame(confusion_matrix(y_dev, y_dev_lr_pred), index=label_dict.values(), columns=label_dict.values())

confusion_frame

"""Now let's run a different classification algorithm. The K Nearest Neighbors classifier. Feel free to choose how many neighbors you want to consider."""

n_neighbors = 5

knn = KNeighborsClassifier(n_neighbors)
knn.fit(x_train, y_train)
y_dev_pred = knn.predict(x_dev)

y_train_pred = knn.predict(x_train)
y_dev_pred = knn.predict(x_dev)

acc_train = accuracy_score(y_train, y_train_pred)
acc_dev = accuracy_score(y_dev, y_dev_pred)

print("KNN {}: Train Accuracy: {:.4f}, Test Accuracy: {:.4f}".format(n_neighbors, acc_train, acc_dev))

"""Depending on which dataset you used you might find that K Nearest Neighbors does about the same as Logistic Regression or much better.

When one algorithm does significantly better than another it is worth thinking about why that algorithm should be better on your data. For example perhaps you are using image data and one algorithm takes into account the spatial properties of the image (this doesn't apply to us in the current state because we flattened our images).

In the case that two models do about the same it becomes more complicated to decide which is better. There is some noise in our evaluation accuracy depending on which portion of the data was randomly assigned to x_dev. 

The most common way to deal with this is through Cross Validation, and the most popular form of crossvalidation is Kfold cross validation. In Kfold cross validation the data is separated into k chunks. For each of these chunks a model is trained on the other k-1 chunks and evaluated on the helf out chunk. Because each data point is assigned to only one chunk it has a unique evaluation score, and these scores are used to determine the effectivness of the model.

Sci-kit learn offers both normal and Stratified version of Kfold cross validation. It is really easy to write your own version of cross validation, and in many cases your data will require a non trivial form of validation that will require you to do this.

Let's use 5fold cross validation on our Logistic Regression model via sci-kit learn's Kfold object.
"""

# shuffling data is often a good idea
# you dont want all your pants in a group
# and all of your shirts in another
cv = KFold(5, shuffle=True)

# note when we use cross validation we
# pass in the whole X_train instead of 
# splitting into x_train, x_dev
for train_index, dev_index in cv.split(X_train, Y_train):
  x_train, x_dev = X_train[train_index], X_train[dev_index]
  y_train, y_dev = Y_train[train_index], Y_train[dev_index]

  lr = LogisticRegression(penalty='none', multi_class='multinomial')
  lr.fit(x_train, y_train)
  y_train_lr_pred = lr.predict(x_train)
  y_dev_lr_pred = lr.predict(x_dev)

  acc_train_lr = accuracy_score(y_train, y_train_lr_pred)
  acc_dev_lr = accuracy_score(y_dev, y_dev_lr_pred)

  print("Logistic Regression: Train Accuracy: {:.4f}, Test Accuracy: {:.4f}".format(acc_train_lr, acc_dev_lr))

"""Because Kfold cross validation loops all look the same, sci-kit learn offers a higher level function called cross_validate to take care of this for you. All you have to do is pass in an estimator, a cross validation strategy and a dictionary of scores."""

lr = LogisticRegression(penalty='none', multi_class='multinomial')
cv = KFold(5, shuffle=True)
scores = {'acc' : make_scorer(accuracy_score)}

cv_results = cross_validate(X=X_train, y=Y_train, estimator=lr, cv=cv, scoring=scores)

"""cross_validate produces a helpful table of times and test scores. Notice that the scores for each fold vary slightly. When we cross validate a model we are looking not only for high scores but for CONSISTENCY of scores.

If your results across folds looke like 90%, 90%, 90%, 60%, 90%, your job as a data scientist is to figure out what happened in that fourth fold to throw your model so far off. 

In general consistent cross validation scores will lead to results on the test data which are also consistent with what you've seen so far. 
"""

pd.DataFrame(cv_results).round(4)

"""Before we cross validate our KNeighborsClassifier, note that we arbitrarily set n_neighbors=5. Was this a good choice? We should try various values of this parameter to see what the optimal value is i.e the one that achieves the highest cross validation score. By hand this could be done in a nested loop, with the outer loop running through choices of n_neighbors and the inner loop through folds.

Luckily for us sci-kit learn packages this up into the super useful GridSearchCV object. GridSearchCV takes an estimator, a cross validation strategy, and a dictionary of parameters and finds the best parameters for the estimator. Unlike cross_validate, GridSearchCV acts as an estimator itself with fit and predict methods, so you dont need to retrain a new estimator with the optimal paramters. What a bargain!
"""

knn = KNeighborsClassifier()
params= {"n_neighbors" : [3,5,10]}
cv = KFold(3, shuffle=True)

gcv = GridSearchCV(estimator=knn, cv=cv, param_grid=params)

"""Fitting the GridSearchCV estimator might take a minute..."""

# Commented out IPython magic to ensure Python compatibility.
# %time gcv.fit(X_train, Y_train)

"""The GridSearchCV object holds a dictionary cv_results_ with full results from the fitting which you should look at carefully for consistency."""

gcv.cv_results_

"""best_estimator_ holds the information for the optimal configuration of parameters found by the gridsearch."""

gcv.best_estimator_

"""Now that we have a good idea of where Logistic Regression and K Nearest Neighbors stand, let's train them on the entire training data and evaluate them on the held out, never looked at testing data.

K Nearest Neighbors has already been trained via GridSearchCV, so we only need to train logistic regression.
"""

lr = LogisticRegression(multi_class="multinomial", penalty='none')
lr.fit(X_train, Y_train)

y_pred_knn = gcv.predict(X_test)
y_pred_lr = lr.predict(X_test)

"""Let's take a look at the labels assigned by K Nearest Neighbors and Logistic Regression on some of our images. Do you see any disagreements? You can run the cell below multiple times to produce different images"""

n_row = 3
n_col = 5

shuff = np.random.permutation(len(Y_test))

fig, axs = plt.subplots(n_row, n_col, figsize=fsize)
for i, ax in zip(shuff, axs.ravel()):
  ax.set_title('True label: %d\n Prediction by KNN: %d \n Prediction by LR: %d' % (Y_test[i], y_pred_knn[i], y_pred_lr[i]))
  ax.set_xticks([])
  ax.set_yticks([])
  ax.imshow(X_test[i].reshape(14,14), 
            cmap='gray' )

for ax in axs.ravel()[len(label_dict):]:
  ax.set_visible(False)
  
plt.show()

"""We compute the scores on the never seen testing data set. You should hopefully see some consistency with your cross validation results. If not you need to think about what could cause a discrepancy. Maybe the characterstics of the testing data has changes since you collected the training data. Perhaps people stopped wearing pants and so these are missing from the testing data?

In general the test scores should not surprise you at this point.
"""

knn_acc = accuracy_score(Y_test, y_pred_knn)
lr_acc = accuracy_score(Y_test, y_pred_lr)

print("KNN acc: {:.4f}, lr acc: {:.4f}".format(knn_acc, lr_acc))

"""Great! On the MNIST dataset you should see that K Nearest Neighbors outperforms Logistic Regression, while on the Fashion MNIST dataset they do about equal.

However, there is more than one way to reach a certain accuracy score. Perhaps regardless of their overall performance one of these models was better than the other at a particular digit or a particular article of clothing. We can check the correlation of the results to see how similar these models are.
"""

(y_pred_knn == y_pred_lr).mean()

"""For a more in depth look we can form a confusion matrix between the K Nearest Neighbors predictions and the Logistic Regression predictions. Where there specific items that they tended to disagree on? In these cases who was correct?"""

pd.DataFrame(confusion_matrix(y_pred_knn, y_pred_lr), index=label_dict.values(), columns=label_dict.values())

"""BONUS

If we have two or more estimators that are fairly uncorrelated we can leverage their different strengths in a new model via a precedure known as stacking.

The idea goes like this. Model A and Model B are both looking at pictures of clothes and making predictions. They recieve an image which Model A confidently identifies as pants. Model B thinks that the image is of a dress, but is less sure. Model C recieves these results from Models A and B and knows from experience that Model A is an expert on things that look more like pants. Model C gives us our final prediction "pants".

Stacking can be thought of as a form of feature engineering or nonlinear transformation, which we haven't used so far in this notebook. From the original features (pixel values) we generate new features which are the outputs of our models. We then train a new model on these outputs to give us our overall prediction. 

Many models such as Random Forests and Neural Networks can be thought of as kinds of stacked models. Indeed you can even stack stacked models! In practice the best performance on machine learning problems tends to come from large ensembles of stacked models.

How to train and evaluate stacked models is a complex task that is very prone to overfitting. I would highly reccomend reading into various strategies, particularly from the Kaggle/Competition community (these peope love stacking), before using this in your own work.

For now we illustrate the power of stacking using sci-kit learn's StackingClassifier. Using this object is quite simple, you need to pass in the base estimators, the meta estimator (Model C above), and a cross validation strategy. Under the hood StackingClassifier trains the base estimators on the entire training set, as well as performs out of fold class probability estimates using the cross validation strategy. The meta estimator is then trained on these held out probability estimates instead of the actual training set predictions to help avoid overfitting.

Once again, using and understanding are very different things. I reccomend doing your own research. This section is vague on purpose.
"""

from sklearn.ensemble import StackingClassifier

estimators = [("knn", KNeighborsClassifier(5)), ("lr", LogisticRegression(multi_class='multinomial', penalty='none'))]
final_estimator = LogisticRegression(multi_class='multinomial', penalty='none')
cv = KFold(5, shuffle=True)

sc = StackingClassifier(estimators=estimators, final_estimator=final_estimator, cv=cv)

"""StackingClassifier can be fit like any estimator in the sci-kit learn API"""

sc.fit(X_train, Y_train)

"""Likewise, we can use the predict method to get results from our stacked model on X_test"""

y_pred_stack = sc.predict(X_test)

"""You might find an improvement over the individual models. Isn't that cool! It might be worth digging through the models, which are stored in the StackingClassifier object, to get a better understanding of how this stacked model is working."""

accuracy_score(Y_test, y_pred_stack)