{ "cells": [ { "cell_type": "markdown", "id": "49a75c73", "metadata": {}, "source": [ "
\n", "\n", "# Inmas Machine Learning Workshop January 2023\n", "Instructor: Christian Kuemmerle - kuemmerle@uncc.edu
\n", "Teaching Assistants: Emily Shinkle, Yuxuan Li, Derek Kielty, Yashil Sukurdeep, Tim Wang, Ben Brindle.\n", "\n", "\n", "## Session II" ] }, { "cell_type": "markdown", "id": "2b2d4f50", "metadata": {}, "source": [ "## Introduction to [Sentiment Analysis with VADER](https://www.geeksforgeeks.org/python-sentiment-analysis-using-vader/)\n", "\n", "VADER (Valence Aware Dictionary and sEntiment Reasoner) uses lexicons and rules to evaluate the sentiment expressed in text. It is particularly useful in the context of evaluating sentiments in social media. Using VADER in Python is surprisingly simple. Below you can apply VADER to get an idea of how NLP can be used." ] }, { "cell_type": "code", "execution_count": 10, "id": "6aa40dc9", "metadata": {}, "outputs": [], "source": [ "#!pip install vaderSentiment\n", "try: \n", " import vaderSentiment\n", "except ModuleNotFoundError:\n", " !conda install --yes -c conda-forge vadersentiment\n", " #!conda install --yes scikit-learn==1.2\n", " import vaderSentiment\n", "\n", "from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer" ] }, { "cell_type": "code", "execution_count": 11, "id": "96f9bb17", "metadata": {}, "outputs": [], "source": [ "# function to print sentiments of sentences (code from GeeksforGeeks)\n", "def sentiment_scores(sentence):\n", " \n", " # Create a SentimentIntensityAnalyzer object.\n", " sid_obj = SentimentIntensityAnalyzer()\n", " \n", " # polarity_scores method of SentimentIntensityAnalyzer\n", " # object gives a sentiment dictionary.\n", " # which contains pos, neg, neu, and compound scores.\n", " sentiment_dict = sid_obj.polarity_scores(sentence)\n", " \n", " print(\"Overall sentiment dictionary is : \", sentiment_dict)\n", " print(\"sentence was rated as \", sentiment_dict['neg']*100, \"% Negative\")\n", " print(\"sentence was rated as \", sentiment_dict['neu']*100, \"% Neutral\")\n", " print(\"sentence was rated as \", sentiment_dict['pos']*100, \"% Positive\")\n", " \n", " print(\"Sentence Overall Rated As\", end = \" \")\n", " \n", " # decide sentiment as positive, negative and neutral\n", " if sentiment_dict['compound'] >= 0.05 :\n", " print(\"Positive\")\n", " \n", " elif sentiment_dict['compound'] <= - 0.05 :\n", " print(\"Negative\")\n", " \n", " else :\n", " print(\"Neutral\")" ] }, { "cell_type": "markdown", "id": "8c3afd5b", "metadata": {}, "source": [ "Below we give a couple of examples to illustrate how just a few words can alter the sentiment of a sentence as measured by VADER. The first example is entirely neutral, while in the second example only one word, \"exciting,\" is added, but it changes the sentiment significantly. Learning things isn't necessarily good or bad, but learning exciting things is usually good." ] }, { "cell_type": "code", "execution_count": 24, "id": "90bac597", "metadata": { "scrolled": true }, "outputs": [], "source": [ "sentiment_scores(\"I have learned things from INMAS workshops!\")" ] }, { "cell_type": "code", "execution_count": 23, "id": "d091c8ad", "metadata": {}, "outputs": [], "source": [ "sentiment_scores(\"I have learned exciting things from INMAS workshops!\")" ] }, { "cell_type": "markdown", "id": "b3d7b07e", "metadata": {}, "source": [ "### Exercise:\n", "\n", "Try evaluating the sentiment of your own sentence:" ] }, { "cell_type": "code", "execution_count": null, "id": "8dcfd508", "metadata": {}, "outputs": [], "source": [ "#add your own sentence inside the parentheses - remember syntax for strings\n", "sentiment_scores()" ] }, { "cell_type": "markdown", "id": "cbd177a1", "metadata": {}, "source": [ "### Exercise:\n", "\n", "How might other machine learning algorithms we've seen in this workshop be coupled with sentiment analysis to learn text-based datasets?" ] }, { "cell_type": "markdown", "id": "c781887b", "metadata": {}, "source": [ "## Classification of [Newsgroups dataset](https://scikit-learn.org/stable/datasets/real_world.html#newsgroups-dataset) Text\n", "\n", "We now consider a text-based data set, which is based on e-mails. The e-mails have classified into 20 categories.\n", "\n", "The task is be to predict the categories of unseen e-mails based on a the knowledge of a set of already classified e-mails." ] }, { "cell_type": "markdown", "id": "b9fab406", "metadata": {}, "source": [ "### Understanding the data set\n", "\n", "We first load the data set and inspect its description. In this case, a seperation into training and test data is already provided (differentiated by the variable 'subset')." ] }, { "cell_type": "code", "execution_count": null, "id": "f4eb5fcd", "metadata": {}, "outputs": [], "source": [ "from sklearn.datasets import fetch_20newsgroups\n", "from sklearn.feature_extraction.text import CountVectorizer\n", "import numpy as np" ] }, { "cell_type": "code", "execution_count": null, "id": "c9e5d718", "metadata": {}, "outputs": [], "source": [ "train= fetch_20newsgroups(subset=\"train\")" ] }, { "cell_type": "code", "execution_count": null, "id": "0d4efbd3", "metadata": {}, "outputs": [], "source": [ "print(train.DESCR) #prints a description of the data set" ] }, { "cell_type": "code", "execution_count": null, "id": "f9147261", "metadata": {}, "outputs": [], "source": [ "# To see what the e-mails look like, we just display a few of them, as well as the corresponding category:\n", "print(train.data[0])\n", "print(train.target[0])\n", "print(\"Category: \",train.target_names[train.target[0]],\"\\n\")\n", "\n", "print(train.data[10])\n", "print(train.target[10])\n", "print(\"Category: \",train.target_names[train.target[10]],\"\\n\")\n", "\n", "print(train.data[50])\n", "print(train.target[50])\n", "print(\"Category: \",train.target_names[train.target[50]])\n" ] }, { "cell_type": "markdown", "id": "cd68e4b3", "metadata": {}, "source": [ "The available categories are the following:" ] }, { "cell_type": "code", "execution_count": null, "id": "dafc49e6", "metadata": {}, "outputs": [], "source": [ "train.target_names" ] }, { "cell_type": "markdown", "id": "43cfbd61", "metadata": {}, "source": [ "We can see above that data contains full e-mails that contain header and (sometimes) footer information. Since we want to assess the behavior of methods based on text only, avoiding the additional information of metadata, we reimport the data using the respective option.\n", "\n", "Furthermore, we select just a subset of the emails in 12 categories (instead of all 20) in order to speed up computations." ] }, { "cell_type": "code", "execution_count": null, "id": "af04668a", "metadata": {}, "outputs": [], "source": [ "train_allcat = train\n", "categories = ['alt.atheism','talk.religion.misc','comp.graphics','sci.space',\n", " 'comp.os.ms-windows.misc','comp.sys.ibm.pc.hardware','comp.sys.mac.hardware',\n", " 'talk.politics.misc','sci.med','rec.autos','sci.electronics','rec.motorcycles']\n", "\n", "train= fetch_20newsgroups(subset=\"train\",remove = ('headers', 'footers'),categories=categories)\n", "test= fetch_20newsgroups(subset=\"test\",remove = ('headers', 'footers'),categories=categories)" ] }, { "cell_type": "markdown", "id": "999250e0", "metadata": {}, "source": [ "We proceed to obtain some understanding of the data set: We extract the frequency of each category, first for the training set, then for the test set. Also, we obtain the memory size of the data we process." ] }, { "cell_type": "code", "execution_count": null, "id": "507817a6", "metadata": {}, "outputs": [], "source": [ "classe, frequency_train = np.unique(train.target, return_counts=True)\n", "print(classe)\n", "print(\"Class frequencies in training set: \",frequency_train)\n", "_, frequency_test = np.unique(test.target, return_counts=True)\n", "print(\"Class frequencies in test set: \",frequency_train)\n", "\n", "def size_mb(docs):\n", " return sum(len(s.encode('utf-8')) for s in docs) / 1e6\n", "print(\"Size of text in training set (all categories, w/ headers & footers): %0.3f MB\" % size_mb(train_allcat.data))\n", "print(\"Size of text in training set: %0.3f MB\" % size_mb(train.data))\n", "print(\"Size of text in test set: %0.3f MB\" % size_mb(test.data))" ] }, { "cell_type": "markdown", "id": "c373bb50", "metadata": {}, "source": [ "Apparently, the number of the samples is quite balanced across the categories (with a considerable smaller number only of samples except for the last category).\n", "The order of the categories is as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "ae82bab3", "metadata": {}, "outputs": [], "source": [ "train.target_names" ] }, { "cell_type": "code", "execution_count": null, "id": "96ebdc4c", "metadata": {}, "outputs": [], "source": [ "# So the category with the smallest number of samples in the training set (among the considered ones) are:\n", "train.target_names[0],train.target_names[10],train.target_names[11]\n" ] }, { "cell_type": "markdown", "id": "f929ab64", "metadata": {}, "source": [ "### Feature Extraction\n", "\n", "#### Count Vectorizer\n", "\n", "We now extract **features** which we can use for learning algorithms. We choose the [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html?highlight=countvectorizer#sklearn.feature_extraction.text.CountVectorizer) first, using a bound of $2^{14}$ features to be used - this is done to make the computations below faster. Ideally, one would increase this bound." ] }, { "cell_type": "code", "execution_count": null, "id": "1779752e", "metadata": {}, "outputs": [], "source": [ "from time import time\n", "from optparse import OptionParser\n", "import sys\n", "from sklearn.feature_extraction.text import TfidfVectorizer\n", "op = OptionParser()\n", "argv = []\n", "sys.argv[1:]\n", "(opts, args) = op.parse_args(argv)" ] }, { "cell_type": "code", "execution_count": null, "id": "d7c700f3", "metadata": {}, "outputs": [], "source": [ "print(\"Extracting features from the training data using a count vectorizer\")\n", "t0 = time()\n", "countvec = CountVectorizer(stop_words='english',max_features=2**14)\n", "X_train = countvec.fit_transform(train.data)\n", "X_test = countvec.transform(test.data) # Extracting features from the test data using the same vectorizer\n", "duration = time() - t0\n", "# check computational effort to compute the features\n", "print(\"done in %fs at %0.3fMB/s\" % (duration, size_mb(train.data) / duration))\n", "print(\"n_samples: %d, n_features: %d\" % X_train.shape)" ] }, { "cell_type": "markdown", "id": "7b023691", "metadata": {}, "source": [ "We create a pandas DataFrame in order to get an impression about the created dictionary and feature vectors:" ] }, { "cell_type": "code", "execution_count": null, "id": "b6ee4e70", "metadata": {}, "outputs": [], "source": [ "import pandas as pd\n", "X_train_countvec_df = pd.DataFrame(X_train.todense())\n", "\n", "# This are the different \"words\" that are in our vocabulary:\n", "X_train_countvec_df.columns = sorted(countvec.vocabulary_)\n", "print(X_train_countvec_df.columns)\n", "# This shows how a rows of our feature matrix look like:\n", "X_train_countvec_df" ] }, { "cell_type": "code", "execution_count": null, "id": "610bc05d", "metadata": {}, "outputs": [], "source": [ "X_train_countvec_df" ] }, { "cell_type": "markdown", "id": "ec4db2b1", "metadata": {}, "source": [ "## TF-IDF Vectorizer\n", "\n", "We repeat the feature extraction step using the [term frequency-inverse document frequency (TF-IDF)](https://en.wikipedia.org/wiki/Tf–idf) embedding of the documents." ] }, { "cell_type": "code", "execution_count": null, "id": "a7f28045", "metadata": {}, "outputs": [], "source": [ "print(\"Extracting features from the training data using a TF-IDF vectorizer\")\n", "t0 = time()\n", "vectorizer_tfidf = TfidfVectorizer(sublinear_tf=True, max_df=0.5,\n", " stop_words='english',max_features=2**14)\n", "X_train_tfidf = vectorizer_tfidf.fit_transform(train.data)\n", "X_test_tfidf = vectorizer_tfidf.transform(test.data) # Extracting features from the test data using the same vectorizer\n", "duration = time() - t0\n", "print(\"done in %fs at %0.3fMB/s\" % (duration, size_mb(train.data) / duration))\n", "print(\"n_features: %d\" % X_train_tfidf.shape[1])\n", "print()" ] }, { "cell_type": "markdown", "id": "1ce240f4", "metadata": {}, "source": [ "## Applying the learning algorithms\n", "\n", "### Logistic Regression\n", "\n", "Based on the both sets of extracted features, we now apply logistic regression to build a generalized linear model. Please note the [algorithmic options of logistic regression of scikit-learn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regression#sklearn.linear_model.LogisticRegression): The choice of the 'solver' becomes relevant here as the dataset is not that small. We also note that the default choice in the method is **with $\\ell_2$-regularization** (with regularization parameter $C=1$). See also [these instructions](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression) for some comments on the differnet options.\n", "\n", "We first use the count vectorizer encoding." ] }, { "cell_type": "code", "execution_count": null, "id": "8511ba04", "metadata": {}, "outputs": [], "source": [ "y_train = train[\"target\"] \n", "y_test=test['target']\n", "\n", "from sklearn.linear_model import LogisticRegression\n", "t0 = time()\n", "lr =LogisticRegression(solver='lbfgs',max_iter=150,multi_class='multinomial').fit(X_train,train.target)\n", "print(\"Runtime of training of \"+str(lr)+\" with count vectorizer encoding: \",format(time()-t0,\"0.3f\"),\"s\")\n", "t0 = time()\n", "print(\"Mean accuracy of model \"+str(lr)+\" on training data with count vectorizer encoding: \",lr.score(X_train,y_train))\n", "print(\"Mean accuracy of model \"+str(lr)+\" on test data with count vectorizer encoding: \",lr.score(X_test,y_test))\n", "print(\"Runtime of evaluating \"+str(lr)+\" on training and test data with count vectorizer encoding: \",format(time()-t0,\"0.3f\"),\"s\")\n", "\n", "\n", "# We observe that the training time is reasonable, but not trivial anymore. On the other hand, evaluating the model (calculating the accuaracy) is still very quick\n", "# \n", "# Now, we repeat this with the TF-IDF features." ] }, { "cell_type": "code", "execution_count": null, "id": "d0359b4f", "metadata": {}, "outputs": [], "source": [ "t0 = time()\n", "lr2 =LogisticRegression(solver='lbfgs',max_iter=150,multi_class='multinomial').fit(X_train_tfidf,train.target)\n", "print(\"Runtime of training of \"+str(lr2)+\" with TF-IDF encoding: \",format(time()-t0,\"0.3f\"),\"s\")\n", "t0 = time()\n", "print(\"Mean accuracy of model \"+str(lr2)+\" on training data with TF-IDF encoding: \",lr2.score(X_train_tfidf,y_train))\n", "print(\"Mean accuracy of model \"+str(lr2)+\" on test data with TF-IDF encoding: \",lr2.score(X_test_tfidf,y_test))\n", "print(\"Runtime of evaluating \"+str(lr2)+\" on training and test data with TF-IDF encoding: \",format(time()-t0,\"0.3f\"),\"s\")" ] }, { "cell_type": "markdown", "id": "d7fb6f3e", "metadata": {}, "source": [ "We note that the training of the logistic regression model takes considerably longer than the evaluation. The test accuracy for the TF-IDF encoding is better than for the counting encoding. \n", "The very high training accuracy suggests that we are in a situation where overfitting occurs.\n", "\n", "We can test the predictive quality of these models also on custom text documents:" ] }, { "cell_type": "code", "execution_count": null, "id": "a0eb4efc", "metadata": {}, "outputs": [], "source": [ "text_test = ['Bill Gates and Steve Jobs are computer enterpreneurs.','Elon Musk wants to fly to Mars.']\n", "X_custom_counts=countvec.transform(text_test)\n", "X_custom_tfidf=vectorizer_tfidf.transform(text_test)\n", "\n", "predicted=lr.predict(X_custom_counts)\n", "predicted2=lr2.predict(X_custom_tfidf)\n", "\n", "print(\"Predicuts of model \"+str(lr)+\" (count vectorizer encoding)\")\n", "for doc,category in zip(text_test,predicted):\n", " print('%r => %s'%(doc,train.target_names[category]))\n", "print(\"Predicuts of model \"+str(lr2)+\" (TF-IDF encoding)\")\n", "for doc,category in zip(text_test,predicted2):\n", " print('%r => %s'%(doc,train.target_names[category]))" ] }, { "cell_type": "markdown", "id": "51238566", "metadata": {}, "source": [ "As we have been under the impression that overfitting has been occuring, we now run a cross-validation on $\\ell_2$-regularized logistic regression. We use \"LogisticRegressionCV\" instead of the generic method \"GridSearchCV\" (applied to a LogisticRegression) as this uses some tricks to make it computationally more efficient.\n", " \n", "However, running this cross validations for different regualarization parameters on a 5-fold split will still take a considerable amount of time. In practice, this could be run efficiently\n", "using distributed computing, even for larger data sets." ] }, { "cell_type": "code", "execution_count": null, "id": "8d6dc7f0", "metadata": {}, "outputs": [], "source": [ "from sklearn.linear_model import LogisticRegressionCV\n", "# We focus on the TF-IDF model as it exhibited better performance above.\n", "t0 = time()\n", "lr_optimal = LogisticRegressionCV(Cs=20,cv=5, random_state=10,penalty='l2', max_iter=150,multi_class='multinomial',solver='lbfgs',refit=True).fit(X_train_tfidf, y_train)\n", "print(\"Runtime of crossvalidation:\",format(time()-t0,\"0.3f\"),\"s\")\n", "print(\"Mean accuracy of model \"+str(lr_optimal)+\" on training data with TF-IDF encoding: \",lr_optimal.score(X_train_tfidf,y_train))\n", "print(\"Mean accuracy of model \"+str(lr_optimal)+\" on test data with TF-IDF encoding: \",lr_optimal.score(X_test_tfidf,y_test))" ] }, { "cell_type": "markdown", "id": "7f7f8b3f", "metadata": {}, "source": [ "We note that the cross validation was not successful in improving the accuracy on the hold-out test set.\n", " \n", "We plot the validation errors for the different regularization parameters $C$:" ] }, { "cell_type": "code", "execution_count": null, "id": "64a1b2dd", "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "plt.figure(figsize=(10,10))\n", "plt.plot(lr_optimal.Cs_,lr_optimal.scores_[1][1])\n", "ax = plt.gca()\n", "ax.set_xscale('log')\n", "ax.set(xlabel='Parameter C', ylabel='Validation accuracy')\n", "ax.set_box_aspect(1)\n", "\n", "\n" ] }, { "cell_type": "markdown", "id": "6883dbea", "metadata": {}, "source": [ "We note that the maximal validation accuracy is as follows:" ] }, { "cell_type": "code", "execution_count": null, "id": "fd09c7e0", "metadata": {}, "outputs": [], "source": [ "np.max(lr_optimal.scores_[1][1])" ] }, { "cell_type": "markdown", "id": "203a0d41", "metadata": {}, "source": [ "**Can you explain why the maximal validation accuracy is considerably larger than the test accuracy?**\n", "\n", "To get a better idea where the misclassifications take place (i.e., in which categories), we plot the [confusion matrix](https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix):" ] }, { "cell_type": "code", "execution_count": null, "id": "67f2e83a", "metadata": {}, "outputs": [], "source": [ "from sklearn.metrics import plot_confusion_matrix\n", "import matplotlib.pyplot as plt\n", "y_test_predicted = lr_optimal.predict(X_test_tfidf)\n", "plt.figure(figsize=(30,18))\n", "ax1 = plt.gca()\n", "plot_confusion_matrix(lr_optimal,X_test_tfidf,y_test,display_labels=categories,ax=ax1)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "168c125e", "metadata": {}, "source": [ "### K-Nearest Neighbors & Support Vector Machines (optional, do if you have time)\n", " \n", "Instead of logistic regression, we can also other classifiers.\n", "\n", "**Please use a k-nearest neighbor classifier, for different k, and a support vector machine classifier.**\n", "Print the training runtime for each model, the mean accuracy on training data and test data sets." ] }, { "cell_type": "code", "execution_count": null, "id": "9d22c143", "metadata": {}, "outputs": [], "source": [ "from sklearn.neighbors import KNeighborsClassifier\n", "k_parameter = # try different values of k\n", "t0 = time()\n" ] }, { "cell_type": "markdown", "id": "77cf50ac", "metadata": {}, "source": [ "We observe that the performance of the nearest neighbors classifier is not very good in this setting (considering $5$ neighbors). Do you have any intuition why this is the case?" ] }, { "cell_type": "markdown", "id": "f7b0cc8a", "metadata": {}, "source": [ "Now, perform cross-validation over the parameter k of k-nearest neighbors for the problem, using GridSearchCV." ] }, { "cell_type": "code", "execution_count": null, "id": "64fcfdf4", "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import GridSearchCV\n", "ks=np.arange(1,41,2) # create vector of logarithmically interpolated values between 10^(-5) and 10^(9)\n", "parameters = {'n_neighbors':ks}\n" ] }, { "cell_type": "markdown", "id": "75682822", "metadata": {}, "source": [ "**Find and provide the training and validation accuracies of the cross validation/gridsearch.**" ] }, { "cell_type": "code", "execution_count": null, "id": "37759ea2", "metadata": {}, "outputs": [], "source": [ "train_accuracies = \n", "validation_accuracies = " ] }, { "cell_type": "markdown", "id": "b02fde9c", "metadata": {}, "source": [ "Plotting the training and validation accuracies, we observe the following:" ] }, { "cell_type": "code", "execution_count": null, "id": "15b86221", "metadata": {}, "outputs": [], "source": [ "plt.figure()\n", "plt.plot(ks,train_accuracies)\n", "plt.plot(ks,validation_accuracies)\n", "ax = plt.gca()\n", "ax.set(xlabel='alpha', ylabel='accuracy',\n", " title='Cross-validation accuracies for k-Nearest Neighbors')\n", "ax.legend([\"training data\",\"validation data\"], loc=0)\n", "print(\"Best parameter k:\",str(gridsearch.best_params_['n_neighbors']))\n", "#ax.set_xticks()\n", "print(\"Mean accuracy of model \"+str(gridsearch.best_estimator_)+\" on test data with TF-IDF encoding: \",gridsearch.best_estimator_.score(X_test_tfidf,y_test))\n" ] }, { "cell_type": "markdown", "id": "5e712f04", "metadata": {}, "source": [ "**How crossvalidated k-nn compare to the support vector machine you used?**" ] }, { "cell_type": "code", "execution_count": null, "id": "53a0313d", "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", "id": "118c671a", "metadata": {}, "source": [ "**Is there a parameter you can \"cross-validate\" (optimize by cross validation) if using a support vector machine?**" ] }, { "cell_type": "code", "execution_count": null, "id": "a0090d2e", "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.11.0" } }, "nbformat": 4, "nbformat_minor": 5 }