ITCS 4111/5111: Introduction to Natural Language Processing
Fall 2023
Time and Location: Tue, Thu 2:30 – 3:45pm, HHS 376
Instructor & TAs: | Razvan Bunescu | Sivani Josyula | Manogna Chennuru
Office:           | Woodward 210F | Zoom & Burson 239B | Zoom & Burson 239B
Office hours:     | Tue, Thu 4:00 – 5:00pm | Tue, Wed 10:00 – 11:00am | Thu, Fri 10:00 – 11:00am
Email:            | rbunescu @ charlotte edu | sjosyul2 @ charlotte edu | mchennu2 @ charlotte edu
Recommended Texts (PDF available online):
Speech and Language Processing (3rd edition draft), by Daniel Jurafsky and James H. Martin. 2023.
Natural Language Processing, by Jacob Eisenstein. 2019.
Course description:
Natural Language Processing (NLP) is a branch of Artificial Intelligence concerned with developing computer systems that can analyze or generate natural language. This course will introduce fundamental linguistic analysis tasks, including tokenization, word representations, text classification, syntactic and semantic parsing, and coreference resolution. Machine learning (ML) based techniques will be introduced, ranging from Naive Bayes and logistic regression to Transformer-based language models, and applied to a number of NLP applications such as sentiment classification, information extraction, and question answering. Overall, the aim of this course is to equip students with an array of techniques and tools that they can use to solve known NLP tasks, as well as new types of NLP problems.
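As a taste of the kind of task covered in the course, below is a minimal sketch of sentiment classification with a bag-of-words Naive Bayes model; the scikit-learn pipeline and the toy training examples are illustrative assumptions, not material taken from the course itself.

    # Minimal sketch: sentiment classification with Naive Bayes (illustrative only).
    # Assumes scikit-learn is installed; the toy training data below is made up.
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline

    train_texts = ["a wonderful, moving film", "dull and predictable",
                   "great acting and a great story", "a boring mess"]
    train_labels = ["pos", "neg", "pos", "neg"]

    # Bag-of-words counts feeding a multinomial Naive Bayes classifier.
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)

    print(model.predict(["a wonderful story"]))  # likely ['pos'] on this toy data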
Prerequisites:
Students are expected to be comfortable with programming in Python and with data structures and algorithms (ITSC 2214), and to have basic knowledge of linear algebra (MATH 2164), statistics, and formal languages (regular and context-free grammars). Knowledge of machine learning will be very useful, though not strictly necessary. Relevant background material will be made available on this website throughout the course.
Lecture notes:
- Syllabus & Introduction
- Python for programming, linear algebra, and visualization
- Tokenization: From text to sentences and tokens
- Regular expressions
- Strengths and Weaknesses of Language Models
- Application development using GPT and Llama-2 through the Chat completion API (a minimal usage sketch follows this list)
- Text classification using Naive Bayes
- Hand notes from lecture on Sep 26: LM inference one and two; Naive Bayes example; and Bayes rule.
- Hand notes from lecture on Sep 28: Text classification task; and Naive Bayes model.
- Logistic regression
- Biases vs. fairness and rationality in NLP models
- Manual annotation for NLP
- Brat rapid annotation tool
- Word meanings; Sparse vs. dense representations of words
- Hand notes from lecture on Oct 31: one and two.
- Hand notes from lecture on Nov 2.
- Hand notes from lecture on Nov 7: one and two.
- Visualization of word embeddings with TensorFlow Embedding Projector.
- N-grams and Neural models for Language Modeling and Sequence Processing
- Machine translation, Sequence-to-sequence models and Attention
- Transformer: Self-Attention Networks
- Language Models: Pretraining and Fine-tuning
- Language Models: Prompting, In-context Learning, Chain of Thought, Instruct Tuning, RLHF
- Coreference resolution
- Syntax, constituency parsing, dependency parsing
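Related to the lecture above on application development through the Chat completion API, the following is a minimal sketch of a chat-completion request using the openai Python package (v1 interface); the model name, prompt, and reliance on an OPENAI_API_KEY environment variable are assumptions for illustration, not the course's actual assignment code.

    # Minimal chat-completion sketch (illustrative; assumes the openai package
    # is installed and OPENAI_API_KEY is set in the environment).
    from openai import OpenAI

    client = OpenAI()  # reads the API key from the environment

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model name
        messages=[
            {"role": "system", "content": "You are a concise NLP assistant."},
            {"role": "user", "content": "Split this text into sentences: "
                                        "Dr. Smith moved to Charlotte. She teaches NLP."},
        ],
    )
    print(response.choices[0].message.content)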
Homework assignments [1, 2]:
[1] The code for assignment 7 is based on an assignment from the CS224n course at Stanford on NLP with Deep Learning.
Final project:
Background reading materials:
- Python programming:
- Probability and statistics:
- Linear Algebra:
- Calculus:
Supplemental readings:
- Training language models to follow instructions with human feedback, Ouyang et al., NeurIPS 2022
- Emergent World Representations: Exploring a Sequence Model Trained on a Synthetic Task, Li et al., ICLR 2023.
- Theory of Mind May Have Spontaneously Emerged in Large Language Models, Michal Kosinski, Stanford 2023.
- Are Emergent Abilities of Large Language Models a Mirage?, Schaeffer et al., DeployableGenerativeAI 2023.
- What’s the Meaning of Superhuman Performance in Today’s NLU?, Tedeschi et al., ACL 2023.
Tools and packages:
- Natural language processing:
- Machine learning: