{ "cells": [ { "cell_type": "markdown", "id": "5b251563", "metadata": { "id": "5b251563" }, "source": [ "# Tokenization\n", "\n", "In this exercise, we will see examples of two types of tokenization:\n", "- **Rule based** tokenization: this is done using a combination of code and regular expressions that encode tokenization rules developed by experts. Here we use the `SpaCy` tokenizer as an example.\n", "- **BPE** tokenization: this is done to using statistics over a large corpus of text in order to determine how strings should be segmented into *subword* tokens. Here we use the `tiktoken` tokenizer from OpenAI." ] }, { "cell_type": "markdown", "id": "f2275eb4", "metadata": { "id": "f2275eb4" }, "source": [ "## SpaCy tokenization and other examples" ] }, { "cell_type": "code", "execution_count": 1, "id": "fe178c86", "metadata": { "id": "fe178c86" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "MS MS PROPN NNP nsubj XX True False\n", "is be AUX VBZ aux xx True True\n", "looking look VERB VBG ROOT xxxx True False\n", "at at ADP IN prep xx True True\n", "buying buy VERB VBG pcomp xxxx True False\n", "U.K. U.K. PROPN NNP nsubj X.X. False False\n", "startup startup VERB VBD ccomp xxxx True False\n", "for for ADP IN prep xxx True True\n", "$ $ SYM $ quantmod $ False False\n", "11.1 11.1 NUM CD compound dd.d False False\n", "million million NUM CD pobj xxxx True False\n", ". . PUNCT . punct . False False\n" ] } ], "source": [ "import spacy\n", "\n", "nlp = spacy.load(\"en_core_web_sm\")\n", "doc = nlp(\"MS is looking at buying U.K. startup for $11.1 million.\")\n", "for token in doc:\n", " print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,\n", " token.shape_, token.is_alpha, token.is_stop)" ] }, { "cell_type": "markdown", "id": "9b8db1cb", "metadata": { "id": "9b8db1cb" }, "source": [ "### Display syntactic dependencies" ] }, { "cell_type": "code", "execution_count": null, "id": "6ea85f0f", "metadata": { "id": "6ea85f0f", "scrolled": true }, "outputs": [], "source": [ "import spacy\n", "from spacy import displacy\n", "\n", "nlp = spacy.load(\"en_core_web_sm\")\n", "doc = nlp(\"Microsoft plans to buy Activision for $69 billion.\")\n", "displacy.render(doc, style = \"dep\")" ] }, { "cell_type": "code", "execution_count": null, "id": "62f31857", "metadata": { "id": "62f31857" }, "outputs": [], "source": [ "import spacy\n", "from spacy import displacy\n", "\n", "nlp = spacy.load(\"en_core_web_sm\")\n", "doc = nlp(\"Microsoft plans to buy Activision for $69 billion.\")\n", "displacy.render(doc, style = \"ent\")" ] }, { "cell_type": "markdown", "id": "d57e2eff", "metadata": { "id": "d57e2eff" }, "source": [ "### Use directly the tokenizer component" ] }, { "cell_type": "code", "execution_count": 4, "id": "ea337ee2", "metadata": { "id": "ea337ee2", "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "U.S. economy is healing , but there 's a long way to go . The spread of Covid-19 led to surge in orders for factory robots . Fine - tuning models is time - consuming . \n" ] } ], "source": [ "from spacy.lang.en import English\n", "\n", "nlp = English()\n", "tokenizer = nlp.tokenizer\n", "tokens = tokenizer(\"U.S. economy is healing, but there's a long way to go. 
\"\n", " \"The spread of Covid-19 led to surge in orders for factory robots.\"\n", " \"Fine-tuning models is time-consuming.\")\n", "\n", "for token in tokens:\n", " print(token.text, end = ' ')\n", "print()" ] }, { "cell_type": "code", "execution_count": 5, "id": "9ef3cbab", "metadata": { "id": "9ef3cbab" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I think what she said is soooo craaaazy ! \n" ] } ], "source": [ "from spacy.lang.en import English\n", "\n", "nlp = English()\n", "tokenizer = nlp.tokenizer\n", "tokens = tokenizer(\"I think what she said is soooo craaaazy!\")\n", "for token in tokens:\n", " print(token.text, end = ' ')\n", "print()" ] }, { "cell_type": "markdown", "id": "e3cb106b", "metadata": { "id": "e3cb106b" }, "source": [ "### Use only the tokenizer component by disabling other pipeline modules" ] }, { "cell_type": "code", "execution_count": 6, "id": "35b3b8a6", "metadata": { "id": "35b3b8a6", "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "U.S. economy is healing , but there 's a long way to go . The spread of Covid-19 led to surge in orders for factory robots . \n", "\n", "U.S. economy is healing , but there 's a long way to go . \n", "The spread of Covid-19 led to surge in orders for factory robots . \n" ] } ], "source": [ "import spacy\n", "nlp = spacy.load(\"en_core_web_sm\", exclude = ['tagger, ner, parser'])\n", "\n", "doc = nlp(\"U.S. economy is healing, but there's a long way to go. \"\n", " \"The spread of Covid-19 led to surge in orders for factory robots.\")\n", "\n", "for token in doc:\n", " print(token.text, end = ' ')\n", "print()\n", "print()\n", "\n", "for sent in doc.sents:\n", " for token in sent:\n", " print(token.text, end = ' ')\n", " print()" ] }, { "cell_type": "markdown", "id": "3930099a", "metadata": { "id": "3930099a" }, "source": [ "### Use a special sentencizer that does not require syntactic parsing, for efficiency" ] }, { "cell_type": "code", "execution_count": 7, "id": "f51c5246", "metadata": { "id": "f51c5246" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "U.S. economy is healing , but there 's a long way to go . \n", "The spread of Covid-19 led to surge in orders for factory robots . \n" ] } ], "source": [ "from spacy.lang.en import English\n", "\n", "nlp = English()\n", "nlp.add_pipe(\"sentencizer\")\n", "\n", "doc = nlp(\"U.S. economy is healing, but there's a long way to go. \"\n", " \"The spread of Covid-19 led to surge in orders for factory robots.\")\n", "\n", "# Print tokens, one sentence per line.\n", "for sent in doc.sents:\n", " for token in sent:\n", " print (token, end = ' ')\n", " print()" ] }, { "cell_type": "markdown", "id": "eda8937b", "metadata": { "id": "eda8937b" }, "source": [ "### By default, it seems spaCy creates tokens for newline characters" ] }, { "cell_type": "code", "execution_count": 8, "id": "99f5e45e", "metadata": { "id": "99f5e45e" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I am a contract-drafting em,\n", "The loyalest of lawyers!\n", "I draw up terms for deals 'twixt firms\n", "To service my employers! \n", "\n", "Stanza has 2 sentences.\n", "\n", "I am a contract - drafting em , The loyalest of lawyers ! \n", "I draw up terms for deals ' twixt firms To service my employers ! 
\n" ] } ], "source": [ "nlp = spacy.load(\"en_core_web_sm\")\n", "\n", "stanza = \"I am a contract-drafting em,\\n\" \\\n", " \"The loyalest of lawyers!\\n\" \\\n", " \"I draw up terms for deals 'twixt firms\\n\" \\\n", " \"To service my employers!\"\n", "print(stanza, '\\n')\n", "\n", "doc = nlp(stanza)\n", "\n", "print(f\"Stanza has {len(list(doc.sents))} sentences.\\n\")\n", "\n", "# Print tokens, one sentence per line.\n", "for sent in doc.sents:\n", " for token in sent:\n", " if token.text == '\\n':\n", " print('', end = ' ')\n", " else:\n", " print(token, end = ' ')\n", " print()" ] }, { "cell_type": "markdown", "id": "beef4aa6", "metadata": { "id": "beef4aa6" }, "source": [ "### Replace newlines with white spaces" ] }, { "cell_type": "code", "execution_count": 9, "id": "df65dafc", "metadata": { "id": "df65dafc" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "I am a contract-drafting em, The loyalest of lawyers! I draw up terms for deals 'twixt firms To service my employers!\n", "\n", "Stanza has 2 sentences.\n", "\n", "I am a contract - drafting em , The loyalest of lawyers ! \n", "I draw up terms for deals ' twixt firms To service my employers ! \n" ] } ], "source": [ "stanza = \"I am a contract-drafting em, \" \\\n", " \"The loyalest of lawyers! \" \\\n", " \"I draw up terms for deals 'twixt firms \" \\\n", " \"To service my employers!\"\n", "print(stanza)\n", "print()\n", "\n", "doc = nlp(stanza)\n", "\n", "print(f\"Stanza has {len(list(doc.sents))} sentences.\\n\")\n", "\n", "# Print tokens, one sentence per line.\n", "for sent in doc.sents:\n", " for token in sent:\n", " print (token, end = ' ')\n", " print()\n", "\n", "#for token in doc:\n", "# print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,\n", "# token.shape_, token.is_alpha, token.is_stop)" ] }, { "cell_type": "markdown", "id": "98d8adbd", "metadata": { "id": "98d8adbd" }, "source": [ "## BPE tokenization using `tiktoken`" ] }, { "cell_type": "code", "execution_count": 10, "id": "0f986314", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 13, "status": "ok", "timestamp": 1756836696317, "user": { "displayName": "Razvan Bunescu", "userId": "08159777761660776328" }, "user_tz": 240 }, "id": "0f986314", "outputId": "c0672edd-587d-4204-b442-468af1382729" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[708, 39721, 1790, 436, 637, 81, 4628, 304, 78311, 24751, 420, 19367, 0]\n" ] } ], "source": [ "# Uncomment this line if tiktoken is not yet installed on your machine.\n", "#!pip install tiktoken\n", "\n", "import tiktoken\n", "\n", "# To get the tokeniser corresponding to a specific model in the OpenAI API:\n", "enc = tiktoken.encoding_for_model(\"gpt-4\")\n", "\n", "# The .encode() method converts a text string into a list of token integers.\n", "ltokens = enc.encode(\"soooo much rrrracing in Kannapolis this Summer!\")\n", "print(ltokens)" ] }, { "cell_type": "code", "execution_count": 11, "id": "3cadd5c7", "metadata": { "colab": { "base_uri": "https://localhost:8080/", "height": 35 }, "executionInfo": { "elapsed": 6, "status": "ok", "timestamp": 1756836770758, "user": { "displayName": "Razvan Bunescu", "userId": "08159777761660776328" }, "user_tz": 240 }, "id": "3cadd5c7", "outputId": "f2059859-8bc0-4678-899c-b555f0aff1ed" }, "outputs": [ { "data": { "text/plain": [ "'soooo much rrrracing in Kannapolis this Summer!'" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The 
.decode() method converts a list of token integers to a string.\n", "enc.decode(ltokens)" ] }, { "cell_type": "code", "execution_count": 12, "id": "ca0788cc", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 4, "status": "ok", "timestamp": 1756836795019, "user": { "displayName": "Razvan Bunescu", "userId": "08159777761660776328" }, "user_tz": 240 }, "id": "ca0788cc", "outputId": "a57c6143-4fc8-4c31-fd75-4437e2c8550a" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[b'so', b'ooo', b' much', b' r', b'rr', b'r', b'acing', b' in', b' Kann', b'apolis', b' this', b' Summer', b'!']\n" ] } ], "source": [ "# The .decode_single_token_bytes() method safely converts a single integer token to the bytes it represents.\n", "tokens = [enc.decode_single_token_bytes(token) for token in ltokens]\n", "print(tokens)" ] }, { "cell_type": "code", "execution_count": 13, "id": "0c53df40", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 23, "status": "ok", "timestamp": 1756836881529, "user": { "displayName": "Razvan Bunescu", "userId": "08159777761660776328" }, "user_tz": 240 }, "id": "0c53df40", "outputId": "b9f3fbb7-7647-4d6b-ee33-3ef1dfed5990" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[b'so', b'ooo', b' much', b' r', b'rr', b'r', b'acing', b' in', b' Kann', b'apolis', b' this', b' Summer', b'!']\n" ] } ], "source": [ "# We usually combine .encode() with .decode_single_token_bytes() into one list comprehension\n", "# to get the list of tokens as byte strings.\n", "tokens = [enc.decode_single_token_bytes(token) for token in enc.encode(\"soooo much rrrracing in Kannapolis this Summer!\")]\n", "\n", "# Note the 'b' in front of each string, which means that the string you see is a sequence of bytes.\n", "print(tokens)" ] }, { "cell_type": "code", "execution_count": 14, "id": "lAomeqOCTYyJ", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 26, "status": "ok", "timestamp": 1756836890016, "user": { "displayName": "Razvan Bunescu", "userId": "08159777761660776328" }, "user_tz": 240 }, "id": "lAomeqOCTYyJ", "outputId": "83df025a-790a-470d-8d61-47fc66c71ff1" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['so', 'ooo', ' much', ' r', 'rr', 'r', 'acing', ' in', ' Kann', 'apolis', ' this', ' Summer', '!']\n" ] } ], "source": [ "# To translate to the standard representation (utf-8), you can use token.decode('utf-8').\n", "utf8_tokens = [token.decode('utf-8') for token in tokens]\n", "print(utf8_tokens)" ] }, { "cell_type": "code", "execution_count": 15, "id": "2e543c7b", "metadata": { "colab": { "base_uri": "https://localhost:8080/" }, "executionInfo": { "elapsed": 3, "status": "ok", "timestamp": 1756836920748, "user": { "displayName": "Razvan Bunescu", "userId": "08159777761660776328" }, "user_tz": 240 }, "id": "2e543c7b", "outputId": "be125a5d-6fa7-4b47-bdd2-8eaac0c65daf" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[b'His', b' \\t', b' \\n \\n', b'\\t ', b'\\t ', b' amb', b'ivalence', b' was', b' ', b' perplex', b'ing', b'.']\n" ] } ], "source": [ "# Let's see how tiktoken deals with different types of white space.\n", "tokens = [enc.decode_single_token_bytes(token) for token in enc.encode(\"His \\t \\n \\n\\t \\t ambivalence was perplexing.\")]\n", "#tokens = enc.decode_single_token_bytes(enc.encode(\"His ambivalence was perplexing.\"))\n", "\n", "print(tokens)" ] }, { 
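"cell_type": "markdown", "id": "bpe-sketch-md", "metadata": {}, "source": [ "The subword vocabulary used by `tiktoken` was learned with the BPE procedure: starting from single symbols, repeatedly count adjacent symbol pairs over a large corpus and merge the most frequent pair into a new vocabulary entry. The cell below is a simplified sketch of that learning loop on a toy corpus; real byte-level BPE (as used by `tiktoken`) works on bytes, uses a much larger corpus, and applies additional pre-tokenization." ] },
{ "cell_type": "code", "execution_count": null, "id": "bpe-sketch-code", "metadata": {}, "outputs": [], "source": [ "# A minimal sketch of BPE vocabulary learning on a toy corpus (illustrative only).\n", "from collections import Counter\n", "\n", "def learn_bpe(words, num_merges):\n", "    segs = [list(w) for w in words]       # start from single characters\n", "    merges = []\n", "    for _ in range(num_merges):\n", "        pairs = Counter()                 # count adjacent symbol pairs\n", "        for seg in segs:\n", "            pairs.update(zip(seg, seg[1:]))\n", "        if not pairs:\n", "            break\n", "        best = max(pairs, key=pairs.get)  # merge the most frequent pair\n", "        merges.append(best)\n", "        new_segs = []\n", "        for seg in segs:\n", "            out, i = [], 0\n", "            while i < len(seg):\n", "                if i + 1 < len(seg) and (seg[i], seg[i + 1]) == best:\n", "                    out.append(seg[i] + seg[i + 1])\n", "                    i += 2\n", "                else:\n", "                    out.append(seg[i])\n", "                    i += 1\n", "            new_segs.append(out)\n", "        segs = new_segs\n", "    return merges, segs\n", "\n", "toy_corpus = [\"low\", \"low\", \"lower\", \"newest\", \"newest\", \"newest\", \"widest\"]\n", "merges, segs = learn_bpe(toy_corpus, num_merges = 6)\n", "print(\"Learned merges:\", merges)\n", "print(\"Segmentations: \", segs)" ] },
{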
"cell_type": "code", "execution_count": 16, "id": "96de35c3", "metadata": { "id": "96de35c3" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[b'I', b' think', b' what', b' she', b' said', b' is', b' so', b'ooo', b' cra', b'aa', b'azy', b'!']\n" ] }, { "data": { "text/plain": [ "'think'" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "tokens = [enc.decode_single_token_bytes(token) for token in enc.encode(\"I think what she said is soooo craaaazy!\")]\n", "print(tokens)\n", "\n", "tokens[1].strip().decode('utf-8')" ] }, { "cell_type": "code", "execution_count": 17, "id": "c1ea0e31", "metadata": { "id": "c1ea0e31" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "[b'The', b' perplex', b'ing', b' cat', b' sat', b' on', b' the', b' mat', b'.']\n" ] } ], "source": [ "# Another example showing subword tokens.\n", "tokens = [enc.decode_single_token_bytes(token) for token in enc.encode(\"The perplexing cat sat on the mat.\")]\n", "print(tokens)" ] }, { "cell_type": "code", "execution_count": 18, "id": "8a247f86", "metadata": { "id": "8a247f86" }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "['The', ' perplex', 'ing', ' cat', ' sat', ' on', ' the', ' mat', '.']\n" ] } ], "source": [ "# Let's decode from the byte string representation to utf-8.\n", "utf8_tokens = [token.decode('utf-8') for token in tokens]\n", "print(utf8_tokens)" ] }, { "cell_type": "code", "execution_count": null, "id": "6c174009", "metadata": { "id": "6c174009" }, "outputs": [], "source": [] } ], "metadata": { "colab": { "provenance": [] }, "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.4" } }, "nbformat": 4, "nbformat_minor": 5 }