# Tokenization

In this exercise, we look at examples of two types of tokenization:
- **Rule-based** tokenization: a combination of code and regular expressions encodes tokenization rules developed by experts. Here we use the `spaCy` tokenizer as an example.
- **BPE** (byte pair encoding) tokenization: statistics collected over a large corpus of text determine how strings are segmented into *subword* tokens. Here we use the `tiktoken` tokenizer from OpenAI.

## spaCy tokenization and other examples

In [1]:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("MS is looking at buying U.K. startup for $11.1 million.")
for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)

MS MS PROPN NNP nsubj XX True False
is be AUX VBZ aux xx True True
looking look VERB VBG ROOT xxxx True False
at at ADP IN prep xx True True
buying buy VERB VBG pcomp xxxx True False
U.K. U.K. PROPN NNP nsubj X.X. False False
startup startup VERB VBD ccomp xxxx True False
for for ADP IN prep xxx True True
$ $ SYM $ quantmod $ False False
11.1 11.1 NUM CD compound dd.d False False
million million NUM CD pobj xxxx True False
. . PUNCT . punct . False False
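
The `shape_` column above (`XX`, `xxxx`, `X.X.`, `dd.d`) abstracts each token into a pattern of character classes. The following is a minimal sketch of how such a shape could be computed, assuming that uppercase letters map to `X`, lowercase letters to `x`, digits to `d`, and that runs of the same class are truncated after four characters. The helper `word_shape` and its `max_run` parameter are hypothetical names used for illustration, not part of spaCy's API.

```python
def word_shape(text: str, max_run: int = 4) -> str:
    """Approximate a spaCy-style word shape (illustrative helper, not spaCy's API)."""
    shape = []
    last = ""
    run = 0
    for char in text:
        if char.isalpha():
            mapped = "X" if char.isupper() else "x"
        elif char.isdigit():
            mapped = "d"
        else:
            mapped = char  # punctuation and symbols are kept unchanged
        # Count how long the current run of identical shape characters is.
        run = run + 1 if mapped == last else 1
        last = mapped
        # Keep at most max_run characters of any run (assumption: 4).
        if run <= max_run:
            shape.append(mapped)
    return "".join(shape)


for word in ["MS", "looking", "U.K.", "$", "11.1"]:
    print(word, word_shape(word))
```

For the five tokens in the loop, this sketch reproduces the shapes printed in the output above.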


### Display syntactic dependencies

In [None]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Microsoft plans to buy Activision for $69 billion.")
displacy.render(doc, style = "dep")

In [None]:
import spacy
from spacy import displacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Microsoft plans to buy Activision for $69 billion.")
displacy.render(doc, style = "ent")

### Use the tokenizer component directly

In [4]:
from spacy.lang.en import English

nlp = English()
tokenizer = nlp.tokenizer
# Each string fragment ends with a space so that adjacent sentences stay separated.
tokens = tokenizer("U.S. economy is healing, but there's a long way to go. "
                   "The spread of Covid-19 led to surge in orders for factory robots. "
                   "Fine-tuning models is time-consuming.")

for token in tokens:
    print(token.text, end = ' ')
print()

U.S. economy is healing , but there 's a long way to go . The spread of Covid-19 led to surge in orders for factory robots . Fine - tuning models is time - consuming . 
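
To make the idea of rule-based tokenization concrete, here is a minimal sketch that encodes four ordered rules in a single regular expression. It is only an illustration of the approach: `TOKEN_RULES` and `rule_tokenize` are hypothetical names, and spaCy's actual tokenizer uses a much richer set of prefix, suffix, and infix rules plus exception lists.

```python
import re

# Alternatives are tried left to right, so more specific rules come first.
TOKEN_RULES = re.compile(
    r"""
    (?:[A-Za-z]\.){2,}   # abbreviations with internal periods, e.g. U.S.
    | '[a-z]+            # clitics that start with an apostrophe, e.g. 's
    | \w+                # runs of letters, digits, or underscores
    | [^\w\s]            # any other single non-space character
    """,
    re.VERBOSE,
)


def rule_tokenize(text: str) -> list[str]:
    """Split text into tokens using the ordered rules above."""
    return TOKEN_RULES.findall(text)


tokens = rule_tokenize("U.S. economy is healing, but there's a long way to go.")
print(" ".join(tokens))
```

On this sentence the sketch produces the same tokens as the spaCy output above. It diverges on other inputs: for example, it splits `Covid-19` into `Covid`, `-`, `19`, whereas spaCy keeps it as one token.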


In [5]:
from spacy.lang.en import English

nlp = English()
tokenizer = nlp.tokenizer
tokens = tokenizer("I think what she said is soooo craaaazy!")
for token in tokens:
    print(token.text, end = ' ')
print()

I think what she said is soooo craaaazy ! 


### Use only the tokenizer component by disabling other pipeline modules

In [6]:
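import spacy

# Each component name must be its own string; a single comma-separated string
# such as 'tagger, ner, parser' matches no component, so nothing is excluded.
# The parser stays in the pipeline because doc.sents below relies on it.
nlp = spacy.load("en_core_web_sm",
                 exclude = ["tagger", "attribute_ruler", "lemmatizer", "ner"])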

doc = nlp("U.S. economy is healing, but there's a long way to go. "
          "The spread of Covid-19 led to surge in orders for factory robots.")

# Print all tokens on one line.
for token in doc:
    print(token.text, end = ' ')
print()
print()

# Print tokens, one sentence per line.
for sent in doc.sents:
    for token in sent:
        print(token.text, end = ' ')
    print()

U.S. economy is healing , but there 's a long way to go . The spread of Covid-19 led to surge in orders for factory robots . 

U.S. economy is healing , but there 's a long way to go . 
The spread of Covid-19 led to surge in orders for factory robots . 


### Use a special sentencizer that does not require syntactic parsing, for efficiency

In [7]:
from spacy.lang.en import English

nlp = English()
nlp.add_pipe("sentencizer")

doc = nlp("U.S. economy is healing, but there's a long way to go. "
 "The spread of Covid-19 led to surge in orders for factory robots.")

# Print tokens, one sentence per line.
for sent in doc.sents:
    for token in sent:
        print(token, end = ' ')
    print()

U.S. economy is healing , but there 's a long way to go . 
The spread of Covid-19 led to surge in orders for factory robots . 
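
A sentence splitter that works without syntactic parsing can be as simple as closing a sentence whenever a sentence-final punctuation token appears. The sketch below illustrates that idea on an already tokenized text. The function `split_sentences` and its `boundary_chars` parameter are hypothetical, and the sketch does not claim to mirror the internals of spaCy's `sentencizer`.

```python
def split_sentences(tokens, boundary_chars=(".", "!", "?")):
    """Group tokens into sentences, closing a sentence after each boundary token."""
    sentences, current = [], []
    for token in tokens:
        current.append(token)
        if token in boundary_chars:
            sentences.append(current)
            current = []
    if current:  # keep trailing tokens that lack final punctuation
        sentences.append(current)
    return sentences


# Token sequence copied from the output above.
tokens = ("U.S. economy is healing , but there 's a long way to go . "
          "The spread of Covid-19 led to surge in orders for factory robots .").split()

for sentence in split_sentences(tokens):
    print(" ".join(sentence))
```

Because the tokenizer has already kept `U.S.` as a single token, its internal periods do not trigger a sentence boundary here.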


### By default, spaCy creates tokens for newline characters

In [8]:
nlp = spacy.load("en_core_web_sm")

stanza = "I am a contract-drafting em,\n" \
 "The loyalest of lawyers!\n" \
 "I draw up terms for deals 'twixt firms\n" \
 "To service my employers!"
print(stanza, '\n')

doc = nlp(stanza)

print(f"Stanza has {len(list(doc.sents))} sentences.\n")

# Print tokens, one sentence per line.
for sent in doc.sents:
    for token in sent:
        # Newline tokens are printed as an empty string instead of a line break.
        if token.text == '\n':
            print('', end = ' ')
        else:
            print(token, end = ' ')
    print()

I am a contract-drafting em,
The loyalest of lawyers!
I draw up terms for deals 'twixt firms
To service my employers! 

Stanza has 2 sentences.

I am a contract - drafting em , The loyalest of lawyers ! 
I draw up terms for deals ' twixt firms To service my employers ! 


### Replace newlines with white spaces

In [9]:
stanza = "I am a contract-drafting em, " \
 "The loyalest of lawyers! " \
 "I draw up terms for deals 'twixt firms " \
 "To service my employers!"
print(stanza)
print()

doc = nlp(stanza)

print(f"Stanza has {len(list(doc.sents))} sentences.\n")

# Print tokens, one sentence per line.
for sent in doc.sents:
    for token in sent:
        print(token, end = ' ')
    print()

#for token in doc:
# print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
# token.shape_, token.is_alpha, token.is_stop)

I am a contract-drafting em, The loyalest of lawyers! I draw up terms for deals 'twixt firms To service my employers!

Stanza has 2 sentences.

I am a contract - drafting em , The loyalest of lawyers ! 
I draw up terms for deals ' twixt firms To service my employers ! 
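
Instead of retyping the text with spaces, the newlines can also be replaced programmatically before the text is passed to the pipeline. Here is a minimal sketch using only the Python standard library; `normalize_whitespace` is a hypothetical helper name, and it collapses every run of whitespace (spaces, tabs, newlines) into a single space.

```python
def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace, including newlines, into one space."""
    return " ".join(text.split())


stanza = ("I am a contract-drafting em,\n"
          "The loyalest of lawyers!\n"
          "I draw up terms for deals 'twixt firms\n"
          "To service my employers!")

print(normalize_whitespace(stanza))
```

The result is the same single-line string as the manually edited stanza above. Note that this discards line-break information, which may matter for text such as poetry.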


## BPE tokenization using `tiktoken`
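
Before turning to `tiktoken`, the core idea of BPE can be sketched on a toy corpus: start from single characters, repeatedly count adjacent symbol pairs, and merge the most frequent pair into a new symbol. The code below is a simplified character-level illustration under that assumption. The functions `learn_bpe`, `merge_pair`, and `apply_bpe` and the toy word list are hypothetical; `tiktoken` itself works on bytes with a pre-trained merge table, so this is not its implementation.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of words (toy corpus)."""
    # Each distinct word starts as a tuple of single characters, with its frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        # Merge the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def apply_bpe(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = learn_bpe(corpus, num_merges=4)
print(merges)
print(apply_bpe("lowest", merges))
```

The word `lowest` never occurs in the toy corpus, yet the learned merges segment it into the subwords `low` and `est`, which is the behavior that makes subword tokenizers robust to unseen words.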

In [10]:
# Uncomment this line if tiktoken is not yet installed on your machine.
#!pip install tiktoken

import tiktoken

# To get the tokenizer corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("gpt-4")

# The .encode() method converts a text string into a list of token integers.
ltokens = enc.encode("soooo much rrrracing in Kannapolis this Summer!")
print(ltokens)

[708, 39721, 1790, 436, 637, 81, 4628, 304, 78311, 24751, 420, 19367, 0]


In [11]:
# The .decode() method converts a list of token integers to a string.
enc.decode(ltokens)

'soooo much rrrracing in Kannapolis this Summer!'

In [12]:
# The .decode_single_token_bytes() method safely converts a single integer token to the bytes it represents.
tokens = [enc.decode_single_token_bytes(token) for token in ltokens]
print(tokens)

[b'so', b'ooo', b' much', b' r', b'rr', b'r', b'acing', b' in', b' Kann', b'apolis', b' this', b' Summer', b'!']
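
Since each token is just a sequence of bytes, concatenating the tokens in order should give back the bytes of the original text. The short check below illustrates this with the token list printed above, copied in by hand so that it runs without `tiktoken`.

```python
text = "soooo much rrrracing in Kannapolis this Summer!"

# Token byte strings copied from the output above.
tokens = [b'so', b'ooo', b' much', b' r', b'rr', b'r', b'acing', b' in',
          b' Kann', b'apolis', b' this', b' Summer', b'!']

# Join the bytes first, then decode the whole sequence once.
reconstructed = b"".join(tokens).decode("utf-8")

print(reconstructed == text)
print(len(text), "characters ->", len(tokens), "tokens")
```

Note how the space that separates two words is attached to the start of the following token, so no separator is needed when joining.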


In [13]:
# We usually combine .encode() with .decode_single_token_bytes() into one list comprehension
# to get the list of tokens as byte strings.
tokens = [enc.decode_single_token_bytes(token) for token in enc.encode("soooo much rrrracing in Kannapolis this Summer!")]

# Note the 'b' in front of each string, which means that the string you see is a sequence of bytes.
print(tokens)

[b'so', b'ooo', b' much', b' r', b'rr', b'r', b'acing', b' in', b' Kann', b'apolis', b' this', b' Summer', b'!']


In [14]:
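# To convert each byte string to a regular Python string, decode it as UTF-8 with token.decode('utf-8').
# This works here because every token is valid UTF-8 on its own; a token that contains
# only part of a multi-byte character would raise a UnicodeDecodeError.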
utf8_tokens = [token.decode('utf-8') for token in tokens]
print(utf8_tokens)

['so', 'ooo', ' much', ' r', 'rr', 'r', 'acing', ' in', ' Kann', 'apolis', ' this', ' Summer', '!']
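
Decoding token by token is only safe when every token ends on a character boundary. The sketch below shows what can go wrong, using a hand-made split of the UTF-8 bytes of `naïve` that cuts the two-byte character `ï` in half. The split is constructed purely for illustration and is not the output of any `tiktoken` encoding.

```python
text = "naïve"
data = text.encode("utf-8")      # 'ï' occupies two bytes in UTF-8

# Hypothetical token boundary that falls in the middle of 'ï'.
pieces = [data[:3], data[3:]]

for piece in pieces:
    try:
        print(piece.decode("utf-8"))
    except UnicodeDecodeError:
        print(piece, "is not valid UTF-8 on its own")

# Option 1: join the bytes first, then decode once.
print(b"".join(pieces).decode("utf-8"))

# Option 2: decode each piece leniently; invalid bytes become U+FFFD.
print([piece.decode("utf-8", errors="replace") for piece in pieces])
```

Joining before decoding is lossless, while `errors="replace"` is useful for inspecting individual tokens at the cost of losing the split character.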


In [15]:
# Let's see how tiktoken deals with different types of white space.
tokens = [enc.decode_single_token_bytes(token) for token in enc.encode("His \t \n \n\t \t ambivalence was perplexing.")]
#tokens = enc.decode_single_token_bytes(enc.encode("His ambivalence was perplexing."))

print(tokens)

[b'His', b' \t', b' \n \n', b'\t ', b'\t ', b' amb', b'ivalence', b' was', b' ', b' perplex', b'ing', b'.']


In [16]:
tokens = [enc.decode_single_token_bytes(token) for token in enc.encode("I think what she said is soooo craaaazy!")]
print(tokens)

# Strip the leading space from the second token and decode it to a regular string.
tokens[1].strip().decode('utf-8')

[b'I', b' think', b' what', b' she', b' said', b' is', b' so', b'ooo', b' cra', b'aa', b'azy', b'!']


'think'
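
In the output above, a token that begins a new word carries a leading space, while continuation pieces such as `ooo` or `azy` do not. Under that assumption, subword tokens can be regrouped into space-separated words. The helper `group_by_leading_space` is a hypothetical name, and the token list is copied from the output above so the sketch runs without `tiktoken`.

```python
# Token byte strings copied from the output above.
tokens = [b'I', b' think', b' what', b' she', b' said', b' is',
          b' so', b'ooo', b' cra', b'aa', b'azy', b'!']


def group_by_leading_space(tokens):
    """Merge subword tokens into words; a leading space marks the start of a new word."""
    words = []
    for token in tokens:
        piece = token.decode("utf-8")
        if piece.startswith(" ") or not words:
            words.append(piece.lstrip(" "))
        else:
            words[-1] += piece  # continuation of the previous word
    return words


print(group_by_leading_space(tokens))
```

Punctuation that directly follows a word, such as the final `!`, has no leading space and therefore stays attached to the preceding word in this sketch.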

In [17]:
# Another example showing subword tokens.
tokens = [enc.decode_single_token_bytes(token) for token in enc.encode("The perplexing cat sat on the mat.")]
print(tokens)

[b'The', b' perplex', b'ing', b' cat', b' sat', b' on', b' the', b' mat', b'.']


In [18]:
# Let's decode each byte string as UTF-8 to get regular Python strings.
utf8_tokens = [token.decode('utf-8') for token in tokens]
print(utf8_tokens)

['The', ' perplex', 'ing', ' cat', ' sat', ' on', ' the', ' mat', '.']
