NLP | TheAILearner

Advanced Tokenization Techniques in NLP

In Natural Language Processing (NLP), the way we represent text has a profound impact on the performance of our models. Tokenization, the process of breaking down text into smaller manageable units called tokens, is a foundational step in preparing text data for NLP tasks. While simple techniques like word-level tokenization exist, advanced methods like Byte Pair Encoding (BPE), SentencePiece, and WordPiece offer advantages, particularly when dealing with large vocabularies and out-of-vocabulary (OOV) words. Let’s delve into these techniques and understand their nuances.

What is Tokenization?

Tokenization is the process of segmenting a piece of text into smaller units called tokens. These tokens can range from:

Words: “The cat sat on the mat.” -> [“The”, “cat”, “sat”, “on”, “the”, “mat”, “.”]
Characters: “NLP is cool!” -> [“N”, “L”, “P”, “ ”, “i”, “s”, “ ”, “c”, “o”, “o”, “l”, “!”]
Subwords: “understandable” -> [“under”, “##stand”, “##able”] (here “##” marks a piece that continues the previous one)
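
As a rough illustration of the first two granularities, here is a minimal sketch using only the Python standard library. The regular expression and the function names are illustrative choices, not a standard tokenizer, and punctuation is kept as its own token. Subword tokenization needs a learned vocabulary, so it is not shown here.

```python
import re

def word_tokenize(text):
    # Runs of word characters become tokens; each punctuation mark is its own token.
    return re.findall(r"\w+|[^\w\s]", text)

def char_tokenize(text):
    # Every character, including spaces, becomes a token.
    return list(text)

print(word_tokenize("The cat sat on the mat."))
print(char_tokenize("NLP is cool!"))
```

Character tokens keep the vocabulary tiny but make sequences long; word tokens do the opposite. Subword methods sit between the two.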

Why Tokenization Matters

Manageable Units: NLP models generally don’t work directly with raw text. Tokens provide a structured representation for models to process.
Vocabulary Size: Tokenization techniques can influence the vocabulary size of your model, directly impacting memory usage and computational efficiency.

Advanced Tokenization Techniques

Let’s explore some sophisticated tokenization techniques frequently used in modern NLP models:

Byte Pair Encoding: BPE is a data compression technique adapted for NLP. It works as follows:
Initialization: Start with a vocabulary of individual characters, with every word in the training corpus split into characters.
Iterative Merging: Find the most frequent pair of adjacent symbols in the corpus and merge it into a new symbol, which is added to the vocabulary. Repeat until the desired vocabulary size is reached.
Example:
Initial vocabulary: ['a', 'b', 'd', 'e', 'g']
Most frequent pair: 'e', 'g' -> merge into 'eg'
New vocabulary: ['a', 'b', 'd', 'e', 'g', 'eg']
Advantage: BPE effectively handles rare and out-of-vocabulary words by representing them as sequences of subword tokens. Because single characters stay in the vocabulary, any word made of known characters can still be encoded.
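
The merge loop can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the toy corpus, the function names, and the tie-breaking rule are all made up for illustration, and real implementations add end-of-word markers, byte-level fallbacks, and much faster pair counting.

```python
from collections import Counter

def get_pair_counts(words):
    # words maps a tuple of symbols to that word's frequency in the corpus.
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs

def merge_pair(words, pair):
    # Rewrite every word, replacing each occurrence of the pair with one merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(word_freqs, num_merges):
    # Start from single characters and apply up to num_merges merges.
    words = {tuple(word): freq for word, freq in word_freqs.items()}
    vocab = sorted({ch for word in word_freqs for ch in word})
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        # Most frequent pair; ties are broken alphabetically (an arbitrary choice).
        best = max(pairs, key=lambda p: (pairs[p], p))
        words = merge_pair(words, best)
        merges.append(best)
        vocab.append(best[0] + best[1])
    return vocab, merges

toy_corpus = {"egg": 5, "beg": 3, "bad": 2}  # hypothetical word frequencies
vocab, merges = train_bpe(toy_corpus, num_merges=2)
print(merges)  # [('e', 'g'), ('eg', 'g')]
print(vocab)   # ['a', 'b', 'd', 'e', 'g', 'eg', 'egg']
```

On this toy corpus the pair ('e', 'g') occurs 8 times, so it is merged first, which matches the example above. The single characters remain in the vocabulary after each merge.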

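SentencePiece: SentencePiece is a tokenizer library that implements BPE (as well as a unigram language model alternative), with a key distinction: it treats the input text as a raw stream of Unicode characters without predefined word boundaries. Whitespace is handled as an ordinary symbol (written as “▁”), so no language-specific pre-tokenization is needed and the original text can be restored exactly from the tokens. This makes it language-independent and robust to different writing systems.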
Advantage: SentencePiece is particularly useful for languages that don’t have clear-cut word boundaries, such as Chinese or Japanese.
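
The whitespace handling can be illustrated without the library itself. The sketch below is a simplification under stated assumptions: the helper names are made up, the pieces in the second example are hand-picked rather than learned, and the real SentencePiece library additionally trains a subword model and applies its own normalization.

```python
WS = "\u2581"  # "▁", the visible marker SentencePiece uses for whitespace

def to_symbols(text):
    # Treat the sentence as one raw character stream; spaces become ordinary symbols.
    return list(text.replace(" ", WS))

def detokenize(pieces):
    # Concatenate the pieces and turn the marker back into spaces.
    return "".join(pieces).replace(WS, " ")

symbols = to_symbols("Hello world.")
print(symbols)
print(detokenize(symbols))

# Hand-picked subword pieces (not learned) round-trip the same way.
print(detokenize(["Hello", WS + "wor", "ld", "."]))
```

Because the space is just another symbol, text written without spaces, such as Chinese or Japanese, goes through exactly the same path.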

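WordPiece: WordPiece is similar to BPE, but instead of merging the most frequent pair it merges the pair that most increases the likelihood of the training data under a language model. In effect, a pair is favored when its parts occur together more often than their individual frequencies would suggest. Pieces that continue a word are commonly written with a “##” prefix, as in BERT.
Advantage: WordPiece can produce more intuitive subword units than BPE, although the difference depends on the corpus and the vocabulary size.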
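
Training a WordPiece vocabulary is beyond a short example, but applying one is simple. BERT-style tokenizers split each word greedily, always taking the longest vocabulary entry that matches. The sketch below assumes a tiny hand-made vocabulary and a hypothetical function name; it is not the code of any particular library.

```python
def wordpiece_encode(word, vocab, unk="[UNK]"):
    # Greedy longest-match-first split of a single word.
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # shrink the window and try a shorter piece
        if piece is None:
            return [unk]  # no piece matches, so the whole word is unknown
        tokens.append(piece)
        start = end
    return tokens

toy_vocab = {"under", "un", "##der", "##stand", "##able"}  # hypothetical vocabulary
print(wordpiece_encode("understandable", toy_vocab))  # ['under', '##stand', '##able']
print(wordpiece_encode("xyz", toy_vocab))             # ['[UNK]']
```

Note that “under” wins over “un” because the longest match is tried first.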

Comparison

Technique	Description	Pros	Cons
BPE	Iteratively merges the most frequent adjacent symbol pairs	Handles OOV words, language-agnostic	Can produce less intuitive subwords
SentencePiece	Runs BPE or a unigram model directly on raw Unicode text, whitespace included	Handles languages without word boundaries	Can be slightly slower than BPE
WordPiece	BPE-like, with merges chosen by likelihood gain rather than raw frequency	More linguistically meaningful subwords	A bit more computationally intensive

Let’s See Them in Action! (Example in Python using a hypothetical tokenizer)

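# NOTE: Tokenizer is a hypothetical class used for illustration only; it is not a real library API.
tokenizer = Tokenizer(method='bpe', vocab_size=5000)

text = "Tokenization is a fascinating process!"
tokens = tokenizer.encode(text)
print(tokens)
# Illustrative output only (the actual splits depend on the training corpus and the tokenizer's conventions):
# ['Token', '##ization', ' is', ' a', ' fascin', '##ating', ' pro', '##cess', '!']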

Choosing the Right Technique

The best tokenization technique depends on your dataset, language, and the specific NLP task you are tackling. Consider experimenting to find what works optimally for your needs!

If you have any doubts or suggestions, please feel free to ask and I will do my best to help or improve. Goodbye until next time.

BLEU Score – Bilingual Evaluation Understudy

Introduction

The BLEU score, which stands for Bilingual Evaluation Understudy, is a metric commonly used to evaluate the quality of machine-generated translations compared to human translations. It measures the similarity between the machine-generated translation and one or more reference translations, assigning a numerical score between 0 and 1. The higher the BLEU score, the closer the machine translation is to the reference translations, indicating better translation quality. BLEU score takes into account factors such as n-gram precision and brevity penalty, providing a useful quantitative measure for comparing different translation systems or assessing improvements in machine translation over time. Don’t worry, we will discuss these terms as we go along with the blog.

Precision

Input Sentence: “Hay un tigre en el bosque”
Human Reference: “There is a tiger in the woods”

Let’s assume the machine-translated output is: “the the the the the”
The accuracy of the machine-generated translation compared to the reference translations can be calculated using precision. Plain precision checks, for each word in the generated output, whether it appears anywhere in the reference sentence. Here every “the” is found in the reference, so precision is 5/5, a perfect score for an output that is nowhere near the reference. Modified precision fixes this by clipping: each word is credited at most as many times as it occurs in the reference. “the” occurs only once in the reference, so the clipped count is 1 and the modified precision is 1/5. This example uses unigrams (one word at a time); the same calculation is applied to n-grams of any length.
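
The two numbers can be reproduced with a short sketch. The function names are illustrative, both sentences are lowercased and split on spaces for simplicity, and the candidate is assumed to be non-empty.

```python
from collections import Counter

def unigram_precision(candidate, reference):
    # Fraction of candidate words that appear anywhere in the reference.
    reference_words = set(reference)
    matches = sum(1 for word in candidate if word in reference_words)
    return matches / len(candidate)

def modified_unigram_precision(candidate, reference):
    # Each word is credited at most as often as it occurs in the reference.
    candidate_counts = Counter(candidate)
    reference_counts = Counter(reference)
    clipped = sum(min(count, reference_counts[word])
                  for word, count in candidate_counts.items())
    return clipped / len(candidate)

candidate = "the the the the the".split()
reference = "There is a tiger in the woods".lower().split()

print(unigram_precision(candidate, reference))           # 1.0
print(modified_unigram_precision(candidate, reference))  # 0.2
```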

Formula

The formula for BLEU score with brevity penalty is as follows:

BLEU = BP * exp((1/N) * sum over n = 1..N of log(p_n))

Where:

BP (Brevity Penalty) is a penalty term that adjusts the BLEU score based on the brevity of the machine-generated translation compared to the reference translations.
p_n is the modified n-gram precision: the clipped count of n-gram matches between the machine-generated and reference translations, divided by the count of n-grams in the machine-generated translation. Averaging log(p_n) and exponentiating gives the geometric mean of the precisions; with non-uniform weights w_n that sum to 1, the exponent becomes the sum of w_n * log(p_n).
N is the maximum n-gram order considered in the calculation (typically 4).
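
BLEU combines the n-gram precisions through a geometric mean, which is usually computed as the exponential of the average log precision. The sketch below shows just that step; the precision values are made-up numbers for illustration, and the function name is hypothetical.

```python
import math

def combine_precisions(precisions, bp=1.0):
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    if min(precisions) == 0:
        return 0.0  # log(0) is undefined; unsmoothed BLEU is 0 in this case
    average_log = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(average_log)

# Hypothetical precisions for 1-grams to 4-grams.
score = combine_precisions([0.8, 0.6, 0.4, 0.2])
print(round(score, 4))
```

Because it is a geometric mean, a single very low precision pulls the whole score down much more than it would in an arithmetic average.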

The brevity penalty term BP is calculated as:

BP = 1, if c > r
BP = exp(1 - r/c), if c ≤ r

Where:

c is the length (in words) of the machine generated translation.
r is the length (in words) of the closest reference translation.

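In this formula, the brevity penalty lowers the BLEU score only when the candidate is too short. If the candidate translation is shorter than the reference, BP falls below 1, which discourages very short outputs that could otherwise reach high precision with just a few safe words. If the candidate is as long as or longer than the reference, BP is 1 and no length penalty is applied; excessively long translations are instead penalized through lower n-gram precision, because the extra words add to the denominator without adding matches.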
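
A small sketch of the brevity penalty with a few lengths plugged in. The function name and the example lengths are illustrative; returning 0 for an empty candidate is an assumption made here to avoid dividing by zero.

```python
import math

def brevity_penalty(c, r):
    # c: candidate length in words, r: closest reference length in words.
    if c == 0:
        return 0.0  # assumption: an empty candidate gets the maximum penalty
    if c > r:
        return 1.0
    return math.exp(1 - r / c)

print(brevity_penalty(9, 7))  # longer than the reference: 1.0
print(brevity_penalty(7, 7))  # same length: 1.0
print(brevity_penalty(5, 7))  # shorter: exp(1 - 7/5), roughly 0.67
```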

Implementation

import math
from collections import Counter

import nltk

# Tokenizer data used by nltk.word_tokenize. Newer NLTK releases may need
# nltk.download('punkt_tab') instead.
nltk.download('punkt')

def tokenize(sentence):
    # Split a sentence into word and punctuation tokens.
    return nltk.word_tokenize(sentence)

def calculate_ngram(tokens, n):
    # Return every contiguous n-gram in tokens as a tuple.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def calculate_precision(candidate, references, n):
    # Modified (clipped) n-gram precision of the candidate against all references.
    candidate_counter = Counter(calculate_ngram(candidate, n))
    reference_counters = [Counter(calculate_ngram(ref, n)) for ref in references]

    clipped_total = 0
    for ngram, count in candidate_counter.items():
        # Clip each count at the most times the n-gram appears in any single reference.
        max_reference_count = max(ref_counter[ngram] for ref_counter in reference_counters)
        clipped_total += min(count, max_reference_count)

    # max(1, ...) avoids division by zero when the candidate has fewer than n tokens.
    denominator = max(1, sum(candidate_counter.values()))
    return clipped_total / denominator

def calculate_brevity_penalty(candidate_len, reference_lens):
    # r is the reference length closest to the candidate length (the shorter one on ties).
    if candidate_len == 0:
        return 0.0
    r = min(reference_lens, key=lambda ref_len: (abs(ref_len - candidate_len), ref_len))
    return 1.0 if candidate_len > r else math.exp(1 - r / candidate_len)

def calculate_bleu(candidate, references, weights):
    candidate_tokens = tokenize(candidate)
    reference_tokens = [tokenize(ref) for ref in references]

    precisions = [
        calculate_precision(candidate_tokens, reference_tokens, n)
        for n in range(1, len(weights) + 1)
    ]

    # log(0) is undefined, so zero precisions are replaced with a tiny value
    # (a crude form of smoothing).
    precisions = [p if p > 0.0 else 1e-10 for p in precisions]

    # Weighted geometric mean; the weights are expected to sum to 1.
    geo_mean = math.exp(sum(w * math.log(p) for w, p in zip(weights, precisions)))
    brevity_penalty = calculate_brevity_penalty(
        len(candidate_tokens), [len(ref) for ref in reference_tokens]
    )
    return brevity_penalty * geo_mean

# Example usage
candidate = "The cat is on the mat"
references = ["There is a cat on the mat", "The mat has a cat"]
weights = [0.25, 0.25, 0.25, 0.25]

bleu_score = calculate_bleu(candidate, references, weights)
print("BLEU score:", bleu_score)

Here’s a breakdown of the code:

Tokenization:
- The tokenize function splits a given sentence into individual words or tokens.
N-gram Calculation:
- The calculate_ngram function takes a list of tokens (words) and an integer n as input, and it returns a list of n-grams (contiguous sequences of n tokens) from the input list.
Precision Calculation:
- The calculate_precision function computes the modified n-gram precision of a candidate sentence against one or more reference sentences.
- It counts the n-grams in the candidate, clips each count at the maximum number of times that n-gram appears in any single reference, and divides the clipped total by the number of candidate n-grams.
BLEU Calculation:
- The calculate_bleu function takes a candidate sentence, a list of reference sentences, and a list of weights as input.
- It tokenizes the input sentences, calculates precision for each n-gram size, combines the precisions using a weighted geometric mean, and multiplies the result by a brevity penalty.
- The weights assign importance to each n-gram size. Zero precisions are replaced with a tiny value (1e-10) so that the logarithm is defined.
Example Usage:
- An example is provided at the end, where a candidate sentence (“The cat is on the mat”) is compared to two reference sentences (“There is a cat on the mat” and “The mat has a cat”).
- The weights for different n-gram sizes are set to equal values (0.25 each), and the BLEU score is calculated using the calculate_bleu function.
- The final BLEU score is printed out.

If you have any doubts or suggestions, please feel free to ask and I will do my best to help or improve. Goodbye until next time.
