Santhosh Vijayabaskar
Posted on October 18, 2024
Artificial Intelligence is everywhere these days, and language models are a big part of that. Many of us have wondered how AI can predict the next word in a sentence or even write entire paragraphs. In this tutorial, we’ll build a super simple language model without relying on frameworks like TensorFlow or PyTorch, using just plain Python and NumPy.
Sounds cool? Let’s get started!🚀
What We’re Building:
We'll be creating a bigram model. It predicts the next word in a sentence based on the current word. We’ll keep it straightforward and easy to follow so you’ll learn how things work without getting buried in too much detail.🧠💡
Step 1: Set Up
Before we begin, let's make sure you’ve got Python and NumPy ready to go. If you don’t have NumPy installed, quickly install it with:
pip install numpy
Step 2: Understanding the Basics
A language model predicts the next word in a sentence. We’ll keep things simple and build a bigram model. This just means that our model will predict the next word using only the current word.
We’ll start with a short text to train the model. Here’s a small sample we’ll use:
import numpy as np
# Sample dataset: A small text corpus
corpus = """Artificial Intelligence is the new electricity.
Machine learning is the future of AI.
AI is transforming industries and shaping the future."""
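To make the idea of a bigram concrete, here is a minimal sketch that lists the word pairs in one sentence of the corpus. The names `sentence`, `tokens`, and `bigrams` are only for illustration and are not used in the rest of the tutorial.

```python
# Illustrative only: list the bigrams of one lowercased sentence from the corpus.
sentence = "machine learning is the future of ai."
tokens = sentence.split()

# Pair each token with its successor; zip stops at the shorter sequence,
# so the last token never appears as a "current" word.
bigrams = list(zip(tokens, tokens[1:]))

for current, following in bigrams:
    print(f"{current} -> {following}")
```

A sentence with 7 tokens yields 6 bigrams, and these pairs are what the model will count later on.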
Step 3: Preparing the Text
First things first, we need to break this text into individual words and create a vocabulary (basically a list of all unique words). This gives us something to work with.
# Tokenize the corpus into words
words = corpus.lower().split()
# Create a vocabulary of unique words
vocab = list(set(words))
vocab_size = len(vocab)
print(f"Vocabulary: {vocab}")
print(f"Vocabulary size: {vocab_size}")
Here, we’re converting the text to lowercase and splitting it into words. After that, we create a list of unique words to serve as our vocabulary.
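One thing to be aware of: splitting on whitespace keeps punctuation attached, so "ai" and "ai." end up as two separate vocabulary entries (you can see this in the output further down). If you would rather merge them, one possible approach is a small regex tokenizer. This is a sketch, not what the tutorial code uses; `tokenize` and `sample` are hypothetical names.

```python
import re

def tokenize(text):
    """Lowercase the text and keep runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())

sample = "Machine learning is the future of AI. AI is transforming industries."
tokens = tokenize(sample)

print(tokens)
print(sorted(set(tokens)))
```

The trade-off is that sentence boundaries disappear too, so the model can no longer learn which words tend to end a sentence.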
Step 4: Map Words to Numbers
Computers work with numbers, not words. So, we’ll map each word to an index and create a reverse mapping too (this will help when we convert them back to words later).
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}
# Convert the words in the corpus to indices
corpus_indices = [word_to_idx[word] for word in words]
Basically, we’re just turning words into numbers that our model can understand. Each word gets its own number: “ai” might become 0 and “learning” might become 1, depending on the order of the vocabulary list.
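Here is a small round-trip sketch of the two mappings on a made-up word list. It uses `sorted(set(...))` instead of `list(set(...))`, which is an assumption on my part rather than what the tutorial code does, so that the indices come out the same on every run.

```python
words = "ai is the future of ai".split()

# sorted() makes the index assignment identical on every run;
# list(set(...)) alone can order the words differently between runs.
vocab = sorted(set(words))
word_to_idx = {word: idx for idx, word in enumerate(vocab)}
idx_to_word = {idx: word for word, idx in word_to_idx.items()}

# Encode the words as indices, then decode them back again.
encoded = [word_to_idx[w] for w in words]
decoded = [idx_to_word[i] for i in encoded]

print(encoded)
print(decoded == words)
```

Decoding the indices gives back the original words, which is exactly what we rely on when turning predictions into text.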
Step 5: Building the Model
Now, let’s get to the heart of it: building the bigram model. We want to figure out the probability of one word following another. To do that, we’ll count how often each word pair (bigram) shows up in our dataset.
# Initialize bigram counts matrix
bigram_counts = np.zeros((vocab_size, vocab_size))
# Count occurrences of each bigram in the corpus
for i in range(len(corpus_indices) - 1):
    current_word = corpus_indices[i]
    next_word = corpus_indices[i + 1]
    bigram_counts[current_word, next_word] += 1
# Apply additive smoothing: add a small constant (0.01) to every bigram count
bigram_counts += 0.01
# Normalize the counts to get probabilities
bigram_probabilities = bigram_counts / bigram_counts.sum(axis=1, keepdims=True)
print("Bigram probabilities matrix: ", bigram_probabilities)
Here’s what’s happening:
We’re counting how often each word follows another (that's the bigram).
Then, we turn those counts into probabilities by normalizing them.
In simple terms, this means that if "AI" is often followed by "is," the probability for that pair will be higher.
Note: With bigram_counts += 0.01, we're applying additive smoothing, a variant of Laplace smoothing that adds a small constant instead of 1. This avoids zero probabilities for word pairs that never appear in the corpus, so every pair keeps a slightly positive probability. It also prevents a division by zero during normalization for any word with no observed successor, such as the last word of the corpus. By using a smaller value like 0.01, we strike a balance between avoiding zeros and not overly inflating probabilities for unseen word pairs.
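To see how much the smoothing constant matters, here is a rough sketch comparing a constant of 1 with 0.01 on a single made-up row of counts. The helper `smoothed_row` and the example row are hypothetical. Worth noting: the values in the sample output further down (about 0.0556 and 0.1111) match a constant of 1 for a 17-word vocabulary, so that run appears to have used 1 rather than 0.01.

```python
import numpy as np

def smoothed_row(counts, k):
    """Add k to every count in a row, then normalize the row to sum to 1."""
    counts = np.asarray(counts, dtype=float)
    return (counts + k) / (counts + k).sum()

# Hypothetical row: a 17-word vocabulary where the word was seen once,
# followed by the word at index 0.
row = np.zeros(17)
row[0] = 1

for k in (1.0, 0.01):
    probs = smoothed_row(row, k)
    print(f"k={k}: seen={probs[0]:.4f}, unseen={probs[1]:.4f}")
```

With a constant of 1 the observed pair gets 2/18 and each unseen pair 1/18, so the row is close to uniform. With 0.01 the observed pair gets 1.01/1.17, which keeps most of the probability on what the corpus actually contains.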
Step 6: Predicting the Next Word
Let’s now test our model by making it predict the next word based on any given word. We do this by sampling from the probability distribution of the next word.
def predict_next_word(current_word, bigram_probabilities):
    word_idx = word_to_idx[current_word]
    next_word_probs = bigram_probabilities[word_idx]
    next_word_idx = np.random.choice(range(vocab_size), p=next_word_probs)
    return idx_to_word[next_word_idx]
# Test the model with a word
current_word = "ai"
next_word = predict_next_word(current_word, bigram_probabilities)
print(f"Given '{current_word}', the model predicts '{next_word}'.")
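This function takes a word, looks up its probabilities, and randomly selects the next word based on those probabilities. If you pass in "ai", the model might predict something like "is" as the next word. The word must be lowercase, because the vocabulary was built from lowercased text; passing "AI" would raise a KeyError.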
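Because the prediction is sampled, you will get different words on different runs. If you want repeatable results, or want to always pick the most likely word, something like the sketch below could work. The three-word vocabulary and its probabilities are made up for illustration, and the seeded generator is my assumption, not part of the tutorial code.

```python
import numpy as np

# Hypothetical three-word vocabulary and next-word distribution.
vocab = ["is", "the", "future"]
next_word_probs = np.array([0.7, 0.2, 0.1])

# Seeded generator: the same seed yields the same sequence of draws.
rng = np.random.default_rng(seed=42)
sampled = vocab[rng.choice(len(vocab), p=next_word_probs)]

# Greedy alternative: always take the single most probable word.
greedy = vocab[int(np.argmax(next_word_probs))]

print(f"sampled: {sampled}, greedy: {greedy}")
```

Sampling gives more varied text, while the greedy choice is deterministic but tends to repeat the same word sequences.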
Step 7: Generate a Sentence
Finally, let's generate a whole sentence! We’ll start with a word and keep predicting the next word a few times.
def generate_sentence(start_word, bigram_probabilities, length=5):
    sentence = [start_word]
    current_word = start_word
    for _ in range(length):
        next_word = predict_next_word(current_word, bigram_probabilities)
        sentence.append(next_word)
        current_word = next_word
    return ' '.join(sentence)
# Generate a sentence starting with "artificial"
generated_sentence = generate_sentence("artificial", bigram_probabilities, length=10)
print(f"Generated sentence: {generated_sentence}")
This function takes an initial word and predicts the next one, then uses that word to predict the following one, and so on. Before you know it, you’ve got a full sentence!
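The function above always produces a fixed number of words. Since our tokens keep their trailing periods (like "future."), one possible variation is to stop as soon as a word ends with a period. This is only a sketch with a made-up toy model; `generate_until_period`, `max_words`, and the probability table are hypothetical.

```python
import numpy as np

def generate_until_period(start_word, vocab, probs, rng, max_words=10):
    """Sample words until one ends with '.' or max_words is reached."""
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    sentence = [start_word]
    while len(sentence) < max_words and not sentence[-1].endswith("."):
        row = probs[word_to_idx[sentence[-1]]]
        sentence.append(vocab[rng.choice(len(vocab), p=row)])
    return " ".join(sentence)

# Hypothetical toy model: each row is the next-word distribution for one word.
vocab = ["ai", "is", "the", "future."]
probs = np.array([
    [0.0, 1.0, 0.0, 0.0],      # "ai" -> "is"
    [0.0, 0.0, 1.0, 0.0],      # "is" -> "the"
    [0.0, 0.0, 0.0, 1.0],      # "the" -> "future."
    [0.25, 0.25, 0.25, 0.25],  # "future." has no observed successor
])

print(generate_until_period("ai", vocab, probs, np.random.default_rng(0)))
```

The `max_words` cap is there so generation still terminates if the model never happens to sample a sentence-ending word.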
Output
santhoshs-mbp:bigram-language-model santhosh1$ python3 main.py
Vocabulary: ['electricity.', 'artificial', 'future', 'industries', 'intelligence', 'ai.', 'future.', 'new', 'the', 'machine', 'is', 'learning', 'of', 'transforming', 'ai', 'shaping', 'and']
Vocabulary size: 17
Bigram probabilities matrix: [[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
0.05555556 0.05555556 0.05555556 0.11111111 0.05555556 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05555556 0.05555556 0.05555556 0.05555556 0.11111111 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
0.11111111 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.11111111]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.11111111 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
0.05555556 0.05555556 0.11111111 0.05555556 0.05555556]
[0.05882353 0.05882353 0.05882353 0.05882353 0.05882353 0.05882353
0.05882353 0.05882353 0.05882353 0.05882353 0.05882353 0.05882353
0.05882353 0.05882353 0.05882353 0.05882353 0.05882353]
[0.11111111 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05 0.05 0.1 0.05 0.05 0.05
0.1 0.1 0.05 0.05 0.05 0.05
0.05 0.05 0.05 0.05 0.05 ]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.11111111
0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05 0.05 0.05 0.05 0.05 0.05
0.05 0.05 0.15 0.05 0.05 0.05
0.05 0.1 0.05 0.05 0.05 ]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.11111111 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.11111111
0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05555556 0.05555556 0.05555556 0.11111111 0.05555556 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.11111111 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
0.05555556 0.05555556 0.11111111 0.05555556 0.05555556 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.05555556]
[0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
0.05555556 0.05555556 0.05555556 0.05555556 0.05555556 0.05555556
0.05555556 0.05555556 0.05555556 0.11111111 0.05555556]]
Given 'ai', the model predicts 'ai.'.
Generated sentence: artificial ai. of electricity. ai. artificial future learning artificial of ai.
🚀📂 Check out the complete code for the Bigram Language Model on GitHub.
Wrapping Up
There you have it: a simple bigram language model built from scratch using just Python and NumPy! 🧑‍💻✨ No heavyweight frameworks needed, and now you’ve got a basic idea of how AI can predict text. Feel free to experiment, tweak the code, or even extend it with more advanced models! 🚀
Give it a go, and let me know how it turns out. Happy coding! 😄💻
PS: Got questions or thoughts? 💡 Drop them in the comments! Let’s keep the conversation going! 🔄
🌐 You can also learn more about my work and projects at https://santhoshvijayabaskar.com