Machine Learning Text Generation using LSTM
Wilbert Misingo
Posted on January 12, 2023
INTRODUCTION
Generative models and chatbots have recently made a remarkable impact and are now widely known and used, but have you ever wondered how they work?
IMPLEMENTATION
In this article we will learn how to create a text generation model that automatically produces text from a few lines of prompt. For that we will use the Long Short-Term Memory (LSTM) algorithm. I won't go into detail on how the algorithm works or its architecture, but rather show how you can use it to create a text generator of your own from scratch.
To do so we are going to use the following Python libraries:
TensorFlow
Pandas
NumPy
Matplotlib
THE PROCESS
Step 01: Importing Libraries and modules
We start by importing the necessary tools. TensorFlow is the core here, providing the backbone for our machine learning operations.
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
Step 02: Fetching training data source
Reading data from a file ('poetry.txt') is a common practice. Lowercasing the text ensures uniformity, preventing the model from treating "Word" and "word" differently.
with open('poetry.txt', 'r', errors='ignore') as file:
    raw = file.read()
raw = raw.lower()
Step 03: Creating an instance of the tokenizer
The Tokenizer converts words into numerical tokens, a crucial step in preparing text data for machine learning. Think of it as a dictionary that maps words to numbers.
tokenizer = Tokenizer()
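To make the "dictionary that maps words to numbers" idea concrete, here is a dependency-free sketch of roughly what fitting a tokenizer produces. It is only an analogue, not the Keras implementation: the helper name `build_word_index`, the toy corpus, and the punctuation filter are assumptions made for illustration.

```python
from collections import Counter
import string


def build_word_index(lines):
    """Map each word to an integer id, most frequent word first, starting at 1."""
    # Simplification: replace ASCII punctuation with spaces before splitting.
    to_space = str.maketrans(string.punctuation, " " * len(string.punctuation))
    counts = Counter()
    for line in lines:
        counts.update(line.lower().translate(to_space).split())
    # most_common() sorts by count; words with equal counts keep first-seen order.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}


toy_corpus = ["the sun sets", "the moon rises", "the sun rises again"]
word_index = build_word_index(toy_corpus)
print(word_index)
```

Ids start at 1 because 0 is kept free for padding, which is why the vocabulary size used later is the number of words plus one.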
Step 04: Creating a corpus
A corpus is just a collection of text. We split our raw text into lines, creating the foundation for our training data.
corpus = raw.lower().split("\n")
Step 05: Getting information about the corpus
By fitting the Tokenizer on our corpus, we obtain essential information such as the number of unique words in our dataset. We add 1 to that count because index 0 is reserved for padding and is never assigned to a word.
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1
print("Your data corpus is made up of " + str(total_words - 1) + " unique words")
Step 06: Processing the corpus
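This step involves converting our text into sequences of numbers. For each line we build n-gram sequences: every prefix of the line that is at least two tokens long, so the model sees each word together with the words that precede it. All sequences are then left-padded with zeros to the same length.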
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
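A small worked example may help show what this step produces. The sketch below mimics the prefix loop and the pre-padding in plain Python; the token ids and the helper names `prefix_sequences` and `pad_pre` are made up for illustration, and `pad_pre` covers only the padding case (it does not truncate long sequences the way `pad_sequences` can).

```python
def prefix_sequences(token_list):
    """Return every prefix of length >= 2, mirroring the loop above."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]


def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` so that all have length `maxlen`."""
    return [[value] * (maxlen - len(seq)) + seq for seq in sequences]


tokens = [4, 7, 2, 9]  # hypothetical token ids for one four-word line
seqs = prefix_sequences(tokens)
maxlen = max(len(s) for s in seqs)
padded = pad_pre(seqs, maxlen)
for row in padded:
    print(row)
```

A four-word line therefore yields three training rows, each ending in the word the model should learn to predict.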
Step 07: Defining data and labels
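Preparing the input sequences and their corresponding labels is crucial for training. The last token of each padded sequence is the label (the word to predict), and everything before it is the input. We one-hot encode the labels so that they match the shape of the model's softmax output.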
xs, labels = input_sequences[:,:-1],input_sequences[:,-1]
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)
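The slicing and one-hot encoding can be checked on a tiny array. This sketch uses NumPy only, with `np.eye` standing in for `to_categorical` (the result should match up to dtype); the padded rows and the vocabulary size of 10 are hypothetical values.

```python
import numpy as np

# Hypothetical padded sequences: three rows of length 4.
padded = np.array([[0, 0, 4, 7],
                   [0, 4, 7, 2],
                   [4, 7, 2, 9]])
num_classes = 10  # hypothetical vocabulary size (total_words)

# Inputs are all columns but the last; labels are the last column.
xs, labels = padded[:, :-1], padded[:, -1]

# Row i of the identity matrix is the one-hot vector for class i.
ys = np.eye(num_classes)[labels]

print(xs.shape, ys.shape)
print(labels)
```

Each row of `ys` has a single 1.0 at the index of the target word, so its width always equals the vocabulary size.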
Step 08: Defining and training a model
Here, we build our neural network using the Sequential API in TensorFlow. The architecture includes an Embedding layer for word representations, a Bidirectional LSTM layer for context understanding, and a Dense layer for generating the output.
model = Sequential()
model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(20)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(xs, ys, epochs=500, verbose=1)
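To get a feel for how large this network is, the trainable parameters of each layer can be estimated by hand. The sketch below uses the standard formulas for Embedding, LSTM, and Dense layers; the vocabulary size of 3000 is a hypothetical value, so substitute your own `total_words`.

```python
def embedding_params(vocab_size, embed_dim):
    # One embed_dim-sized vector per vocabulary entry.
    return vocab_size * embed_dim


def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights, and a bias.
    return 4 * (units * (input_dim + units) + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units


vocab_size = 3000  # hypothetical; use your own total_words
embed_dim, lstm_units = 64, 20

counts = {
    "embedding": embedding_params(vocab_size, embed_dim),
    # Bidirectional runs two LSTMs and concatenates their outputs.
    "bidirectional_lstm": 2 * lstm_params(embed_dim, lstm_units),
    "dense": dense_params(2 * lstm_units, vocab_size),
}
for name, n in counts.items():
    print(f"{name}: {n:,}")
print(f"total: {sum(counts.values()):,}")
```

Under these assumptions most of the parameters sit in the Embedding and Dense layers, both of which grow with the vocabulary, while the LSTM itself stays small. Calling `model.summary()` on the real model is a quick way to compare.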
Step 09: Visualizing the training results
The training process is visualized using Matplotlib, showing the model's accuracy over each epoch. This helps us identify how well the model is learning from the data.
plt.plot(history.history['accuracy'])
plt.xlabel("Number of Epochs")
plt.ylabel('Training accuracy')
plt.show()
Step 10: Testing the model
The final step involves testing our trained model. We provide a seed text, and the model generates additional text based on its learned patterns.
seed_text = "Come what may"
next_words = 30
for _ in range(next_words):
    token_list = tokenizer.texts_to_sequences([seed_text])[0]
    token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
    # predict_classes is not available in recent TensorFlow releases;
    # take the argmax of the softmax output instead
    predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]
    output_word = ""
    for word, index in tokenizer.word_index.items():
        if index == predicted:
            output_word = word
            break
    seed_text += " " + output_word
print(seed_text)
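The generation loop follows a simple pattern: encode the text, pad it, ask the model for scores, take the best-scoring index, and append the matching word. The sketch below isolates that pattern in plain Python so it can run without a trained model. The function `generate`, the stand-in `toy_predict`, and the five-word vocabulary are all hypothetical; with a real model, `predict_fn` would wrap `model.predict`.

```python
def generate(seed_text, next_words, predict_fn, word_index, input_len):
    """Greedy generation: repeatedly append the highest-scoring next word."""
    index_word = {i: w for w, i in word_index.items()}  # reverse lookup, built once
    text = seed_text
    for _ in range(next_words):
        # Encode known words only; unknown words are skipped.
        tokens = [word_index[w] for w in text.lower().split() if w in word_index]
        tokens = tokens[-input_len:]  # keep the most recent input_len tokens
        padded = [0] * (input_len - len(tokens)) + tokens
        scores = predict_fn(padded)
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        word = index_word.get(best)
        if word is None:  # index 0 is padding and maps to no word
            break
        text += " " + word
    return text


toy_index = {"come": 1, "what": 2, "may": 3, "and": 4, "go": 5}


def toy_predict(padded):
    """Stand-in for a model: always favours the id after the last token, cyclically."""
    scores = [0.0] * (len(toy_index) + 1)
    scores[padded[-1] % len(toy_index) + 1] = 1.0
    return scores


result = generate("Come what may", 3, toy_predict, toy_index, input_len=4)
print(result)
```

Building the reverse dictionary once avoids scanning `word_index` on every step. Always taking the argmax makes the output deterministic, which is one reason greedy generation tends to repeat itself on small datasets.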
CONCLUSION
Building a text generative model is both a science and an art. Experimenting with different architectures, datasets, and parameters will deepen your understanding and improve the model's performance. Keep exploring, and happy coding!