Machine Learning Text Generation using LSTM

wmisingo

Wilbert Misingo

Posted on January 12, 2023

INTRODUCTION

In recent times, generative models and chatbots have had a remarkable impact and are now widely known and used. But have you ever wondered how they work?

IMPLEMENTATION

In this article we will learn how to create a text generation model that automatically produces text from a few lines of prompt. To do that we will use the Long Short-Term Memory (LSTM) algorithm. I won't go into detail on how the algorithm works or on its architecture; instead, I will show how you can use it to create a chatbot of your own from scratch.

To do so we are going to use the following Python libraries:

  • TensorFlow

  • Pandas

  • NumPy

  • Matplotlib

THE PROCESS

Step 01: Importing Libraries and modules

We start by importing the necessary tools. TensorFlow is the core here, providing the backbone for our machine learning operations.

import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

Step 02: Fetching training data source

Reading data from a file ('poetry.txt') is a common practice. Lowercasing the text ensures uniformity, preventing the model from treating "Word" and "word" differently.

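# Read the whole training text; the with-block closes the file for us,
# and errors='ignore' skips any bytes that cannot be decoded
with open('poetry.txt', 'r', errors='ignore') as file:
    raw = file.read()
raw = raw.lower()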

Step 03: Creating an instance of the tokenizer

The Tokenizer converts words into numerical tokens, a crucial step in preparing text data for machine learning. Think of it as a dictionary that maps words to numbers.

tokenizer = Tokenizer()
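
To make the dictionary analogy concrete, here is a minimal pure-Python sketch of the kind of mapping the Tokenizer builds once it has been fitted. It is a simplification and not Keras' implementation: the helper name `build_word_index` is made up for illustration, and it only lowercases and splits on whitespace, whereas the Keras Tokenizer also strips punctuation by default. Like Keras, it numbers words from 1 in order of decreasing frequency, which leaves index 0 free for padding.

```python
from collections import Counter


def build_word_index(lines):
    """Map each word to an integer, most frequent word first (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    # sorted() is stable, so words with equal counts keep their first-seen order.
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    # Start numbering at 1 so that 0 stays available as the padding value.
    return {word: index for index, (word, _) in enumerate(ranked, start=1)}


word_index = build_word_index(["the cat sat", "the dog sat down"])
print(word_index)
```

With a fitted Keras Tokenizer, the comparable mapping is available as `tokenizer.word_index`.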

Step 04: Creating a corpus

A corpus is just a collection of text. We split our raw text into lines, creating the foundation for our training data.

corpus = raw.lower().split("\n")

Step 05: Getting information about the corpus

By fitting the Tokenizer on our corpus, we obtain essential information such as the number of unique words in our dataset. We add 1 to that count because the Tokenizer numbers words from 1 and index 0 is reserved for padding.

tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1
print("Your data corpus is made up of "+str(total_words)+" total words")

Step 06: Processing the corpus

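This step involves converting our text into sequences of numbers. For every line we create n-gram sequences: each prefix of the line that is at least two tokens long. This way the model sees progressively longer contexts, each ending in the word it should learn to predict. The sequences are then left-padded with zeros so they all have the same length.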

input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))
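
As a small illustration of what the loop and the padding produce, the sketch below applies the same two ideas to a single made-up line of four token ids. The helper names `prefix_sequences` and `pad_pre` are hypothetical and written in plain Python; `pad_pre` only mimics the left-padding behaviour of `pad_sequences(..., padding='pre')` and does no truncation.

```python
def prefix_sequences(token_list):
    """Return every prefix of token_list that has at least two tokens."""
    return [token_list[:i + 1] for i in range(1, len(token_list))]


def pad_pre(sequences, maxlen):
    """Left-pad each sequence with zeros up to maxlen (no truncation)."""
    return [[0] * (maxlen - len(seq)) + seq for seq in sequences]


tokens = [4, 2, 7, 9]  # made-up token ids for a four-word line
sequences = prefix_sequences(tokens)
padded = pad_pre(sequences, maxlen=max(len(seq) for seq in sequences))

for row in padded:
    print(row)
```

Each row ends in a different word of the line, and that final word becomes the prediction target for the words before it.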

Step 07: Defining data and labels

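Preparing the input sequences and their corresponding labels is crucial for training. The last token of each padded sequence is the label (the word to predict), and everything before it is the input. We one-hot encode the labels so that they match the softmax output of the model and the categorical cross-entropy loss.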

xs, labels = input_sequences[:,:-1],input_sequences[:,-1]
ys = tf.keras.utils.to_categorical(labels, num_classes=total_words)

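
The split and the one-hot encoding can be sketched in plain Python on a few made-up padded sequences. The helper names and the vocabulary size of 5 are assumptions chosen for illustration; in the article's code, the same work is done by NumPy slicing and `tf.keras.utils.to_categorical`.

```python
def split_inputs_and_labels(padded_sequences):
    """Use the last token of each sequence as the label, the rest as input."""
    xs = [seq[:-1] for seq in padded_sequences]
    labels = [seq[-1] for seq in padded_sequences]
    return xs, labels


def one_hot(labels, num_classes):
    """Turn each integer label into a vector with a single 1.0 at that index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows


padded = [[0, 0, 4, 2], [0, 4, 2, 1], [4, 2, 1, 3]]  # made-up padded sequences
xs, labels = split_inputs_and_labels(padded)
ys = one_hot(labels, num_classes=5)  # assumed vocabulary size of 5

print(xs)
print(labels)
print(ys[0])
```

Each row of `ys` has as many entries as there are words in the vocabulary, which is why large vocabularies make this encoding memory hungry.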

Step 08: Defining and training a model

Here, we build our neural network using the Sequential API in TensorFlow. The architecture includes an Embedding layer for word representations, a Bidirectional LSTM layer for context understanding, and a Dense layer for generating the output.

model = Sequential()
model.add(Embedding(total_words, 64, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(20)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
history = model.fit(xs, ys, epochs=500, verbose=1)
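
To get a feel for the size of this network, the sketch below counts its trainable parameters by hand using the usual per-layer formulas. The vocabulary size of 1000 is an assumed value for illustration only; the real number depends on your `poetry.txt`. The function name `count_parameters` is also hypothetical. If the arithmetic holds, the totals should agree with what `model.summary()` reports for the same vocabulary size.

```python
def count_parameters(vocab_size, embedding_dim=64, lstm_units=20):
    """Count trainable parameters of Embedding -> Bidirectional(LSTM) -> Dense."""
    # One embedding vector of size embedding_dim per vocabulary entry.
    embedding = vocab_size * embedding_dim
    # An LSTM has 4 weight blocks (three gates plus the candidate cell state),
    # each with input weights, recurrent weights and a bias.
    one_direction = 4 * (lstm_units * (embedding_dim + lstm_units) + lstm_units)
    # Bidirectional wraps two independent LSTMs, one per direction.
    bidirectional = 2 * one_direction
    # The Dense layer sees both directions concatenated, plus one bias per output.
    dense = (2 * lstm_units + 1) * vocab_size
    return {
        "embedding": embedding,
        "bidirectional_lstm": bidirectional,
        "dense": dense,
        "total": embedding + bidirectional + dense,
    }


params = count_parameters(vocab_size=1000)  # assumed vocabulary size
for name, value in params.items():
    print(name, value)
```

Note that the Embedding and Dense layers both grow with the vocabulary, while the LSTM part stays the same size no matter how many words the corpus has.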

Step 09: Visualizing the training results

The training process is visualized using Matplotlib, showing the model's accuracy over each epoch. This helps us identify how well the model is learning from the data.

plt.plot(history.history['accuracy'])
plt.xlabel("Number of Epochs")
plt.ylabel('Training accuracy')
plt.show()

Step 10: Testing the model

The final step involves testing our trained model. We provide a seed text, and the model generates additional text based on its learned patterns.

seed_text = "Come what may"
next_words = 30

for _ in range(next_words):
   token_list = tokenizer.texts_to_sequences([seed_text])[0] 
   token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre') 
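   # predict_classes is no longer available in recent TensorFlow releases,
   # so take the index of the highest softmax probability instead
   predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]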
   output_word = ""

   for word, index in tokenizer.word_index.items():
       if index == predicted:
           output_word = word
           break
   seed_text += " " + output_word

print(seed_text)
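
The generation loop is a greedy decoder: at every step it picks the single most probable next word and appends it to the text. The sketch below shows that logic in isolation, so it can be run without TensorFlow. Everything in it is an assumption made for illustration: `generate_text` and `toy_predict` are hypothetical names, `predict_probs` stands in for a call to `model.predict`, and the three-word vocabulary is made up. It also builds a reverse index-to-word dictionary once instead of scanning `word_index` at every step.

```python
def generate_text(seed_text, next_words, word_index, predict_probs, input_len):
    """Greedy generation: repeatedly append the most probable next word."""
    index_word = {index: word for word, index in word_index.items()}
    text = seed_text
    for _ in range(next_words):
        # Words missing from the vocabulary are simply skipped.
        tokens = [word_index[w] for w in text.lower().split() if w in word_index]
        # Keep the most recent input_len tokens and left-pad with zeros.
        tokens = tokens[-input_len:]
        padded = [0] * (input_len - len(tokens)) + tokens
        probs = predict_probs(padded)
        predicted = max(range(len(probs)), key=probs.__getitem__)  # argmax
        word = index_word.get(predicted, "")
        if not word:  # index 0 is padding, so there is nothing to append
            break
        text += " " + word
    return text


toy_word_index = {"come": 1, "what": 2, "may": 3}  # made-up vocabulary


def toy_predict(padded):
    """Stand-in for a trained model: always predicts the next word in a cycle."""
    probs = [0.0] * (len(toy_word_index) + 1)
    probs[padded[-1] % len(toy_word_index) + 1] = 1.0
    return probs


result = generate_text("Come what may", 3, toy_word_index, toy_predict, input_len=4)
print(result)
```

Because greedy decoding always takes the top prediction, a trained model can fall into repeating loops much like this toy one does.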

CONCLUSION

Building a text generative model is both a science and an art. Experimenting with different architectures, datasets, and parameters will deepen your understanding and improve the model's performance. Keep exploring, and happy coding!

Do you have a project 🚀 that you would like me to help with 🤝😊? Email me at wilbertmisingo@gmail.com
