Stairway to Heaven By Shakespeare using NLP

Pranjal Gupta
8 min read · Jun 5, 2021

I recently took a two-day-long workshop on Deep Learning as part of an event conducted by the Microsoft Learn Student Chapter at Thapar Institute of Engineering and Technology. It was an extremely concise workshop considering the number of topics I covered, but who am I kidding, Deep Learning and short tutorials never go hand in hand :P

One of the major challenges I had to solve was introducing topics as heavy as Deep Learning and Natural Language Processing to an audience that had little to no exposure to programming and algorithms in the first place. My end goal was to spark the attendees' interest in the world of Machine Learning and show them a glimpse of the potential this field holds.

To do that, I built the workshop around a project that I created: a project that covered a lot of beginner-level concepts of Deep Learning and NLP and gave the attendees a practical as well as theoretical rundown of them. I came up with a sequence model and trained it on Shakespeare's sonnets. Using that, I generated a text of 100 words with Stairway to Heaven as the seed text. (I know, I committed a sin. No one meddles with Stairway to Heaven, but I couldn't help it, this song plays in my head all day long smh)

In this blog (which happens to be my very first out of the drafts xD), I am going to implement the same project and explain, in parts, the execution and functioning of the code.

Concepts Covered

  • Defining a suitable architecture
  • Data preprocessing
  • Training and testing
  • Hyperparameter tuning

Code

Importing the required Libraries and Functions

from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout, Bidirectional
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras import regularizers
import tensorflow.keras.utils as ku
import numpy as np

This is the first section of my code, wherein all the required functions, classes, and libraries are imported. I am using the TensorFlow library instead of PyTorch for implementing my sequence model because I do not need a lot of variations in my model, and TF helps me achieve that faster. And, of course, NumPy for mathematical operations on the data. The TF functions that I've imported will be explained later as we encounter them in the code.

Data Preprocessing

!wget --no-check-certificate \
    https://storage.googleapis.com/laurencemoroney-blog.appspot.com/sonnets.txt \
    -O /tmp/sonnets.txt

tokenizer = Tokenizer()
data = open('/tmp/sonnets.txt').read()
corpus = data.lower().split("\n")
tokenizer.fit_on_texts(corpus)
total_words = len(tokenizer.word_index) + 1

In the above code block, the wget command fetches Shakespeare's sonnets from the given address into a text file, sonnets.txt, in my tmp folder. This file is then read and stored in the variable "data", which will be used to train our model after we process it. Tokenizer, as the name suggests, tokenizes the data; that is, it breaks the sentences down into words. "corpus" stores the data after the whole text is converted to lower case and split at newlines, so each line of the sonnets becomes one element of the list. The tokenizer is then fitted on the corpus to break the lines down into words and assign each unique word a unique integer.
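To make this concrete, here is a tiny, self-contained sketch of what fitting the tokenizer produces (the toy corpus and the exact integers below are illustrative, not taken from the actual sonnets file):

from tensorflow.keras.preprocessing.text import Tokenizer

# a toy two-line corpus, just to illustrate the tokenizer's behaviour
toy_corpus = ["from fairest creatures we desire increase",
              "that thereby beauty's rose might never die"]
toy_tokenizer = Tokenizer()
toy_tokenizer.fit_on_texts(toy_corpus)
print(toy_tokenizer.word_index)
# e.g. {'from': 1, 'fairest': 2, 'creatures': 3, 'we': 4, 'desire': 5, ...}
print(toy_tokenizer.texts_to_sequences(["we desire increase"]))
# [[4, 5, 6]] -- each word is replaced by its assigned integer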

Screen Capture of the Workshop Presentation illustrating Tokenizing
# create input sequences using list of tokens
input_sequences = []
for line in corpus:
    token_list = tokenizer.texts_to_sequences([line])[0]
    for i in range(1, len(token_list)):
        n_gram_sequence = token_list[:i+1]
        input_sequences.append(n_gram_sequence)

Now, in most NLP applications, we face the problem of limited data, which might not be enough to train our model effectively. So we take the existing text and break it down into multiple parts to squeeze more training examples out of the same piece of text. This technique is called n-gram sequencing; in the code, it is done by iterating over each line, and the sequences generated this way are appended to the "input_sequences" list.
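As a quick illustration (the token values here are made up for the example), this is how one tokenized line expands into its n-gram prefixes:

# a hypothetical tokenized line of 5 words
token_list = [34, 417, 877, 166, 213]
sequences = [token_list[:i + 1] for i in range(1, len(token_list))]
print(sequences)
# [[34, 417], [34, 417, 877], [34, 417, 877, 166], [34, 417, 877, 166, 213]]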

Screen Capture of the Workshop Presentation illustrating N-Gram Sequencing
# pad sequences
max_sequence_len = max([len(x) for x in input_sequences])
input_sequences = np.array(pad_sequences(input_sequences, maxlen=max_sequence_len, padding='pre'))

The data, whenever fed to a model, should be homogeneous and well processed so that the model doesn't start learning irrelevant patterns from differences in input size that do not contribute to the meaning of the text. Therefore, to overcome this issue, we pad the input sequences by adding 0s in front of them.
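Here is a small standalone sketch of what pre-padding does (again, with made-up token values):

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

sequences = [[34, 417], [34, 417, 877], [34, 417, 877, 166, 213]]
padded = np.array(pad_sequences(sequences, maxlen=5, padding='pre'))
print(padded)
# [[  0   0   0  34 417]
#  [  0   0  34 417 877]
#  [ 34 417 877 166 213]]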

Screen Capture of the Workshop Presentation illustrating Padding of Input Sequences
# create predictors and label
predictors, label = input_sequences[:, :-1], input_sequences[:, -1]
label = ku.to_categorical(label, num_classes=total_words)

Now that our data is clean and homogeneous enough to be fed to our model, we need to split it into inputs (predictors) and outputs (labels). I won't be explaining why this needs to be done in this blog; you can refer to my workshop recording for a proper explanation, along with some other fundamental concepts of Deep Learning and NLP.
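If you want to see the mechanics anyway, here is a tiny sketch with toy values (not the real data): the last token of every padded sequence becomes the label, the rest become the predictors, and the label is one-hot encoded over the vocabulary size.

import numpy as np
from tensorflow.keras.utils import to_categorical

toy = np.array([[0, 0, 4, 2],
                [0, 4, 2, 7]])
toy_predictors, toy_label = toy[:, :-1], toy[:, -1]
print(toy_predictors)   # [[0 0 4] [0 4 2]]
print(toy_label)        # [2 7]
print(to_categorical(toy_label, num_classes=8))
# each label becomes a one-hot vector of length num_classes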

Screen Capture of the Workshop Presentation illustrating Labelling of Data

Architecture of the Model


model = Sequential()
model.add(Embedding(total_words, 100, input_length=max_sequence_len-1))
model.add(Bidirectional(LSTM(150, return_sequences = True)))
model.add(Dropout(0.2))
model.add(LSTM(100))
model.add(Dense(total_words//2, activation='relu', kernel_regularizer=regularizers.l2(0.01)))
model.add(Dense(total_words, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

The model architecture is at the very core of any Machine Learning application; it needs to be well thought out and well structured because it directly affects the output. A neural network consists of multiple layers, each computing its own transformation and contributing its own set of trainable weights. I'll concisely explain the layers I've used and their functions below:

  • Embedding Layer: A word embedding represents words (and documents) as dense vectors, where each vector is the projection of a word into a continuous vector space. Simply put, it places words of similar meaning close to each other and returns their positions in the form of vectors.
Screen Capture of the Workshop Presentation illustrating the functioning of Embedding Layer
  • Long Short-Term Memory (LSTM): Humans don't start thinking from scratch every second. As you read this sentence, you understand each word based on your understanding of the previous words; your thoughts have persistence. The LSTM architecture tries to replicate that persistence of thought in your model while training it.
Screen Capture of the Workshop Presentation illustrating the importance of LSTMs
  • Bidirectional LSTMs: While training, this layer runs through the input in both the forward and backward directions instead of the traditional forward-only direction of an LSTM. This way, each prediction can take into account context from both earlier and later parts of the sequence.
Screen Capture of the Workshop Presentation illustrating the importance of Bidirectional LSTMs
  • Dropout Layer: This layer randomly switches off a specified fraction of neurons during training. It prevents overfitting and keeps the model from relying too heavily on any particular neurons. The dropout rate is one of the hyperparameters you can tune.
  • Dense Layer: This is a fully connected layer: each of its neurons receives input from all the neurons in the previous layer. It is the most commonly used layer for defining the structure of a neural network.
  • ReLU and Softmax: These are activation functions; they introduce non-linearity into the network. ReLU is used in the hidden Dense layer, while softmax turns the final layer's outputs into a probability distribution over the vocabulary, which is exactly what we need for predicting the next word.

Training the model

history = model.fit(predictors, label, epochs=100, verbose=1)

This statement fits the data to my architecture and thereby trains my model on the pre-processed data we prepared earlier. An epoch is simply a hyperparameter that defines the number of complete passes the model makes over the training data set (in this case, 100).
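If you want to see how the training progressed, a quick optional sketch (assuming matplotlib is installed; the metric key is 'accuracy' in recent TF versions, 'acc' in some older ones) is to plot the returned history object:

import matplotlib.pyplot as plt

plt.plot(history.history['accuracy'], label='accuracy')
plt.plot(history.history['loss'], label='loss')
plt.xlabel('epoch')
plt.legend()
plt.show()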

With this, the training of my sequence model is complete and it is ready to be used for text prediction.

*Shakespeare picks up the pen and the bass guitar*

seed_text = "Buying a stairway to heaven"
next_words = 100
for _ in range(next_words):
token_list = tokenizer.texts_to_sequences([seed_text])[0]
token_list = pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')
predicted = model.predict_classes(token_list, verbose=0)
output_word = ""
for word, index in tokenizer.word_index.items():
if index == predicted:
output_word = word
break
seed_text += " " + output_word
print(seed_text)
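A quick note before looking at the output: model.predict_classes was deprecated and has been removed in recent TensorFlow releases. If the loop above errors for you, an equivalent sketch of the same logic, using the tokenizer's index_word lookup instead of the inner loop, is:

# replacement for the predict_classes call and the word-lookup loop
predicted = np.argmax(model.predict(token_list, verbose=0), axis=-1)[0]
output_word = tokenizer.index_word.get(predicted, "")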

I am using "Buying a stairway to heaven" as my seed text, i.e., the text on the basis of which my sequence model will predict the next 100 words (as specified by me). The next few statements perform the same pre-processing on this seed text and then feed it to the model. Each predicted word is stored in "output_word" and then concatenated to the seed text. Here is the output of this code:

Buying a stairway to heaven though thou dost more lack more part pride long glory from their woe shall ride impute feast hour her part brought thee by men age much best still grow new pride told from thine need true thine own thy breast in their days will ‘ be told he was so plight chary forsworn her treasure foes about view bright keep confounds room only need eye to hell still go than ground other about prove confounds serving young thee arising lust new ‘ mistaking bow subjects place her substance twain gone so bright confounds she forsworn me alone so bright hence

It does not make complete sense, but hey, it's something, and the final text definitely matches Shakespeare's style of writing! You can change the style by changing what the model is trained on (I recommend some long pieces of literature), and by changing the seed text you can bring variations to the output of this sequence model.

So there you go, an NLP model that writes songs by poorly imitating Shakespeare’s writing style :P

Source: https://imgflip.com/i/5c7gy8

Wrapping up

This blog was an overview of the whole project, in which I tried to demystify a few concepts of NLP and Deep Learning. If you are a beginner or a casual reader, you might have found a few concepts difficult to understand. I covered this topic from scratch in the workshop, so feel free to check that out. It also contains some resources for people who want to get started with ML and DL.

Opinions and constructive criticism are highly appreciated. Feel free to contact me at http://pranjalgupta.tech/

Credits

This project was inspired by a discussion held by Laurence Moroney in an online YouTube live session, so due credit to him for this project!
