Perfecting Prismia Clustering via Deep Learning, Part 3
By Jackson Brandberg, Ilina Mitra, and Alaena Roberds
Review of previous work:
Over the past few weeks, we have been working to improve sentence embeddings through deep learning to eventually improve message clustering in Prismia, an ed-tech virtual chat interface that allows students and instructors to have dynamic interactions over course material. In Part 1 of our blog series, we performed exploratory data analysis of the current clustering mechanism, created a testing dataset and evaluation metric for our efforts moving forward, and ran our baseline S-BERT model. In Part 2, we discussed some roadblocks we encountered in preprocessing our deep learning textbook, which we hope to use to improve deep-learning-specific sentence embeddings. After discussing with course instructors and teaching assistants, we now have a better understanding of how to clean the textbook material and train our model on it. In this third and final installment of our blog series, we will detail scraping our textbook, creating a training dataset, and fine-tuning our language models. We can’t wait to share this with you!
Scraping text from a deep learning textbook (take 2!):
As mentioned above, we met with both course instructors and teaching assistants to develop a more structured path forward with regard to preprocessing. Instead of converting the textbook into a PDF and extracting text from it (as we had done in Part 2), we cloned the markdown files from the d2l.ai GitHub repository into our Colab notebook.
After this, we began the (somewhat tedious!) task of cleaning the data. There were a significant number of stray symbols to remove, as well as large code and LaTeX blocks embedded in the text. This was by far the most challenging and time-consuming portion of the project, but it was necessary to obtain clean sentences for our sentence-transformer training set. While removing the standalone symbols was fairly straightforward, removing code blocks was more complicated because in many cases it is hard to distinguish code from actual text. After testing our preprocessing code on one chapter of the textbook and being satisfied with the results, we created a function, preprocess_text (pictured below), to do this for all necessary chapters in the textbook.
For our first pass, we selected a subset of textbook chapters to clean, due to the sheer number of chapters and sections in the textbook. It is important to note that because we wrote the cleaning rules by hand, the cleaning is imperfect.
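A minimal sketch of this kind of cleaning might look like the following. It is an illustration only, not the preprocess_text function itself: the regular expressions, the helper names clean_markdown and split_sentences, and the assumption that d2l-style references look like :numref:`...` are all hypothetical choices made for this example.

```python
import re

FENCE = "`" * 3  # the three-backtick marker that opens and closes a code block

FENCED_CODE = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)  # whole code blocks
DISPLAY_MATH = re.compile(r"\$\$.*?\$\$", re.DOTALL)          # $$ ... $$ LaTeX blocks
ROLE = re.compile(r":[\w-]+:`[^`\n]*`")                       # assumed :label:`x` style references
INLINE_MATH = re.compile(r"\$[^$\n]+\$")                      # $ ... $ (also hits "$5 and $10")
INLINE_CODE = re.compile(r"`[^`\n]+`")                        # `inline code`
IMAGE = re.compile(r"!\[[^\]]*\]\([^)]*\)")                   # ![alt](path)
LINK = re.compile(r"\[([^\]]+)\]\([^)]*\)")                   # [text](url), keep the text
HEADING = re.compile(r"^#+[ \t].*$", re.MULTILINE)            # markdown heading lines


def clean_markdown(text):
    """Strip code, LaTeX and markup from one markdown section, leaving prose."""
    text = FENCED_CODE.sub(" ", text)
    text = DISPLAY_MATH.sub(" ", text)
    text = ROLE.sub(" ", text)          # before inline code, since roles contain backticks
    text = INLINE_MATH.sub(" ", text)
    text = INLINE_CODE.sub(" ", text)
    text = IMAGE.sub(" ", text)         # before links, since images look like links
    text = LINK.sub(r"\1", text)
    text = HEADING.sub(" ", text)
    text = re.sub(r"[*_#>|]", " ", text)  # leftover standalone symbols
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text, min_words=3):
    """Naive sentence splitter; drops fragments shorter than min_words."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p for p in parts if len(p.split()) >= min_words]
```

Even this short version shows why the step is fiddly: the order of the substitutions matters, and rules such as the inline-math pattern will occasionally remove real text.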
Setting up a structured data set:
In blog post 2, we found that it might be challenging to input completely unstructured data into our model. After meeting with our instructors, we confirmed that S-BERT requires a labeled dataset in order to train. Therefore, in order to use our deep learning textbook to improve our sentence embeddings, we would need to create a structured, labeled dataset.
We revisited the original SBERT paper in order to structure our input data in a similar manner. In Section 3.1 of the paper, the authors allude to a number of sentence pairs “annotated with the labels contradiction, entailment, and neutral” (Reimers and Gurevych 4). We took this basic approach and devised an idea for structured data that worked best for our model.
Our dataframe is structured as follows:
1. Two sentences are picked randomly from the preprocessed text and populate columns A and B accordingly
2. If the sentences are from the same chapter in the textbook, a similarity score of 1 is given and appended to the labels column. If the sentences are from different chapters in the textbook, a similarity score of 0 is given and appended to the labels column.
In order to create this dataframe, we first needed to clean the relevant book chapters. To build our dataset, we cleaned the chapters that we spent significant time on during class: 1. Attention, 2. Computer Vision, 3. Convolutional Neural Networks, 4. Optimization, 5. Deep Learning, 6. Recurrent Neural Networks, 7. Recommender Systems, 8. Introduction, 9. Multilayer Perceptrons.
We started by building two lists, one with each sentence in our subset of the textbook, and another with each sentence’s respective chapter number. We did this by looping through each path (each subsection of a chapter has its own path) within each chapter, preprocessing one section at a time.
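The loop described above can be sketched roughly as follows. This is an assumption-laden illustration rather than the notebook code: collect_sentences, its parameters, and the idea of passing the cleaning step in as a to_sentences callable are all hypothetical, and the chapter directory names are left to the caller.

```python
from pathlib import Path


def collect_sentences(repo_root, chapter_dirs, to_sentences):
    """Build two parallel lists: every sentence, and the chapter it came from.

    repo_root    -- folder holding the cloned markdown files (hypothetical layout)
    chapter_dirs -- one sub-folder name per chapter, in the order to number them
    to_sentences -- callable that turns raw markdown into a list of clean sentences
    """
    sentences, chapter_ids = [], []
    for chapter_id, chapter_dir in enumerate(chapter_dirs):
        # Each subsection of a chapter lives in its own markdown file.
        for path in sorted(Path(repo_root, chapter_dir).glob("*.md")):
            section = to_sentences(path.read_text(encoding="utf-8"))
            sentences.extend(section)
            chapter_ids.extend([chapter_id] * len(section))
    return sentences, chapter_ids
```

Keeping the two lists index-aligned is what makes the later labelling step a simple comparison of chapter_ids[i] and chapter_ids[j].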
Finally, for each pair we drew two sentence indices from a uniform distribution, checked whether the corresponding chapter numbers matched, and appended the label accordingly. We chose to start with a training set of 100,000 sentence pairs.
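As a rough sketch of this sampling step, assuming the two parallel lists described above (build_pairs, its arguments, and the fixed seed are hypothetical names and choices for illustration):

```python
import random


def build_pairs(sentences, chapters, n_pairs, seed=0):
    """Draw n_pairs random sentence pairs; label 1 if same chapter, else 0."""
    rng = random.Random(seed)  # seeded so the dataset can be regenerated
    rows = []
    for _ in range(n_pairs):
        # Two independent uniform draws; note this allows i == j,
        # i.e. a sentence can occasionally be paired with itself.
        i = rng.randrange(len(sentences))
        j = rng.randrange(len(sentences))
        label = 1 if chapters[i] == chapters[j] else 0
        rows.append((sentences[i], sentences[j], label))
    return rows
```

The resulting rows map directly onto a dataframe with columns A, B, and labels, for example via pandas.DataFrame(rows, columns=["A", "B", "labels"]).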
Building the Dataset
You can find a portion of the dataframe below, to get a more visual understanding of what we are working with throughout the remainder of this blog. As is evident from the second screenshot, this data is pretty heavily skewed and most labels are 0. That being said, for the purposes of this project what matters more than an even distribution is that the model (described below) properly learns these labels, regardless of the distribution.
Similar v. Dissimilar Count
Training our original model with the preprocessed text (Attempt 1):
Much of our setup for training can be attributed to this repository, specifically the training_stsbenchmark_continue_training.py file, which outlines how to take a pre-trained BERT model, like bert-base-nli-mean-tokens, and fine-tune it on additional data.
Now that we have a dataset to work with, it’s time to run it through training to fine-tune our BERT model. First, we have to do some additional preprocessing to get our dataset into a workable format. What we ultimately want to feed through our model is a list of objects of type “transformers.data.processors.utils.InputExample”. The way we (and the tutorial) did it is by looping over the rows of our dataframe, setting the texts to be sentence A and sentence B and the label to be the row’s entry in the labels column. This produces a set of sentence pairs with their corresponding labels.
The next few steps are relatively straightforward: we take the training_samples defined above and wrap them in a DataLoader. A DataLoader is just an iterable over a dataset, for which we can also define batch_size and shuffling. We also define our loss as CosineSimilarityLoss, which computes the cosine similarity between the embeddings of each sentence pair and penalizes its squared difference from the pair’s label. This is the quantity that training minimizes.
We used a batch size of 16 and 4 epochs, with evaluation steps set to 1,000. These hyperparameters could be tuned further to try to increase performance. We wanted to try fine-tuning both the bert-base-nli-mean-tokens model and the stsb-roberta-large model, the latter of which had already given us some improvement on its own last week. Now it’s time to evaluate our model and see how it performs after our fine-tuning.
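Put together, the training setup might look roughly like the sketch below. It uses the sentence-transformers fit API with the hyperparameters quoted above; the helper names to_training_triples and fine_tune, the 10% warmup rule, and the lazy imports are assumptions made for this illustration, not a copy of the notebook.

```python
import math


def to_training_triples(rows):
    """Validate (sentence_a, sentence_b, label) rows and cast labels to float."""
    triples = []
    for a, b, label in rows:
        label = float(label)  # the cosine loss compares against a float target
        if not 0.0 <= label <= 1.0:
            raise ValueError(f"label {label} is outside [0, 1]")
        triples.append((str(a), str(b), label))
    return triples


def fine_tune(rows, model_name="bert-base-nli-mean-tokens",
              batch_size=16, epochs=4, evaluation_steps=1000):
    """Continue training a pre-trained sentence-transformers model on our pairs."""
    # Imported here so the helper above can be used without the heavy libraries.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer(model_name)
    examples = [InputExample(texts=[a, b], label=y)
                for a, b, y in to_training_triples(rows)]
    loader = DataLoader(examples, shuffle=True, batch_size=batch_size)
    loss = losses.CosineSimilarityLoss(model=model)

    # Assumed rule of thumb: warm up the learning rate over 10% of all steps.
    warmup_steps = math.ceil(len(loader) * epochs * 0.1)
    model.fit(
        train_objectives=[(loader, loss)],
        epochs=epochs,
        warmup_steps=warmup_steps,
        # Only has an effect when an evaluator is also passed to fit().
        evaluation_steps=evaluation_steps,
    )
    return model
```

Swapping model_name for "stsb-roberta-large" gives the second variant mentioned above; the rest of the setup stays the same.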
Results from Attempt 1:
We can see from the screenshot below that, despite our best efforts, the similarity score was approximately 0.488, significantly worse than our baseline model. As a reminder, Part 1 defines the similarity score as the cosine similarity between two matrices built from our test dataset. The first is the predefined binary matrix, in which 1s mark pairs of sentences in the same cluster and 0s mark all other pairs; the second holds the cosine-similarity scores between the sentence embeddings of each pair of sentences in our test set. As such, a higher score indicates a model that is better at our desired task of recognizing similarities between our test messages.
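One plausible reading of this metric, sketched in plain Python, is to flatten both matrices and take the cosine of the angle between them. The exact definition lives in Part 1, so treat the flattening, and the function names cluster_matrix, pairwise_cosine, and matrix_cosine, as assumptions made for illustration.

```python
import math


def cluster_matrix(cluster_ids):
    """Binary target: 1 where two test sentences share a cluster, else 0."""
    return [[1.0 if a == b else 0.0 for b in cluster_ids] for a in cluster_ids]


def pairwise_cosine(embeddings):
    """Cosine similarity between every pair of sentence embeddings."""
    norms = [math.sqrt(sum(x * x for x in e)) for e in embeddings]
    return [
        [sum(x * y for x, y in zip(e1, e2)) / (n1 * n2)
         for e2, n2 in zip(embeddings, norms)]
        for e1, n1 in zip(embeddings, norms)
    ]


def matrix_cosine(m1, m2):
    """Cosine similarity of two equally shaped matrices, treated as flat vectors."""
    flat1 = [x for row in m1 for x in row]
    flat2 = [x for row in m2 for x in row]
    dot = sum(a * b for a, b in zip(flat1, flat2))
    norm1 = math.sqrt(sum(a * a for a in flat1))
    norm2 = math.sqrt(sum(b * b for b in flat2))
    return dot / (norm1 * norm2)
```

With embeddings produced by a model's encode call and the hand-assigned cluster ids of the test set, the score would be matrix_cosine(cluster_matrix(ids), pairwise_cosine(embeddings)); a model whose similarities exactly reproduce the clusters scores 1.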
Training on a Semantic Textual Similarity Model (Attempt 2):
We were disheartened by this first attempt but determined to press on. For our next model, we once again drew on the GitHub repository linked above. This time we used a model optimized for Semantic Textual Similarity (STS). STS models are built to determine how similar two pieces of text are. Given that this is the task we are trying to optimize in Prismia, we were hopeful that a model trained for the same purpose would improve our cosine-similarity score.
For this attempt, we trained the model on our constructed dataset for two epochs (to avoid running out of RAM) and then calculated our cosine similarity score, as we did above.
Results from Attempt 2:
Unfortunately, the cosine similarity score for the STS model was even lower than both the baseline and our first model. After some brainstorming as a team, we realized that the issue may be our unbalanced dataset, described above. The next portion of the blog post outlines changes we made to our dataset and other amendments we made to the model while training, all in hopes of increasing our similarity score.
We quickly realized that we would have to try a number of combinations of model types, training set sizes and balance, and epochs. Instead of explaining every iteration in this blog post (there were many!), we are including this table summarizing our efforts.
It occurred to us that perhaps we were receiving poor sentence-embedding similarity scores because our dataset of sentence pairs was so heavily skewed towards dissimilar sentences. A dataset unbalanced in this way may be training the model to think that relatively similar sentences (they’re both about deep learning, after all!) are actually dissimilar. The dataset was unbalanced because we drew each sentence from the uniform distribution, so the two sentences in a pair rarely came from the same one of the nine chapters. Therefore, we considered sampling from a different distribution in order to create a dataset whose labels were more evenly split between 0 and 1. In the table below, which details a number of changes we made and their respective results, you can see this new dataset referred to as ‘balanced’ and the old dataset referred to as ‘unbalanced.’
After meeting with a TA, we considered that perhaps our test-set matrix was fundamentally wrong, in the sense that we were assigning the same similarity score to a sentence paired with itself as to two different sentences in the same cluster. To fix this, we recreated our test matrix to hold 1s along the diagonal (representing each sentence’s similarity with itself), 0.8s for sentences in the same cluster (an arbitrary decimal to signify ‘similar but not the same’) and 0s otherwise. Again, we acknowledge that this method is imperfect, but perhaps it would be slightly less imperfect than the purely binary representation. When re-running our models to find a similarity score with this new test matrix, we found that the results were still quite poor.
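To make the imbalance concrete: if the nine chapters contributed equal numbers of sentences, two uniform draws would land in the same chapter only about one time in nine. One possible way to even out the labels, sketched below, is to alternate between deliberately sampling a same-chapter pair and a different-chapter pair. This is only one hypothetical scheme (the function name build_balanced_pairs and the strict 50/50 alternation are assumptions), not necessarily the distribution used for the ‘balanced’ dataset.

```python
import random
from collections import defaultdict


def build_balanced_pairs(sentences, chapters, n_pairs, seed=0):
    """Sample pairs so that about half share a chapter (label 1) and half do not."""
    rng = random.Random(seed)
    by_chapter = defaultdict(list)
    for sentence, chapter in zip(sentences, chapters):
        by_chapter[chapter].append(sentence)

    chapter_ids = sorted(by_chapter)
    # A positive pair needs two distinct sentences from one chapter.
    multi = [c for c in chapter_ids if len(by_chapter[c]) >= 2]

    rows = []
    for k in range(n_pairs):
        if k % 2 == 0:
            chapter = rng.choice(multi)
            a, b = rng.sample(by_chapter[chapter], 2)
            rows.append((a, b, 1))
        else:
            c1, c2 = rng.sample(chapter_ids, 2)  # two different chapters
            rows.append((rng.choice(by_chapter[c1]), rng.choice(by_chapter[c2]), 0))
    return rows
```

Note that picking the chapter first and the sentence second also changes how often short chapters appear, which is a side effect worth keeping in mind when comparing against the uniform version.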
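The revised test matrix is simple to construct. A small sketch, assuming each test sentence carries a cluster id (the name graded_target_matrix and its parameter are hypothetical; 0.8 is the arbitrary value quoted above):

```python
def graded_target_matrix(cluster_ids, same_cluster_score=0.8):
    """1 on the diagonal, same_cluster_score within a cluster, 0 elsewhere."""
    n = len(cluster_ids)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 1.0  # a sentence compared with itself
            elif cluster_ids[i] == cluster_ids[j]:
                matrix[i][j] = same_cluster_score  # similar but not identical
    return matrix
```

For three sentences with cluster ids [0, 0, 1], this yields rows [1.0, 0.8, 0.0], [0.8, 1.0, 0.0], and [0.0, 0.0, 1.0].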
We believe there are a few main changes and updates we could make to the model, provided we had more time. The first would be to train on a more expanded corpus, whether that be including more d2l.ai textbook chapters or choosing a new, or even additional, text. In building a more robust input dataframe, we could also consider being more discerning about determining which sentences are actually similar. In this iteration of our project, we decided that sentences in the same chapter are similar. However, it is easy to see that this method is flawed and that building a more accurate model would involve constructing a more discerning method to determine similarity.
It might also be prudent to construct a more thorough test set, one that includes clusters covering a wider range of sentences pertaining to deep learning. Similar to the suggestion above, we could also construct our test matrix more thoughtfully, as we attempted to do, with even more precise similarity scores, but such an effort would be difficult to do objectively.
At the beginning of this project, we were really excited by the prospect of perfecting the clustering algorithm and wanted to make the user experience for future students even more robust than ours. While we didn’t quite meet this personal goal, we are really happy with the work we conducted over the past three weeks. Having the opportunity to work alongside course instructors on an interface that has come to shape our graduate school experience was extremely rewarding. It is not often that, as students, we have the opportunity to take a peek behind the curtain and see how things operate “behind the scenes.” The ability to do so was an invaluable experience and one that we certainly will not forget. Thanks for joining us on this ride!