Perfecting Prismia Clustering via Deep Learning, Part 3

By Jackson Brandberg, Ilina Mitra, and Alaena Roberds

Review of previous work:

Scraping text from a deep learning textbook (take 2!):

After this, we began the (somewhat tedious!) task of cleaning the data. The text contained a significant number of stray symbols, as well as large code and LaTeX blocks, all of which had to be removed. This was by far the most challenging and time-consuming portion of the project, but it was necessary to obtain clean sentences for our sentence-transformer training set. While removing the standalone symbols was fairly straightforward, removing code blocks was more complicated because in many cases it is hard to distinguish code from actual text. After testing our preprocessing code on one chapter of the textbook and being satisfied with the results, we created a function, preprocess_text (pictured below), to do this for all necessary chapters in the textbook.
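As a rough illustration of this kind of cleaning, here is a minimal sketch. It is not our exact function: it assumes the scraped chapters are Markdown-like text with backtick-fenced code and `$`-delimited LaTeX, and every pattern below is an illustrative assumption rather than the rule set we actually used.

```python
import re

# Hypothetical patterns, chosen for illustration only.
CODE_FENCE = re.compile(r"`{3}.*?`{3}", re.DOTALL)      # fenced code blocks
HEADING = re.compile(r"^#.*$", re.MULTILINE)            # Markdown heading lines
DISPLAY_MATH = re.compile(r"\$\$.*?\$\$", re.DOTALL)    # $$ ... $$ blocks
INLINE_MATH = re.compile(r"\$[^$\n]*\$")                # $ ... $ spans
STRAY_SYMBOLS = re.compile(r"[#*`~|<>{}\[\]\\^=]+")     # leftover markup symbols
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")             # split after ., ! or ?


def preprocess_text(raw):
    """Return a list of cleaned sentences from one raw section of text."""
    text = CODE_FENCE.sub(" ", raw)         # drop code first, comments included
    text = HEADING.sub(" ", text)
    text = DISPLAY_MATH.sub(" ", text)
    text = INLINE_MATH.sub(" ", text)
    text = STRAY_SYMBOLS.sub(" ", text)
    text = re.sub(r"\s+", " ", text).strip()
    sentences = SENTENCE_END.split(text)
    # Very short fragments are usually residue, so keep sentences of 3+ words.
    return [s for s in sentences if len(s.split()) >= 3]
```

Note that a sketch like this only catches code that is explicitly fenced. Code that appears inline with prose is exactly the hard case described above, and it still needs chapter-by-chapter inspection.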

For our first pass, we selected a subset of textbook chapters to clean, due to the sheer number of chapters and sections in the textbook. Note that because we are cleaning the text manually, the cleaning is imperfect.

Setting up a structured data set:

We revisited the original SBERT paper in order to structure our input data in a similar manner. In Section 3.1 of the paper, the authors allude to a number of sentence pairs “annotated with the labels contradiction, entailment, and neutral” (Reimers and Gurevych 4). We took this basic approach and devised an idea for structured data that worked best for our model.

Our dataframe is structured as follows:

1. Two sentences are picked randomly from the preprocessed text and populate columns A and B accordingly

2. If the sentences are from the same chapter in the textbook, a similarity score of 1 is given and appended to the labels column. If the sentences are from different chapters in the textbook, a similarity score of 0 is given and appended to the labels column.

In order to create this dataframe, we first needed to clean the relevant book chapters. To create our dataset, we cleaned the chapters that we spent significant time on during class, namely the chapters on 1. Attention, 2. Computer Vision, 3. Convolutional Neural Networks, 4. Optimization, 5. Deep Learning, 6. Recurrent Neural Networks, 7. Recommender Systems, 8. Introduction, 9. Multilayer Perceptrons.

We started by building two lists, one with each sentence in our subset of the textbook, and another with each sentence’s respective chapter number. We did this by looping through each path (each subsection of a chapter has its own path) within each chapter, preprocessing one section at a time.

Finally, we drew two sentence indices uniformly at random, checked whether their chapter numbers were the same, and appended the corresponding label. We chose to start with a training set of 100,000 sentence pairs.
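The pairing procedure described above can be sketched as follows. This is a minimal standard-library version rather than our notebook code; the function name, the column names (`sentence_A`, `sentence_B`, `labels`) and the `seed` parameter are assumptions made for illustration.

```python
import random


def build_pairs(sentences, chapters, n_pairs, seed=0):
    """Draw sentence pairs uniformly at random and label them by chapter.

    sentences: list of cleaned sentences
    chapters:  list of chapter numbers, aligned with `sentences`
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_pairs):
        # Two independent uniform draws over all sentence indices.
        i = rng.randrange(len(sentences))
        j = rng.randrange(len(sentences))
        # 1 if both sentences come from the same chapter, else 0.
        label = 1 if chapters[i] == chapters[j] else 0
        rows.append(
            {"sentence_A": sentences[i], "sentence_B": sentences[j], "labels": label}
        )
    return rows
```

A list of row dictionaries like this can be passed straight to a dataframe constructor. Because both indices are drawn independently, a sentence can occasionally be paired with itself, in which case it is labeled 1.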

Building the Dataset

You can find a portion of the dataframe below, to get a more visual understanding of what we are working with throughout the remainder of this blog. As is evident from the second screenshot, this data is pretty heavily skewed and most labels are 0. That being said, for the purposes of this project what matters more than an even distribution is that the model (described below) properly learns these labels, regardless of the distribution.

Similar v. Dissimilar Count

Dataset Output
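The skew towards 0 labels follows directly from uniform sampling: the chance that two independently drawn sentences share a chapter is the sum of the squared chapter proportions. The helper below is a small sketch of that calculation (the function name is ours, introduced for illustration).

```python
from collections import Counter


def expected_similar_share(chapters):
    """Expected fraction of label-1 pairs under independent uniform sampling.

    chapters: list of chapter numbers, one entry per sentence.
    """
    counts = Counter(chapters)
    total = len(chapters)
    # P(same chapter) = sum over chapters of (share of sentences in chapter)^2
    return sum((n / total) ** 2 for n in counts.values())
```

With nine chapters of equal size this comes out to 1/9, so roughly 11% of pairs would be labeled 1. Unequal chapter sizes raise that share somewhat, but the 0 label still dominates.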

Training our original model with the preprocessed text (Attempt 1):

Now that we have a dataset to work with, it’s time to run it through training to fine-tune our BERT model. First, we need to do some additional preprocessing to get our dataset into a workable format. Type “” is what we ultimately want to feed through our model. We (following the tutorial) did this by looping over the rows of our dataframe, assigning sentence A and sentence B as the texts and the label as the label for each row. This produces a set of sentence pairs with their corresponding labels.

The next few steps are relatively straightforward: we take our new training_samples, as defined above, and wrap them in a DataLoader. A DataLoader is just an iterable over a dataset, for which we can also define the batch_size and shuffling for our model. We also define our loss as CosineSimilarityLoss, which measures how far the cosine similarity of each pair of sentence embeddings is from its label; this is the quantity we are trying to minimize.
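To make the loss concrete: to our understanding, CosineSimilarityLoss in sentence-transformers computes the cosine similarity between the two sentence embeddings and, by default, takes the mean squared error against the label. The sketch below reproduces that arithmetic on toy vectors in plain Python. It is not the library code, and the function names are ours.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_similarity_loss(embedding_pairs, labels):
    """Mean squared error between cosine similarities and target labels."""
    errors = [
        (cosine_similarity(u, v) - y) ** 2
        for (u, v), y in zip(embedding_pairs, labels)
    ]
    return sum(errors) / len(errors)
```

With labels of 0 and 1, this objective pushes same-chapter embeddings towards a cosine similarity of 1 and different-chapter embeddings towards 0.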

We used a batch size of 16 and 4 epochs, with evaluation steps set to 1000. We can further tune these hyperparameters to try to increase performance. We wanted to try fine-tuning both the bert-base-nli-mean-tokens model and the stsb-roberta-large model, which already gave us some improvement on its own last week. Now it’s time to evaluate our model and see how it performs after our fine-tuning.

Results from Attempt 1:

Training on a Semantic Textual Similarity Model (Attempt 2):

Due to memory constraints, we implemented the stsb-roberta-base model, which was originally trained on the SNLI and MultiNLI datasets and fine-tuned on an STS benchmark training set.

For this project, we trained the model on our constructed dataset for two epochs (to avoid running out of RAM) and then calculated our cosine similarity score, as we did above.

Results from Attempt 2:

Continued Efforts:

It occurred to us that perhaps we were receiving poor sentence-embedding similarity scores because our dataset of sentence pairs was so heavily skewed towards dissimilar sentences. A dataset that is unbalanced in this way may be training the model to think that relatively similar sentences (they’re both about deep learning, after all!) are actually dissimilar. The dataset was unbalanced because we drew each sentence from the uniform distribution, so two sentences rarely came from the same one of the nine chapters. Therefore, we considered sampling from a different distribution in order to create a dataset whose labels were more evenly split between 0 and 1. In the table below, which details a number of changes we made and their respective results, you can see this new dataset referred to as ‘balanced’ and the old dataset referred to as ‘unbalanced.’
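One possible way to sample a more balanced dataset is to pick the label first and then draw a pair that matches it. The sketch below illustrates that idea under stated assumptions; it is not necessarily the exact scheme behind the 'balanced' dataset, and the function name, `p_similar` and `seed` are hypothetical.

```python
import random
from collections import defaultdict


def build_balanced_pairs(sentences, chapters, n_pairs, p_similar=0.5, seed=0):
    """Sample pairs so that about `p_similar` of them share a chapter.

    Assumes at least two distinct chapters are present.
    """
    rng = random.Random(seed)
    # Group sentence indices by chapter so we can sample within a chapter.
    by_chapter = defaultdict(list)
    for idx, chapter in enumerate(chapters):
        by_chapter[chapter].append(idx)
    chapter_ids = list(by_chapter)

    rows = []
    for _ in range(n_pairs):
        if rng.random() < p_similar:
            # Similar pair: both sentences from one randomly chosen chapter.
            chapter = rng.choice(chapter_ids)
            i = rng.choice(by_chapter[chapter])
            j = rng.choice(by_chapter[chapter])
            label = 1
        else:
            # Dissimilar pair: sentences from two different chapters.
            chapter_a, chapter_b = rng.sample(chapter_ids, 2)
            i = rng.choice(by_chapter[chapter_a])
            j = rng.choice(by_chapter[chapter_b])
            label = 0
        rows.append(
            {"sentence_A": sentences[i], "sentence_B": sentences[j], "labels": label}
        )
    return rows
```

One design consequence worth noting: choosing the chapter uniformly means short chapters are over-represented relative to their sentence count, which may or may not be desirable.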

After meeting with a TA, we considered that perhaps our test-set matrix was fundamentally wrong, in the sense that we were assigning the same similarity score to a sentence paired with itself as to two sentences in the same cluster. To fix this, we recreated our test matrix to hold 1s along the diagonal (representing each sentence’s similarity with itself), 0.8s for sentences in the same cluster (an arbitrary decimal to signify ‘similar but not the same’) and 0s otherwise. Again, we acknowledge that this method is imperfect, but perhaps it would be slightly less imperfect than the binary representation. When re-running our models to find a similarity score with this new test matrix, we found that the results were still quite poor.
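The revised test matrix can be built as in the sketch below, which assumes each test sentence comes with a cluster id (the function and parameter names are ours, introduced for illustration).

```python
def build_test_matrix(cluster_ids, same_cluster_score=0.8):
    """Target similarity matrix: 1 on the diagonal, `same_cluster_score`
    for distinct sentences in the same cluster, and 0 otherwise."""
    n = len(cluster_ids)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                matrix[i][j] = 1.0                  # sentence with itself
            elif cluster_ids[i] == cluster_ids[j]:
                matrix[i][j] = same_cluster_score   # similar but not the same
    return matrix
```

For example, `build_test_matrix([0, 0, 1])` gives a 3x3 matrix in which the first two sentences score 0.8 against each other and the third scores 0 against both.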

Future Efforts:

It might also be prudent to construct a more thorough test set, one that includes clusters covering a wider range of sentences pertaining to deep learning. Similar to the suggestion above, we could also construct our test matrix more thoughtfully, as we attempted to do, with even more precise similarity scores, but such an effort would be difficult to do objectively.