Perfecting Prismia Clustering via Deep Learning, Part 2

by Jackson Brandberg, Ilina Mitra and Alaena Roberds

Review of week 1:

Last week, we performed some preliminary exploratory data analysis as a first step to enhancing Prismia, a virtual chat interface that allows students and instructors to have dynamic interactions over course material. You can read more about the steps we took in Part 1 of our blog series. In short, we selected and created a test set of responses for the existing clustering algorithm and calculated a baseline cosine-similarity score of 0.555. In this week’s post, we discuss the steps we took to fine-tune this model and (hopefully!) increase our similarity score, creating a more robust clustering algorithm.
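As a refresher, cosine similarity measures how closely two embedding vectors point in the same direction, ranging from -1 to 1. A minimal sketch with toy vectors (these are illustrative stand-ins, not real Prismia sentence embeddings):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine similarity: dot product of u and v divided by the product of their norms."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy "embeddings" standing in for sentence vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])

score = cosine_similarity(a, b)  # b is parallel to a, so the score is (essentially) 1.0
```

Our baseline of 0.555 is this quantity averaged over the test-set response pairs, computed on the embeddings produced by the existing model.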

Scraping text from a deep learning textbook:

Our first idea to improve the existing clustering algorithm was to have the model train on our deep learning textbook. Our hope is that when the clustering algorithm is implemented for a deep learning class, it will better recognize some of the nuanced cases specific to deep learning terminology. To scrape the textbook, we used textract, a Python package that extracts meaningful text from Word documents, PDFs, and PowerPoint presentations. By converting d2l.ai into a PDF and saving the file to our directory, we were able to call textract, as shown below, to produce a single block of text, which we converted from bytes to a string for our model to train on.
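The extraction step itself is short. A minimal sketch, assuming the converted textbook has been saved as a local PDF (the filename below is a placeholder, not the actual path we used):

```python
def scrape_textbook(pdf_path: str) -> str:
    """Extract the full text of a PDF as one string using textract."""
    import textract  # pip install textract

    raw = textract.process(pdf_path)  # textract returns the extracted text as bytes
    return raw.decode("utf-8")       # convert bytes to a plain Python string

# corpus = scrape_textbook("d2l.pdf")  # placeholder filename
```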

What we tried / didn’t work:

Although we were hopeful that the textracted output would be simple to train our model on, this was not the case. Our model, a SentenceTransformers model built on PyTorch, was trained on structured data. More specifically, the data used in the aforementioned model consisted of pairs of sentences with a precalculated similarity score: the NLI data set. Our scraped textbook, by contrast, is one long unstructured string with no such pairs or labels.
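To make the mismatch concrete, the structured training data has roughly the following shape (the sentences and scores below are made up for illustration, not drawn from the NLI set):

```python
# Each training example is a sentence pair plus a precomputed similarity score.
# Raw textbook text has none of this structure, which is why it could not be
# fed to the model directly.
train_examples = [
    ("The gradient gives the direction of steepest ascent.",
     "A gradient points uphill.",
     0.9),
    ("The gradient gives the direction of steepest ascent.",
     "Convolutions share weights across spatial positions.",
     0.1),
]

# Sanity check: every example is a (sentence, sentence, score) triple
for s1, s2, score in train_examples:
    assert isinstance(s1, str) and isinstance(s2, str)
    assert 0.0 <= score <= 1.0
```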

Breakthrough?:

As we started to lose steam, we had an idea! We decided to load in a different pretrained model from Hugging Face, one that was trained on a more math-centered corpus, to see if our similarity score would increase — and it did!
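Swapping in a different pretrained model is a small change with the sentence-transformers library. A sketch of the comparison we ran, with the checkpoint name left as a placeholder since any Hugging Face model name can be dropped in:

```python
def score_pair(model_name: str, s1: str, s2: str) -> float:
    """Embed two sentences with a pretrained model and return their cosine similarity."""
    from sentence_transformers import SentenceTransformer, util  # pip install sentence-transformers

    model = SentenceTransformer(model_name)  # any Hugging Face checkpoint name
    emb1, emb2 = model.encode([s1, s2])
    return float(util.cos_sim(emb1, emb2))

# Compare the baseline model against a more math-centered one on the same pair:
# score_pair("<baseline-model>", "gradient descent", "stochastic optimization")
# score_pair("<math-centered-model>", "gradient descent", "stochastic optimization")
```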

Things we are learning:

We came into this week naively optimistic, as can often be the case with exciting projects. We had expected that this week’s work would consist of simply scraping the deep learning textbook, doing some basic preprocessing so that we could feed it through our model, and voilà! Our work would be done.