by Jackson Brandberg, Ilina Mitra and Alaena Roberds
Review of week 1:
Last week, we performed some preliminary exploratory data analysis as a first step toward enhancing Prismia, a virtual chat interface that allows students and instructors to have dynamic interactions over course material. You can read more about the steps we took in Part 1 of our blog series. In short, we selected and created a test set of responses to train the existing clustering algorithm and calculated a baseline cosine-similarity score of 0.555. In this week’s post we discuss the steps we took to fine-tune this model and (hopefully!) increase our similarity score, creating a more robust clustering algorithm.
Scraping text from a deep learning textbook:
Our first idea for improving the existing clustering algorithm was to have the model train on our deep learning textbook. Our hope is that when the clustering algorithm is deployed in a deep learning class, it will better recognize some of the nuanced cases specific to deep learning terminology. To scrape the textbook, we used textract, a Python package that extracts meaningful text from Word documents, PDFs, and PowerPoint presentations. By converting d2l.ai into a PDF and saving the file to our directory, we were able to call textract, as shown below, to create a single block of text, which we converted from bytes to a string for our model to train on.
Textracting — Output!
What we tried / didn’t work:
Although we were hopeful that the textracted output would be simple to train our model on, this was not the case. Our model, SentenceTransformers (a PyTorch-based library), was trained on structured data. More specifically, the data used by the aforementioned model consisted of pairs of sentences with precalculated similarity scores: the NLI dataset.
Realizing our data was not in this format, we tapped into our available resources, including course teaching assistants. Unfortunately, we were still unable to get our model to train on the deep learning textbook.
We thought the best next step would be to try different text-extraction packages to see if we could produce cleaner text. It was at this point that we gave up on the deep learning textbook, given the complicated formatting and the number of equations embedded in the PDF; we weren’t sure whether the issue was the d2l.ai textbook specifically or an unstructured text corpus in general. With a simpler PDF and a different extraction method (now PyPDF2), we were able to produce cleaner text and tokenize accordingly.
As seen above, cleaner text can be obtained, but PyPDF2 has its limitations. Specifically, it works well for extracting specific pages of text from a file, rather than extracting text from a file in its entirety. Moreover, these PDF extraction methods are still temporary solutions on the way to an ideal workflow: it would be preferable to extract text directly through an API or linked format so that a downloaded file is not required to fine-tune the model.
With these roadblocks, we decided our time was better spent researching and familiarizing ourselves with existing BERT models. Given how new BERT is (it was introduced in 2018), its implementation is not as well documented. This is especially true for our project, given the unstructured nature of our data and the unsupervised nature of the problem.
As we started to lose steam, we had an idea! We decided to load in a different pretrained model from Hugging Face, one that was trained on a more math-centered corpus, to see if our similarity score would increase — and it did!
As you can see from the screenshot above, our similarity score increased from roughly 55% to approximately 64%. Although we know that using pretrained weights is not a long-term solution to the problem at hand, this small win gives us hope that there is room for improvement in the coming week.
Things we are learning:
We came into this week naively optimistic, as can often be the case with exciting projects. We had expected that this week’s work would consist of simply scraping the Deep Learning textbook and then doing some basic preprocessing so that we could spin it through our model, and voila! Our work would be done.
Of course, we quickly learned this was not the case. It became clear that before we could tackle fine-tuning, we needed a much better understanding of how BERT and S-BERT models are fine-tuned. How do we fine-tune our embeddings? Did we want to fine-tune word embeddings or sentence embeddings, given that we were using an S-BERT model with the goal of predicting similarities between sentences? It seems like we want to improve sentence embeddings, but our data is completely unstructured and not separated into sentences. Is this something we need to do? Can we do it, given that our text was extremely noisy from the PDF conversion? Is it even possible to train an S-BERT model on completely unstructured text?
In many ways, this week opened up more questions than it answered, but we still believe we have made substantial and necessary progress toward our goal of improving the clustering of Prismia student responses through natural language processing. Because so much of our work this week turned into research on past projects that attempted similar tasks, we have linked these useful resources for your benefit.
We look forward to seeing you in Part 3, where we’ll have more answers and fewer questions.