by Alaena Roberds, Ilina Mitra and Jackson Brandberg
Background and context:
Prismia is a virtual chat interface that allows students and instructors to interact dynamically over course material. It was created to fill a gap left by most other classroom-response tools: alternatives like iClicker and LearningCatalytics largely restrict instructors to asking multiple choice questions, and giving customized feedback to individual students is difficult on those platforms. Prismia addresses these issues by letting students and instructors hold a more free-form conversation about the material and ask questions at any point of confusion.
Prismia lessons are structured around instructor messages that convey key information about the material, along with questions for students to answer. In its current form, there are two types of questions: multiple choice and free response. Multiple choice questions have predefined answer choices and automated responses that the instructor has created, while free response questions allow students to type in their own answers. After a student submits a response in the chat, the instructor interface clusters the student responses based on their similarity. The instructor can then send a response message to a given cluster based on the feedback they want to provide for that kind of answer.
In this project we hope to improve the clustering via improved language processing. We will not change the clustering method currently used; rather, we hope to improve the clusters by improving the language processing of the messages.
Currently, the method for grouping student responses is quite simple: a sentence-BERT model encodes each message as a vector, and the messages are then clustered by a simple KMeans model, using the number of clusters specified by the instructor in Prismia.
In the current implementation, we load a pre-trained SentenceTransformer model (specifically, ‘bert-base-nli-mean-tokens’).
As a review, Sentence-BERT (SBERT) is a Python framework for sentence, text, and image embeddings. SBERT uses a twin (Siamese) network, which means it can process two sentences simultaneously: the two “twin” subnetworks have identical architectures and shared weights.
Once the sentences are encoded by this SentenceTransformer model, we can implement our clustering via KMeans. As a review, KMeans starts by placing ‘centroids’ at random in the vector space in which the data lies. Each data point is then assigned to the cluster of its nearest centroid (the one at the lowest Euclidean distance), and each centroid is recomputed as the mean of the points assigned to it. These assignment and update steps repeat until the cluster assignments no longer change.
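The assign-and-update loop described above is what scikit-learn’s KMeans implements; here is a small self-contained sketch on toy 2-D vectors standing in for sentence embeddings:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy "embeddings": two well-separated groups of points,
# standing in for encoded student messages.
embeddings = np.array([
    [0.0, 0.1], [0.1, 0.0], [0.0, 0.0],   # group A
    [5.0, 5.1], [5.1, 5.0], [5.0, 5.0],   # group B
])

# n_clusters plays the role of the instructor-specified cluster count.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(embeddings)
labels = kmeans.labels_

# Points in the same group receive the same cluster label.
assert labels[0] == labels[1] == labels[2]
assert labels[3] == labels[4] == labels[5]
assert labels[0] != labels[3]
```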
On the interface side, this code results in clusters like the following:
And the clustering works! As we can see above, messages with similar meaning are in fact grouped together. But the clustering is quite simple: responses containing the word “statistics” were all grouped together, responses containing “data” were clustered together, and messages with “business” stuck together. In fact, in the above example, each cluster shared a unique word and rarely had words overlapping with other clusters. What happens when our responses are not so simple?
We wanted to test this out! Because this project is somewhat less traditional than a typical deep learning project, our exploratory data analysis (EDA) will consist of developing a deep understanding of how the current clustering performs on different sets of real and fabricated student responses.
In the context of this blog post, the current model can be seen as the baseline model. In the next section, we will discuss how we created our test data, and how the baseline model performed on the test data.
Selecting and creating a test set:
As previously mentioned, our goal for this project is to improve the clustering, specifically when dealing with more technical responses that require niche knowledge of deep learning. In order to actually assess whether or not our efforts are improving the clustering, we will need to develop a test data set and select an evaluation metric.
We decided that the best way to proceed was to manually create sample responses with clear designated clusters in mind. That is, we would develop a set of sample responses, some of which clearly conveyed the same idea, and others which conveyed distinct ones.
After consideration, we drafted the following sample messages. The colors indicate the ‘clusters’ into which we believe and intend these sentences to fall.
Below is how the model currently clusters these messages:
As we can see, these clusters are quite different from the intended ones. Our goal is to fine-tune the BERT model so that it can better cluster these data science concepts.
We can represent the similarity between these sentences in a matrix, assigning a 1 to pairs of sentences in the same intended cluster and a 0 otherwise. While this is obviously simplified (sentence similarity is really more of a scale), there is no way for us to judge that scale appropriately without some sort of language model, and that is precisely what we do not want here, because this matrix is intended to serve as our point of comparison for the language models.
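As a sketch (with made-up cluster labels, not our actual sample messages), this binary target matrix can be built directly from the intended cluster assignments:

```python
import numpy as np

# Intended cluster for each sample message (illustrative labels).
intended = np.array([0, 0, 1, 1, 2])

# target[i, j] = 1.0 if messages i and j belong to the same
# intended cluster, 0.0 otherwise.
target = (intended[:, None] == intended[None, :]).astype(float)
print(target)
```

By construction the matrix is symmetric with ones on the diagonal, since every message trivially matches itself.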
As previously mentioned, we will be using the SBERT language processor and comparing sentences via cosine similarity. As a brief review, the cosine similarity of two vectors is their dot product divided by the product of their magnitudes. This yields a score between −1 and 1 (and between 0 and 1 when the vectors have no negative components). Therefore, once we run our models, we obtain a symmetric matrix with the cosine similarities between each sentence pair. Our evaluation metric will then be the cosine similarity between our original binary matrix and this cosine-similarity matrix. It is important to note that cosine similarity is in fact a similarity metric, meaning that a higher value corresponds to greater similarity. Therefore, as we work to improve our language processing through the inclusion of deep learning text, we will look for a higher cosine-similarity score between the matrices.
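A minimal sketch of this matrix-to-matrix comparison, assuming we flatten both matrices and compute an ordinary cosine similarity (both matrices below are made up for illustration):

```python
import numpy as np

def matrix_cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two matrices, treated as flat vectors."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy binary target matrix vs. a model's cosine-similarity matrix.
target = np.array([[1.0, 1.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0]])
model_sim = np.array([[1.0, 0.8, 0.2],
                      [0.8, 1.0, 0.1],
                      [0.2, 0.1, 1.0]])

score = matrix_cosine(target, model_sim)
print(round(score, 3))  # a single score; higher means closer to the target
```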
Now that we had a testing mechanism in place, we wanted to evaluate our baseline model: the current SBERT processing method described above.
We did this and achieved a cosine-similarity score of 0.555. While this doesn’t mean much on its own, it will be an excellent point of comparison moving forward!
In the next installment of this series, we will select and scrape a Deep Learning textbook and rerun our SBERT model to calculate new weights. We look forward to seeing you there!