by Jackson Brandberg, Ilina Mitra, and Alaena Roberds
As introduced in the first blog post in this series, cassava is an extremely important crop in Africa, where it is the second-largest provider of carbohydrates. Because the crop is so vital, both to the people who consume it and to the farmers whose livelihoods depend on it, detecting and classifying cassava leaf disease through machine learning is an important task. Part 1 of this series covered loading data from the Kaggle API, Exploratory Data Analysis (EDA), and a simple majority-classifier baseline model with an accuracy of 61.5%. In Part 2, we preprocessed our image data to be fit for more complex models, and built three different CNNs: a Keras Sequential model, a GoogLeNet model, and a ResNet50 model. All three narrowly improved our validation accuracy, and we left off discussing the changes we planned to make to push it higher. In this third and final installment, we discuss the changes we made to both our preprocessing and our models in order to achieve stronger results, in our validation accuracy scores and in our Kaggle submission score.
Adding a VGG model:
All of our previous models were performing adequately, but none of them really stood out. With that, we thought we would try our hand at a new model: VGG. As we discussed in our previous post, transfer learning is an excellent and applicable technique because it loads a pretrained model, weights and biases included, to be built upon and adapted to our specific problem. VGG16 was a very successful model on the ImageNet dataset, achieving roughly 92.7% top-5 accuracy after weeks of training.
Therefore, we imported the VGG16 base model and added three dense layers, with dropout, to improve performance. See the implementation of the model below:
First, we defined build_vgg as follows:
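A sketch of what such a build_vgg function might look like, assuming a frozen VGG16 base topped with three dense layers and dropout as described above; the layer sizes, dropout rate, and optimizer settings are illustrative assumptions, not the exact values we used:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

def build_vgg(input_shape=(100, 100, 3), n_classes=5, weights="imagenet"):
    """VGG16 transfer-learning model: frozen convolutional base,
    three dense layers with dropout on top."""
    base = VGG16(weights=weights, include_top=False, input_shape=input_shape)
    base.trainable = False  # keep the pretrained weights fixed for now

    model = models.Sequential([
        base,
        layers.Flatten(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(128, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(n_classes, activation="softmax"),  # 5 cassava classes
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```

Freezing the base (`base.trainable = False`) means only the new dense head is trained at first, which is what makes transfer learning so cheap compared to training VGG16 from scratch.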
Immediately, we were able to break the 70% accuracy threshold we had previously been struggling to reach. From there, we tried each of the following changes: adding batch normalization, removing dropout from the layer where batch normalization was added, changing our batch size, changing our learning rate, adding a larger dense layer, and then removing a dense layer. Unfortunately, we remained at a standstill: each of these changes only narrowly moved (for better or worse) our validation accuracy score.
With little progress, and consequently low spirits toward VGG16, we turned our efforts to fine-tuning our other models.
Fine-Tuning our other models (again, with little success):
As with VGG16, we tried fine-tuning our other models to see whether any adjustments to hyperparameters or model architecture would squeeze out additional validation accuracy points. Unfortunately, we were looking for quite a big jump; as a refresher, our validation accuracy scores were sitting around the 65% to 70% mark, depending on the model. Small changes in our hyperparameters led, in some cases, to at most an additional percentage point, not the large jump we were optimistically hoping for.
While we won’t detail every change we made and its effect on our evaluation metrics, the example below with our ResNet50 model illustrates the roadblock we had hit:
Initially we implemented Flatten() at the start of our head architecture:
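A sketch of that first head, assuming a frozen ResNet50 base; the 128-unit dense layer and the placement of batch normalization are illustrative assumptions:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

def build_resnet(input_shape=(100, 100, 3), n_classes=5, weights="imagenet"):
    """ResNet50 transfer-learning model with a Flatten()-based head."""
    base = ResNet50(weights=weights, include_top=False, input_shape=input_shape)
    base.trainable = False

    model = models.Sequential([
        base,
        layers.Flatten(),            # unroll the (4, 4, 2048) feature maps
        layers.BatchNormalization(),
        layers.Dense(128, activation="relu"),
        layers.Dense(n_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```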
This yielded the following results:
We then replaced Flatten() with GlobalAveragePooling2D(), which yielded the following results:
We tried replacing our batch normalization with dropout, and saw validation accuracy as follows:
At best, our validation accuracy scores from this model tuning were in the mid-60s, and far, far lower at their worst. It became pretty clear that in order to achieve the jump in validation accuracy we were seeking, we needed to rethink a fundamental aspect of the way we were approaching this problem.
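To make the Flatten() versus GlobalAveragePooling2D() trade-off from the experiment above concrete: on the (4, 4, 2048) feature maps that ResNet50 produces for 100 x 100 inputs, pooling shrinks the head by more than an order of magnitude (the 128-unit dense layer here is an illustrative assumption):

```python
import tensorflow as tf
from tensorflow.keras import layers

inputs = tf.keras.Input(shape=(4, 4, 2048))  # ResNet50 features for 100x100 images

# Head 1: Flatten keeps every spatial position -> 4*4*2048 = 32,768 inputs
flat = layers.Dense(128, activation="relu")(layers.Flatten()(inputs))
# Head 2: global average pooling collapses each map to one value -> 2,048 inputs
gap = layers.Dense(128, activation="relu")(layers.GlobalAveragePooling2D()(inputs))

print(tf.keras.Model(inputs, flat).count_params())  # 4,194,432 weights
print(tf.keras.Model(inputs, gap).count_params())   # 262,272 weights
```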
Previously, we had settled on resizing our images to 100 x 100 pixels. This was because we had used a for loop to read each individual image, resize it, and append it to a new list, a process that was hugely expensive from a RAM perspective. While it was clear that resizing our images to something as small as 100 x 100 would cost us a lot of valuable information, we simply did not have enough RAM to increase the image size: each time we attempted resizing to 224 x 224, our notebook would crash and suggest upgrading to Colab Pro. We caved! Oops! Capitalism! Unfortunately, Colab Pro still did not offer enough RAM for our inefficient method of preprocessing our images.
We switched from our inefficient for-loop method to an image data generator, which preprocesses images on the fly as it reads them via its flow methods. This allowed us to resize our images to 224 x 224, achieving a much higher level of granularity and, we hoped, improving the models' ability to distinguish the various diseases.
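A sketch of this generator-based pipeline, assuming the Kaggle layout of a train.csv with image_id and label columns plus a directory of images; the validation split, batch size, and function name are illustrative assumptions:

```python
import pandas as pd
from tensorflow.keras.preprocessing.image import ImageDataGenerator

def make_generators(csv_path, image_dir, image_size=(224, 224), batch_size=32):
    """Build train/validation generators that resize images as they
    are read from disk, one batch at a time, instead of loading the
    whole dataset into RAM."""
    df = pd.read_csv(csv_path)
    df["label"] = df["label"].astype(str)  # flow_from_dataframe expects string labels

    datagen = ImageDataGenerator(rescale=1.0 / 255, validation_split=0.2)
    common = dict(
        dataframe=df,
        directory=image_dir,
        x_col="image_id",
        y_col="label",
        target_size=image_size,   # resizing happens here, batch by batch
        batch_size=batch_size,
        class_mode="categorical",
    )
    train_gen = datagen.flow_from_dataframe(subset="training", **common)
    val_gen = datagen.flow_from_dataframe(subset="validation", **common)
    return train_gen, val_gen
```

Because only one batch of full-resolution images is ever in memory at a time, the 224 x 224 size that crashed our for-loop approach becomes cheap.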
In tandem with this change, we unfroze the base layers in our VGG16 model. Changing the image resolution alone did not provide a large enough accuracy jump (as seen in the results above), but we still felt confident in the potential of VGG16. Therefore, we unfroze the base layer weights and re-ran the model with the weights saved from the fitting above. At this point, our VGG16 model architecture looked as follows:
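The unfreezing step might look like this, applied to the fitted model from build_vgg; the much smaller learning rate is an assumption meant to keep fine-tuning from overwriting the pretrained features in the first few updates:

```python
from tensorflow.keras.optimizers import Adam

def unfreeze_and_recompile(model, learning_rate=1e-5):
    """Unfreeze every layer (including the VGG16 base) and recompile
    with a much smaller learning rate for gentle fine-tuning."""
    for layer in model.layers:
        layer.trainable = True
    # Recompiling is required for the trainable change to take effect.
    model.compile(optimizer=Adam(learning_rate=learning_rate),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```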
Our final model / submission:
We achieved our best accuracy with the VGG model and transfer learning. Initially, we froze the base layers of the model and trained it with a batch size of 32 and a learning rate of 0.001. After training for 10 epochs, we unfroze the base layers of the VGG model. We also added a learning rate schedule to ensure the learning rate would decrease as we fine-tuned, and incorporated early stopping to avoid overfitting. The model trained for another 20 epochs and achieved a validation accuracy of 83%.
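This schedule-plus-early-stopping setup can be sketched with standard Keras callbacks; using ReduceLROnPlateau as the schedule, and the monitor, factor, and patience values, are our illustrative assumptions:

```python
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

callbacks = [
    # Halve the learning rate whenever validation loss stalls for 2 epochs.
    ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=2, min_lr=1e-6),
    # Stop after 5 stagnant epochs and keep the best weights seen so far.
    EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True),
]

# model.fit(train_generator, validation_data=val_generator,
#           epochs=20, callbacks=callbacks)
```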
We also ran TensorBoard, a useful visualization tool. Here are some of the graphs we found most interesting (note that in the first two plots, validation scores are shown by the blue line and training scores by the orange line):
Graph 1: Accuracy over Epochs
Graph 2: Loss over Epochs
Graph 3: Model Visualization
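For anyone reproducing plots like these, TensorBoard only needs a logging callback during training; the log directory name here is arbitrary:

```python
import tensorflow as tf

# Write per-epoch metrics (and weight histograms) under logs/cassava;
# view them with: tensorboard --logdir logs/cassava
tb_callback = tf.keras.callbacks.TensorBoard(log_dir="logs/cassava",
                                             histogram_freq=1)

# model.fit(train_generator, validation_data=val_generator,
#           epochs=20, callbacks=[tb_callback])
```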
Lessons learned and takeaways:
For us, the main takeaways lay in the importance of preprocessing, and of doing it in a way that left our working environment enough memory to run deep models. Through exploratory data analysis (EDA) and proper preprocessing, we really learned that hyperparameter tuning alone will not produce the best model. It is critical that, as data scientists, we look back at our preprocessing steps to ensure the data is organized in a way that will produce the best results later in the pipeline.
Our EDA was critical in revealing that some of the cassava pictures were mislabelled, suggesting that a perfect model is unattainable. Furthermore, it reinforced the fact that not all problems will lend themselves to 100% accuracy. As we learned in class with Sam Watson, what matters is the underlying distribution of the data, and the EDA we conducted proved this was the case in the Cassava Leaf Disease problem.
As discussed above, our breakthrough came from preprocessing with an image data generator. This not only allowed us to run more complex models without exhausting our RAM, but also let us increase accuracy by increasing the input image resolution. While 100 x 100 images did give us fairly solid accuracy (approximately 75%), a larger image size introduces less distortion. Because there is no consistent image type in the set (some pictures show the whole plant while others are close-ups of specific leaves), a small resolution muddies the details and decreases predictive power. It was therefore critical that we refined our preprocessing to account for this as we worked through the project.
With additional time and memory resources, we would hope to train our models for more epochs (between 50 and 100). We were also pointed to Google AI Notebooks, which would allow us to use higher computing power to run these models. With access to this, we could create deeper models and perform additional hyperparameter tuning to strengthen our predictive model.
If given more time, we would also hope to implement TFRecords, TensorFlow's native binary storage format. When working with large datasets such as this one, a binary storage format can improve the performance of the input pipeline and, consequently, training time. Our hope moving forward is to gain a better understanding of TFRecords and use them in tandem with the suggestions mentioned above to build a more robust model.