by Jackson Brandberg, Ilina Mitra, and Alaena Roberds
Cassava is one of Africa’s most important crops. It is the continent’s second-largest provider of carbohydrates (Kaggle) and is crucial to the livelihood of a large percentage of African farmers. Given the continent’s reliance on this root, viral diseases affecting cassava have the potential to set back both the health of people across Africa and many local and national economies.
In this series of blog posts, we will document our progress on the Cassava Leaf Disease Classification competition on Kaggle. The project creators hope that neural networks can be used to determine whether a cassava plant is healthy and, if not, which disease affects it. Current disease identification methods are economically inefficient; a successful deep learning model would let farmers diagnose decaying leaves in real time and act quickly to save their crop. This post covers data collection, exploratory data analysis, and the creation of a baseline model.
We will use a dataset of approximately 21,000 images of cassava leaves, both healthy and diseased. Four cassava leaf diseases are evaluated in this project: Cassava Bacterial Blight (CBB), Cassava Brown Streak Disease (CBSD), Cassava Green Mottle (CGM), and Cassava Mosaic Disease (CMD). Together, these images represent a realistic sample of cassava leaves in Uganda. The images were collected by farmers through a survey in conjunction with the AI laboratory at the National Crops Resources Research Institute (NaCRRI).
Because this is a Kaggle competition, the dataset is hosted on Kaggle and can be read through the Kaggle API. We therefore imported this preprocessed, pre-split dataset directly, which let us move straight to exploratory data analysis.
To better acquaint ourselves with the data, we first created a simple bar plot showing how many images fall into each class. Please note that our EDA draws on a public Kaggle submission, with some modifications to the code.
From the graph below, we can see that the data is imbalanced, with most pictures showing Cassava Mosaic Disease (CMD).
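For readers who want to reproduce a class-distribution plot like this one, here is a minimal sketch rather than the exact code we ran. It assumes the training labels sit in a CSV with `image_id` and `label` columns (as in the competition's `train.csv`) and that the numeric labels map to class names as in the competition's label map file. The function names `count_labels` and `plot_label_counts` are hypothetical, and the plotting step assumes matplotlib is installed.

```python
import csv
from collections import Counter

# Assumed label mapping, taken from the competition's label map file.
LABEL_NAMES = {
    0: "Cassava Bacterial Blight (CBB)",
    1: "Cassava Brown Streak Disease (CBSD)",
    2: "Cassava Green Mottle (CGM)",
    3: "Cassava Mosaic Disease (CMD)",
    4: "Healthy",
}


def count_labels(csv_path):
    """Count images per label in a CSV with `image_id` and `label` columns."""
    with open(csv_path, newline="") as f:
        return Counter(int(row["label"]) for row in csv.DictReader(f))


def plot_label_counts(counts):
    """Draw a horizontal bar chart of the class counts (needs matplotlib)."""
    # Imported here so that counting still works without matplotlib installed.
    import matplotlib.pyplot as plt

    labels = sorted(counts)
    names = [LABEL_NAMES.get(label, str(label)) for label in labels]
    values = [counts[label] for label in labels]

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.barh(names, values)
    ax.set_xlabel("Number of images")
    ax.set_title("Training images per class")
    fig.tight_layout()
    return fig
```

Calling `plot_label_counts(count_labels("train.csv"))` would then produce the chart; the path is a placeholder for wherever the competition data was downloaded.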
Next, we wanted to examine the pictures themselves to understand how each disease presents. To do so, we employed the following code, which was inspired by this Kaggle submission.
This code returns a collection of images that show the typical presentation for diseased and healthy plants.
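One way such a gallery could be built is sketched below. This is an illustration under assumptions, not the notebook code referenced above: it assumes the same `image_id`/`label` CSV layout, a directory of JPEG files named by `image_id`, and that matplotlib and Pillow are available for the display step. The helper names `sample_ids_per_label` and `show_samples` are hypothetical.

```python
import csv
import os
import random
from collections import defaultdict


def sample_ids_per_label(csv_path, n_per_label=3, seed=0):
    """Pick up to `n_per_label` random image ids for each label."""
    ids_by_label = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.DictReader(f):
            ids_by_label[int(row["label"])].append(row["image_id"])

    rng = random.Random(seed)  # fixed seed so the gallery is reproducible
    return {
        label: rng.sample(ids, min(n_per_label, len(ids)))
        for label, ids in sorted(ids_by_label.items())
    }


def show_samples(samples, image_dir):
    """Show one row of images per label (needs matplotlib and Pillow)."""
    import matplotlib.pyplot as plt
    from PIL import Image

    n_rows = len(samples)
    n_cols = max(len(ids) for ids in samples.values())
    fig, axes = plt.subplots(
        n_rows, n_cols, figsize=(4 * n_cols, 3 * n_rows), squeeze=False
    )
    for ax in axes.ravel():
        ax.axis("off")  # hide ticks, including on unused cells
    for r, (label, ids) in enumerate(samples.items()):
        for c, image_id in enumerate(ids):
            axes[r][c].imshow(Image.open(os.path.join(image_dir, image_id)))
            axes[r][c].set_title(f"label {label}")
    return fig
```

Sampling per class rather than from the whole dataset matters here because the classes are imbalanced; a purely random draw would show mostly CMD leaves.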
As we can see above, each image has a somewhat different composition. That is, each image is taken from a different angle, in different light, at different distances from the camera. With this in mind, exploratory data analysis in these types of image classification problems is somewhat limited. Therefore, we moved on to the creation of a baseline model.
In this project, we implemented the naive classification rule as our baseline model: always predict the most common class. Our bar chart in the EDA shows that the most common class is CMD, so the baseline model predicts every cassava leaf image to be CMD. This yields an accuracy of 61.5%, which is simply the share of CMD images in the training data.
While this is only a baseline model, we submitted this version to Kaggle so that we could track our standing as we continue to modify our classification techniques.
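The naive rule is simple enough to express in a few lines. The sketch below is illustrative: `majority_class_baseline` is a hypothetical helper name, and the labels in the usage example are toy values, not the competition data.

```python
from collections import Counter


def majority_class_baseline(labels):
    """Return (most common label, accuracy of always predicting it)."""
    if not labels:
        raise ValueError("labels must not be empty")
    majority_label, majority_count = Counter(labels).most_common(1)[0]
    return majority_label, majority_count / len(labels)


# Toy labels for illustration only; 3 stands in for CMD.
toy_labels = [3, 3, 3, 0, 4]
label, accuracy = majority_class_baseline(toy_labels)
print(label, accuracy)  # prints: 3 0.6
```

The accuracy of this rule equals the majority class's share of the data, which is why an imbalanced dataset makes a plain accuracy figure look better than the model deserves.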
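A constant-prediction submission file could be produced along these lines. This is a sketch under assumptions: it assumes the competition expects a CSV with `image_id` and `label` columns and that CMD has numeric label 3; `write_constant_submission` is a hypothetical helper, and the file names are placeholders.

```python
import csv

CMD_LABEL = 3  # assumed numeric label for CMD in the competition's label map


def write_constant_submission(image_ids, out_path, label=CMD_LABEL):
    """Write a submission CSV that assigns the same label to every image."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["image_id", "label"])
        for image_id in image_ids:
            writer.writerow([image_id, label])
```

In practice, `image_ids` would be the file names found in the test image directory, for example from `sorted(os.listdir(...))` on that folder.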
In the next installment of this series, we will dive deeper into the process of building more complex deep learning models for classifying cassava leaves. As we do this, we will continually check our accuracy to ensure that we are improving from our baseline model.