I was recently down a dark, social media rabbit hole when I came across this post:
Clearly in no rush to finish my work, and dedicated to beating this puzzle, I stared at the image for some time. But the image was tricking my brain and I grew frustrated. Sure, I wasn’t totally lost; I recognized fur on the right, and potentially more fur on the left, and I saw what looked like three eyes in the center. Were they eyes? I was fairly certain about a red, fleece-like material at the bottom of the image. But the puzzle pieces weren’t fitting together to form a coherent image. Nothing seemed recognizable.
Finally, the right-most ‘eye’ started to resemble a snout, and I was able to turn the image, and my perspective, 90 degrees, so that I could see the face of a dog tilting its head sideways. In hindsight, it was almost obvious, and impossible to unsee from that point on.
This task was difficult, and I realized that the way I approached the classification was very similar to the way computers attempt to recognize images. Or rather, the way a computer approaches classifying an image closely mimics the human endeavor. Not coincidentally, this computerized classification technique is called a neural network, modeled partly on the human brain.
Though I don’t like to admit it, this image fooled me. It outsmarted my brain. But why was this image so hard to identify? I couldn’t help but wonder, would a computer have an equally difficult time?
Surely, the orientation of the image played a large part. In fact, to reassure myself that this wasn’t purely my inability, I uploaded this image into the Google Cloud Vision API to see if computer vision struggled.
Google was moderately successful. It was pretty sure the image contained a dog (72% confidence). However, when I rotated the image and then uploaded it to Google’s API, the results clearly improved:
Now, it was almost 90% sure that the image was a dog. Google’s Cloud Vision API performed quite similarly to my brain; I was (eventually) able to recognize the original image as a dog, but switching my perspective made the problem much easier.
Of course, getting this higher confidence level required a user manually rotating the image and then uploading it again. Computers are really good at classifying images, but that assumes they are looking at the image correctly. Imagine asking a computer to classify a whole set of images, say an album of celebrity photos, where some of the images are disoriented. The computer might fail on images it would otherwise classify easily, simply because of perspective.
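The manual fix, rotating the image before classification, is itself easy to sketch in code. Here is a minimal sketch with NumPy, treating an image as a 2-D array of pixel values (the tiny array below is just a stand-in for a real photo):

```python
import numpy as np

def rotate_image(pixels: np.ndarray, quarter_turns: int) -> np.ndarray:
    """Rotate an image (a 2-D pixel array) by 90-degree increments."""
    return np.rot90(pixels, k=quarter_turns)

# A tiny 2x3 "image": one quarter-turn produces a 3x2 image, just as
# rotating the dog photo changed what the classifier could see.
image = np.array([[1, 2, 3],
                  [4, 5, 6]])
rotated = rotate_image(image, 1)
print(rotated.shape)  # (3, 2)
```

The catch, of course, is that a human still has to decide how many quarter-turns each photo needs; automating that decision is exactly the question at hand.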
Misoriented inputs are understandably a problem, and the resulting poor accuracy might make a programmer think the computer can’t tell the difference between various images of people. But that isn’t the issue at all; the issue isn’t whether the computer can classify something, but what it is actually being shown to classify.
In order to understand whether a computer can learn to overcome this (spoiler alert: of course it can!), we have to understand the basics of neural networks, and the ways in which they work to classify images.
An Intro to Neural Networks:
My favorite definition of neural networks, perhaps because of its elegance, comes from Michael Nielsen’s “Neural Networks and Deep Learning” book. It says, “Neural networks [are] a beautifully biologically-inspired programming paradigm which enables a computer to learn from observational data.” The model is loosely inspired by the human visual cortex, which contains on the order of 140 million neurons and billions of interconnecting paths. Such a model might seem overwhelmingly complex. But at their core, neural networks simply take in inputs, recognize patterns from a set of training data, and return an output in the form of a prediction. Their applications range from classifying the content of images, to identifying songs from sound clips, to translating speech.
We can start by explaining a neural network with a relatively simple and approachable example — consider the problem of trying to read handwriting. Below, you will see three images of a handwritten ‘e.’ To the human brain, these are all recognizable, because we have trained and trained our brains to recognize these patterns. Like the neural network, we ‘learn’ from observational data. As children, we sit in classrooms and look at letters on a blackboard, continuously repeating the alphabet. We look through alphabet books, in a range of fonts, and sound out the letters as we point at each one on the page. Gradually, our brains begin to understand that the symbol with a closed semicircle and an open loop at the bottom is an e. Now, when reading this, you are recognizing these letters so quickly that you don’t even realize each letter you read on this page is the work — and success — of our natural neural networks taking in these visual inputs and returning an output. Instantaneously.
In order for a computer to classify this symbol, it must break it into smaller and simpler problems. And that is precisely what a neural network does. That’s why you see so many nodes in the following representation of a neural network — we are breaking our input into pieces, and then breaking those pieces into pieces, until we have very detailed information about various aspects of our input.
Structure of a Neural Network:
As you can see from the diagram of a neural network above, the network is composed of three main parts: the input layer, the hidden layer(s), and the output layer. Let’s zoom way in: each circle (or node) in the hidden layer is a perceptron, meaning it takes in BINARY inputs and produces a single binary output; it weighs a bunch of different factors given to it and makes a decision.
For example, the following could be an example of a perceptron:
As you can see, three different inputs drive this decision, and each input must be weighted by its relative importance. Then, in order to actually make our decision to go to the coffee shop and order a latte, we calculate a linear combination of these inputs; that is, we add each input multiplied by its respective weight.
So, let’s say each of these factors is represented by a certain number. For example, my workload is an eight out of ten, and since that is very important, we will weight it by 0.9. As you can see, the linear combination will also be a number; let’s arbitrarily say it’s 16!
But what does 16 mean? Essentially, we need some threshold value on which to base our decision. Our output is supposed to be ‘Yes! Go get that coffee!’ or ‘Honey, no, you don’t need it.’ So, we can set a threshold: if our output is over 10, we’ll get the coffee; otherwise, we won’t.
Yay! We get a coffee.
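The whole decision fits in a few lines of code. A minimal sketch, using the workload rating from the example above (the other input values and weights are made up for illustration, chosen so the weighted sum lands near 16):

```python
def perceptron(inputs, weights, threshold):
    """Return True if the weighted sum of the inputs exceeds the threshold."""
    linear_combination = sum(x * w for x, w in zip(inputs, weights))
    return linear_combination > threshold

# Hypothetical factors: workload (8/10, weighted heavily at 0.9),
# fatigue, and how good the coffee is.
inputs = [8, 6, 7]
weights = [0.9, 0.7, 0.65]   # made-up relative importances
decision = perceptron(inputs, weights, threshold=10)
print(decision)  # True -> go get that coffee
```

Change the weights or the threshold and the exact same inputs can produce the opposite decision, which is precisely what makes these parameters worth learning.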
In a neural network, the ‘relative importance’ of each feature is referred to as its weight: a numerical representation of how much we should value certain inputs over others in order to make a specific decision. The threshold value is captured by the bias (though in the equation below, bias = −threshold). It converts our linear combination into a useful piece of information for our actual decision.
Here is a mathematical representation of this perceptron decision process:
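In Nielsen’s standard notation (with the bias $b$ playing the role of the negative threshold, as noted above), the perceptron’s rule can be written as:

```latex
\text{output} =
\begin{cases}
0 & \text{if } \sum_j w_j x_j + b \le 0 \\[4pt]
1 & \text{if } \sum_j w_j x_j + b > 0
\end{cases}
```

Here $x_j$ are the inputs and $w_j$ their weights; moving the threshold to the left-hand side as $b = -\text{threshold}$ is just an algebraic convenience.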
Of course, as you can see, this setup becomes difficult when each factor in our decision is on a different scale. For example, say we rate our workload on a scale of 1 to 10, our level of fatigue on a scale of 1 to 5, and the taste of the coffee on a scale of 1 to 3. While it might seem like the weights could take care of this, that wouldn’t be enough. Moreover, we want the output to respond smoothly to small changes in our inputs. We take care of both by applying the sigmoid function to each neuron’s weighted sum: it maps any real number into the continuous range between zero and one. Therefore, instead of a ‘perceptron,’ what we really have is a sigmoid neuron.
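The sigmoid function itself is one line. A minimal sketch, reusing the hypothetical numbers from the coffee example (a weighted sum of 16 with a bias of −10):

```python
import math

def sigmoid(z: float) -> float:
    """Squash any real number into the open interval (0, 1)."""
    return 1 / (1 + math.exp(-z))

# Large positive sums land near 1, large negative sums near 0, and
# small changes in z now produce small changes in the output.
print(sigmoid(0))               # exactly 0.5
print(sigmoid(16 - 10))         # weighted sum 16, bias -10: close to 1
```

Instead of a hard yes/no at the threshold, the neuron now outputs a confidence that slides smoothly from “definitely not” to “definitely yes.”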
Now, we can zoom back out a bit. The first layer of sigmoid neurons takes all of these inputs and spits out outputs as we just discussed. The neurons in the second layer do the same thing, taking in the outputs from the first layer and spitting out new outputs, following the exact same process as before. This means we are able to break our problem into smaller, more nuanced, and more abstract decisions, with each layer making some observation about our inputs. Eventually, we travel through each layer until the last layer of sigmoid neurons predicts the final output.
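Stacking layers this way is just repeated matrix multiplication followed by the sigmoid. A minimal sketch with NumPy; the layer sizes are arbitrary, and the weights are random here, as they would be in a network before training:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def feedforward(x, layers):
    """Pass input x through each (weights, biases) layer in turn."""
    for weights, biases in layers:
        x = sigmoid(weights @ x + biases)
    return x

rng = np.random.default_rng(0)
# 4 inputs -> hidden layer of 3 neurons -> 2 output neurons
layers = [(rng.standard_normal((3, 4)), rng.standard_normal(3)),
          (rng.standard_normal((2, 3)), rng.standard_normal(2))]
output = feedforward(rng.standard_normal(4), layers)
print(output.shape)  # (2,) -- one sigmoid value per output neuron
```

Each row of a weight matrix is one neuron’s set of weights, so “a layer of neurons” collapses into a single matrix-vector product.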
Let’s revisit our example of identifying handwritten letters. In this case, we can think of the letter as a 28-by-28 pixel image. This image can be represented numerically, with each pixel corresponding to a grayscale value between zero and one, meaning the value will be close to zero if that pixel is not ‘lit up’ and close to one if the pixel is ‘lit up.’ We can elongate this 28-by-28 pixel matrix into a 784-length vector, with each entry corresponding to one of the pixels. So, even though our letter is really only ONE item, we can really think about it as having 784 inputs.
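Flattening the 28-by-28 grid into a 784-entry input vector is one line with NumPy. In this sketch, a random array of grayscale values stands in for a real scanned letter:

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for a scanned letter: grayscale intensities in [0, 1).
image = rng.random((28, 28))

# "Elongate" the matrix into a vector: one entry per pixel.
input_vector = image.reshape(784)
print(input_vector.shape)  # (784,)
```

No information is lost in the flattening; the same 784 numbers are just laid out in a shape the input layer can consume.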
Of course, if our goal is to classify a handwritten letter, our last layer would consist of 52 neurons — one for each upper and lowercase letter in the alphabet. But what do the hidden layers represent?
As previously stated, a neural network breaks our problem into smaller and smaller problems. Remember the initial photo of the dog with the tilted head? When I approached classifying that image, I looked for various features; I mentioned the fur, some eyes, and looked at the location of these aspects. The computer is doing the exact same thing.
When looking at a lowercase e, or any letter for that matter, we can split the letter into various building blocks. The lowercase ‘e’ has one horizontal line in the middle, a semicircle towards the top, and a half-loop at the bottom. A lowercase ‘l’ is made of one vertical line. An ‘i’? A short line and a dot.
Imagine that each node in the hidden layer corresponds to one of these ‘building blocks.’ We can imagine how the weights of the building blocks would vary depending on which output neuron we are feeding into. As stated above, a short horizontal line in the middle would probably have a pretty high weight going to the uppercase-A output node, as well as the lowercase-e output node (and f, H, F, t).
Of course, the problem can be broken into even smaller, more specific pieces corresponding to more hidden layers. Maybe we look at where the ‘edges’ of the letters are. Or maybe we break the building blocks into even smaller, and smaller building blocks. Regardless, I think you get the idea.
Of course, training the actual neural network is a whole different story. I won’t go into the specifics of how this works, but essentially the network trains by trying out a variety of different weights and biases for each node across your entire set of training observations, and seeing which model (that is, which combination) gives the most accurate results. It does this efficiently using back-propagation and gradient descent. If you’re interested in learning more, I recommend this free online textbook.
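To give just a flavor of it, here is a toy sketch of gradient descent on a single sigmoid neuron: nudge the weights and bias downhill along the gradient of the error, many times. This is not the full back-propagation algorithm the textbook derives, and the data and learning rate are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Toy task: learn to output 1 for this single training example.
x = np.array([1.0, 0.0])
target = 1.0

weights = np.zeros(2)
bias = 0.0
learning_rate = 1.0

for _ in range(100):
    output = sigmoid(weights @ x + bias)
    error = output - target
    # Gradient of the squared error with respect to weights and bias.
    grad = error * output * (1 - output)
    weights -= learning_rate * grad * x
    bias -= learning_rate * grad

print(sigmoid(weights @ x + bias))  # noticeably closer to 1 than the untrained 0.5
```

Back-propagation is “just” an efficient way to compute these gradients for every weight in a many-layered network at once.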
Back to the meme:
If a neural network can classify a handwritten letter by breaking the problem into smaller and smaller parts, then you can start to understand the way it classifies an image of a dog. It learns the patterns of the pixels by observing many, MANY images and then gets to work, breaking the problem into smaller and smaller parts. It starts to become clear, then, that a neural network could also learn to predict, given an image, the ‘correct’ orientation. This all boils down to properly training the network; we taught our image classifier what a dog looks like when the image is correctly displayed, but in many scenarios it hasn’t been taught the possibilities of perspective, or at least not well. Not surprisingly, or coincidentally, our human brains can also be rigid in considering these unusual cases. Perhaps I, too, need to retrain my brain to remember that a simple shift in perspective may be the trick to clarity.