Machine Learning

(Ref. Huda Nasser, Julia Academy - Data Science)

In this chapter, we will explore the fundamentals of machine learning by working with the MNIST dataset, a classic benchmark in computer vision. The MNIST dataset consists of 70,000 handwritten digits (0 through 9), split into 60,000 training images and 10,000 test images. Each image is a 28×28 grayscale pixel grid, making the dataset ideal for experimenting with classification models.

The Julia language offers powerful packages, including Flux.jl (for building neural networks), MLDatasets.jl (to access standard datasets) and OneHotArrays.jl (for target batching). Throughout the exercise, we will use a set of tools (Images.jl, ImageInTerminal.jl, Plots.jl) to make visual checks along the way.

The exercise in this chapter will guide you through the following steps:

  1. Load and visualize the MNIST dataset;
  2. Preprocess the data for model training;
  3. Build and train a simple machine learning model (here a neural network);
  4. Evaluate the model’s performance on unseen data.

MNIST dataset

The MNIST dataset can be retrieved from the MLDatasets.jl package. Start by loading the training dataset.

using MLDatasets
d_train = MNIST(split=:train)
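
Before looking at individual images, you can confirm the shape of what has just been loaded; the sizes below follow directly from the 60,000-image training split described above.

size(d_train.features)    # (28, 28, 60000): one 28×28 matrix per image
length(d_train.targets)   # 60000: one label per image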

What does this dataset actually look like? You can check this by typing the following commands.

using Images
using ImageInTerminal
colorview(Gray,d_train.features[:,:,1])
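
Note that the digit may appear flipped along its diagonal: MLDatasets.jl stores the pixel matrices in column-major order, so transposing them recovers the usual orientation. This is a display detail only and does not affect training.

colorview(Gray, d_train.features[:,:,1]')   # transpose for the upright orientation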

At times it is unclear which digit is handwritten on the image. To clarify this, have a look at the label associated with the image.

d_train.targets[1]

Neural network

A neural network is a type of machine learning model inspired by the structure and function of the human brain. It is composed of layers of interconnected nodes called neurons, which work together to process data, recognize patterns, and make predictions.

At its core, a neural network learns to approximate complex functions by adjusting the weights and biases of these connections based on the data it sees.

A standard neural network to start off with for the MNIST dataset has the following structure:

Input (784) → Dense (32) → ReLU → Dense (10) → Softmax → Output (Digit 0–9)

The 28-by-28 grayscale images are flattened into a 784-element vector. No activation function is applied at this stage; the input is simply passed to the next layer. The input data then reaches a 32-neuron hidden layer, which computes a weighted sum of the inputs, adds a bias, and passes the result through an activation function, here ReLU (Rectified Linear Unit), to introduce non-linearity. The output layer has 10 neurons, consistent with the 10 possible targets (0 through 9). Since this is a classification task on handwritten digits, we use a Softmax activation function to convert the outputs into probabilities that sum to 1.
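
To make these operations concrete, here is a minimal sketch of one forward pass written by hand; the names x, W1, b1, W2, b2 and the random values are purely illustrative, not trained parameters.

using Flux   # provides relu and softmax

x  = rand(Float32, 784)                  # stand-in for one flattened image
W1 = randn(Float32, 32, 784); b1 = zeros(Float32, 32)
W2 = randn(Float32, 10, 32);  b2 = zeros(Float32, 10)

h = relu.(W1 * x .+ b1)                  # hidden layer: weighted sum + bias, then ReLU
p = softmax(W2 * h .+ b2)                # output layer: scores turned into probabilities
sum(p)                                   # ≈ 1, as expected after Softmax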

Preprocess the dataset

The neural network we will be using in this exercise requires a 1D-vector of length 784 input. Start by flattening the matrices representing the images of our dataset using Flux.jl.

using Flux
v_train = Flux.flatten(d_train.features)   # 28×28×60000 array → 784×60000 matrix

You should now use OneHotArrays.jl to transform the target array to vectors of 10 elements, with 1 at the index of the target digit.

using OneHotArrays
Y = onehotbatch(d_train.targets,0:9)
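
You can inspect the encoding of the first image: its column should contain a single 1, at position d_train.targets[1] + 1 (index 1 corresponds to digit 0).

Y[:,1]   # a 10-element one-hot vector for the first target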

Set up the neural network

The lines of code below are simply a translation of the neural network schematic into Julia.

m = Chain(
          Dense(28*28, 32, relu),   # hidden layer: 784 inputs, 32 neurons, ReLU activation
          Dense(32, 10),            # output layer: 10 neurons, one per digit
          softmax                   # turn the 10 outputs into probabilities
         )
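
Before training, it can be instructive to count the parameters this model will adjust: (784 × 32 + 32) weights and biases in the hidden layer plus (32 × 10 + 10) in the output layer, i.e. 25,450 in total. Flux.params is one way to collect them for a quick check.

sum(length, Flux.params(m))   # 25450 trainable parameters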

What happens if we apply this neural network to one of the images?

m(v_train[:,1])

The network is not able to determine which digit is associated with this image. The weights and biases of the connections between the neurons have not yet been adjusted, since the neural network we created has not yet been trained.

Training

You can start by having a look at the training function within Flux.jl in the following way.

#| output: false 
? Flux.train!
Warning

Take care, when changing package versions, to review the major changes. For instance, from version 0.14 of Flux.jl onwards, the syntax for Flux.train! changed: it went from Flux.train!(loss, params(model), data, opt) to Flux.train!(loss, model, data, opt_state) # using the new "setup" from Optimisers.jl.

When a neural network makes predictions (like classifying an image as a “3” instead of a “7”), we need a way to measure the difference between the predicted output and the actual (true) target.

The loss function provides this measure. It returns a numerical value that represents the “error” — the higher the value, the worse the prediction. Since we have a classification problem in this exercise, a typical loss choice is the cross-entropy loss.

loss(m, x, y) = Flux.Losses.crossentropy(m(x), y)
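
As a sanity check, you can evaluate this loss for the untrained network on the whole training set. With 10 classes and essentially random initial weights, the value should be roughly -log(1/10) ≈ 2.3; this is a rough expectation, not a guarantee.

loss(m, v_train, Y)   # expect a value around 2.3 before training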

To train the neural network properly, we wish to minimize the loss function. To do so, we will be using a variant of gradient descent called Adam.

optimizer = Flux.setup(Adam(), m)

When training a neural network, we often need to go over the training data multiple times. Each full pass over the training data is called an epoch.

using IterTools: ncycle
dataset = ncycle([(v_train, Y)], 200)   # repeat the (inputs, targets) pair 200 times

The cyclic iterator constructed in the cell above tells Flux.train! to train for 200 epochs: the network will see the training data 200 times.

Let’s train the neural network now!

Flux.train!(loss, m, dataset, optimizer)

So, does it work better than previously on our first image?

tst = m(v_train[:,1])        # output probabilities for the first training image
cls = argmax(tst) - 1        # predicted digit (indices 1–10 map to digits 0–9)
tgt = d_train.targets[1]     # true digit
println("Image classified as ", cls, " with target ", tgt, ".")

We cannot conclude from only one item out of the whole set. In the next section, we will analyse the performance of the neural network on unseen data, which will allow us to draw firmer conclusions.

Let us now have a look under the hood of Flux.train!. What is happening in the training loop?

  1. Take a subset of input data with associated targets: a batch;
  2. Determine whether the model m predicts the targets well: use the loss function;
  3. Find out in which direction each model parameter should move: compute the gradient of the loss with respect to each parameter;
  4. Adjust the parameters using the gradients and an optimizer.

#| eval: false
opt = Flux.setup(Adam(), m)                    # optimizer state for the model
loss(x, y) = Flux.Losses.crossentropy(x, y)    # compares predictions x with targets y

# Training loop
for epoch in 1:200
    # Compute the gradient of the loss with respect to each model parameter
    grads = Flux.gradient(m) do model
        result = model(v_train)
        loss(result, Y)
    end
    # Adjust the parameters using the gradients and the optimizer
    Flux.update!(opt, m, grads[1])
    println("Epoch $epoch | Loss: ", loss(m(v_train), Y))
end

What happens when you replace the Adam() optimizer by a standard Descent()? Or the loss by a Mean Square Error (MSE)? Can you find the available loss functions in the Flux.jl package documentation?
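
As a starting point for these experiments, here is a sketch of both swaps; the learning rate of 0.1 for Descent and the name loss_mse are arbitrary illustrative choices, not recommendations.

#| eval: false
opt_sgd = Flux.setup(Descent(0.1), m)          # plain gradient descent with a fixed step size
loss_mse(m, x, y) = Flux.Losses.mse(m(x), y)   # mean squared error instead of cross-entropy
Flux.train!(loss_mse, m, dataset, opt_sgd)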

Testing

We can now evaluate our trained neural network on unseen data, the so-called test dataset.

d_test = MNIST(split=:test)
for i in 1:10
    b = d_test.features[:,:,i]    # i-th test image as a 28×28 matrix
    v_b = reshape(b, 784)         # flatten to the 784-element input vector
    a = m(v_b)                    # 10 output probabilities
    r = argmax(a) - 1             # most probable digit
    println("Image classified as ", r, " with target ", d_test.targets[i], ".")
end

The results seem pretty good at first glance.
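
Ten images are still a very small sample, though. To quantify the performance, the sketch below computes the accuracy over the full test set using Flux.onecold; the names v_test, preds and acc are our own.

v_test = Flux.flatten(d_test.features)    # 784×10000 matrix
preds = Flux.onecold(m(v_test), 0:9)      # most probable digit per image
acc = sum(preds .== d_test.targets) / length(d_test.targets)
println("Test accuracy: ", round(100 * acc, digits=2), " %")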

As you may have noticed, the Plots.jl documentation is built upon examples. Much recent documentation begins with examples before moving on to the general definitions.

using StatsPlots

# Array with probabilities associated to target digits for each image
val = [m(reshape(d_test.features[:,:,i],784))[d_test.targets[i]+1] for i in 1:length(d_test.targets)]

group = fill("MNIST-Test", length(val))

violin(group, val, legend=false, title="Violin Plot", ylabel="Target digit probability")

What do you think about the model's performance based on the violin plot generated by the lines of code above?
