
Neural networks and a dive into Julia

by Eric Chiang


Julia is an ambitious language. It aims to be as fast as C, as easy to use as Python, and as statistically inclined as R (to name a few goals). Read more about the language and why the creators are working so hard on it here: Why We Created Julia.

These claims have led to a few heated discussions (and a few flame wars) around benchmarking, along the lines of "Is Julia really faster than [insert your favorite language/package here]?" I don't think I'm the person to add to that particular conversation, but what I will say is this: Julia is fun.

A few weekends ago, I decided to casually brush up on my neural networks. Why? Well, for starters, neural networks are super interesting. Additionally, I was keen to revisit the topic given all the activity around "deep learning" in the Twittersphere.

"There has been a great deal of hype surrounding neural networks, making them seem magical and mysterious. As we make clear in this section, they are just nonlinear statistical models"

Hastie, Tibshirani, & Friedman; Elements of Statistical Learning (2008)

Not magic, just lots of interesting (or boring depending on perspective) math.

Julia

Julia turned out to be the perfect language for digging into the guts of a machine learning algorithm. Plus, I've been hankering to get my hands dirty with this exciting new language for a while.

The remainder of this post outlines a neural network package I made in Julia using only Julia's standard libraries. I've resisted throwing the actual code in this post and focused primarily on showing off some cool Julia stuff and discussing the machine learning. If you're keen to see the code inside the neural network package, you can find it on GitHub.

Of course, first things first.

In [1]:
println("Hello, Julia world!")
Hello, Julia world!

The data set

I'll be using the UC Irvine Machine Learning Repository's wine dataset. It's a nice tabular, numeric csv file that saves us the hassle of having to acquire and clean new example data.

Download: http://ieor.berkeley.edu/~aaswani/sp14_ieor290a/homeworks/winequality-red.csv

Julia DataFrames: Getting started

Julia, like R and Python's pandas, has a data frame library for exactly these kinds of problems. I won't be showing off the library's more advanced features; I'm simply using it to take the headache out of reading CSVs and isolating columns.

Note: To install Julia packages run Pkg.add("PACKAGENAME") within the Julia interpreter.
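
For example, the two registered packages used below can be installed from the Julia prompt with:

Pkg.add("DataFrames")
Pkg.add("Gadfly")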

In [2]:
# look familiar?
using DataFrames
train_df = readtable("winequality-red.csv",separator=';');
train_df
Out[2]:
DataFrame with 1599 rows, 12 columns
Columns:

fixed acidity             1599 non-null values
volatile acidity          1599 non-null values
citric acid               1599 non-null values
residual sugar            1599 non-null values
chlorides                 1599 non-null values
free sulfur dioxide       1599 non-null values
total sulfur dioxide      1599 non-null values
density                   1599 non-null values
pH                        1599 non-null values
sulphates                 1599 non-null values
alcohol                   1599 non-null values
quality                   1599 non-null values


For plots, I'm using Julia's Gadfly lib, a ggplot2-inspired, d3-powered package chock full of awesome. Let's look at the distribution of quality across our wines.

In [22]:
using Gadfly

# Get value counts of "quality" classes
_, count = hist(train_df["quality"])
class = sort(unique(train_df["quality"]))
value_counts = DataFrame(count=count, class=class)
value_counts
Out[22]:
6x2 DataFrame:
        count class
[1,]       10     3
[2,]       53     4
[3,]      681     5
[4,]      638     6
[5,]      199     7
[6,]       18     8

In [4]:
# Draw a plot
p = plot(value_counts,
         x="class", y="count",
         Geom.bar(), Guide.title("Class distributions (\"quality\")")
)
draw(PNG(14cm,10cm),p)

As shown, the dataset comprises six unique classes, with most observations falling into class 5 or class 6.

Test / train

It is fairly standard procedure in machine learning to split training data into separate train and test groups. A train/test split simply excludes a portion of the data (the test set) during training; predicting on that holdout set afterward gives an unbiased evaluation of the model.

So let's create our train/test split! This code will look very familiar to those acquainted with pandas or R's data.frame (for those less familiar, check out some of our other awesome blog posts).

In [21]:
# it's like SciPy had a love child with R
y = vector(train_df["quality"])
X = matrix(train_df[[colnames(train_df) .!= "quality"]])

# Generate train/test split
n = length(y)
is_train = shuffle([1:n] .> floor(n * .25))

X_train,X_test = X[is_train,:],X[!is_train,:]
y_train,y_test = y[is_train],y[!is_train]

# print stuff to the user
println("Total number of observations: ",n)
println("Training set size: ", sum(is_train))
println("Test set size: ", sum(!is_train))
Total number of observations: 1599
Training set size: 1200
Test set size: 399

Julia Types: Building a standard scalar

Let's consider a toy example--modeling a company's value based on only two variables (# of employees and revenue). Clearly, both fields will contain numbers, but their magnitudes will be very different (e.g. the former might be in the tens or hundreds while the latter could be orders of magnitude larger). This can present problems for machine learning algorithms.

Some models handle these scale differences well while others do not. Unfortunately, it's hard to say for certain how well a particular algorithm will deal with these scenarios without understanding what's going on under the hood. That said, some cases are easier to reason about than others. k-nearest-neighbors clearly cares if one feature is arbitrarily larger than another, since that feature will have a disproportionate effect on distance. But other cases aren't so easy.

Now imagine me coding on a Sunday evening: my new neural network package looks broken; it's failing a basic machine learning prediction problem. Is there a typo in my code? Is Julia behaving in a way a Python/R user wouldn't expect? Did I fundamentally misunderstand the training process? Three or four frustrating hours later, I found my answer. It turns out that neural networks are, at least partially, scale sensitive.$^*$

Word to the wise: if you're even a little unsure, scale your features.

$^*$ Technical note: I think this has to do with how weights and biases are initialized before backpropagation.

In [6]:
# Kinda like a C struct
type StandardScalar
    mean::Vector{Float64}
    std::Vector{Float64}
end

# Helper function to initialize an empty scalar
function StandardScalar()
    StandardScalar(Array(Float64,0),Array(Float64,0))
end
Out[6]:
StandardScalar (constructor with 2 methods)

The scaling strategy I'll be implementing is fairly basic. For each column, subtract out the mean and divide by the standard deviation. This won't make my classifier immune to outliers or fat tails, but it will go a long way in making the model behave better.
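
Written out, each entry of column $j$ is mapped to $x'_{ij} = (x_{ij} - \mu_j) / \sigma_j$, where $\mu_j$ and $\sigma_j$ are that column's mean and standard deviation (computed on the training data).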

In [7]:
# Compute mean and standard deviation of each column
function fit_std_scalar!(std_scalar::StandardScalar,X::Matrix{Float64})
    n_rows, n_cols = size(X)
    std_scalar.std = zeros(n_cols)
    std_scalar.mean = zeros(n_cols)
    # for loops are fast again!
    for i = 1:n_cols
        std_scalar.mean[i] = mean(X[:,i])
        std_scalar.std[i] = std(X[:,i])
    end
end
Out[7]:
fit_std_scalar! (generic function with 1 method)

Unlike in NumPy or R, when you multiply two arrays in Julia with *, you get matrix multiplication, not element-wise multiplication.

In [8]:
array1 = [1,2,3] # These are fixed length vectors (Nx1 arrays), not lists
array2 = [4,5,6]
array1 * array2
no method *(Array{Int64,1},Array{Int64,1})
at In[8]:3

Why does that code fail? You can't multiply an $N \times 1$ matrix by another $N \times 1$ matrix because the dimensions don't align. Element-wise operations (broadcasting operators) are specified with their own syntax: a period before the operator, as in .*, ./, and .^.
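
For instance, here's a quick sketch of the element-wise version of the product that failed above:

# same vectors as before
array1 = [1,2,3]
array2 = [4,5,6]

# element-wise multiplication with the dot operator
array1 .* array2   # => [4, 10, 18]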

You can also transpose a matrix with the "tick" symbol ('). This has some interesting properties when combined with a broadcasting operator.

In [9]:
x = [1,2,3]
y = [1,2,3,4]
x .* y' # same behavior as NumPy (though much easier to construct)
Out[9]:
3x4 Array{Int64,2}:
 1  2  3   4
 2  4  6   8
 3  6  9  12

I can use that trick to further vectorize the transformation, making the column-wise operations much more concise (and faster).

In [10]:
function transform(std_scalar::StandardScalar,X::Matrix{Float64})
    (X .- std_scalar.mean') ./ std_scalar.std' # broadcasting fu
end

# fit and transform in one function
function fit_transform!(std_scalar::StandardScalar,X::Matrix{Float64})
    fit_std_scalar!(std_scalar,X)
    transform(std_scalar,X)
end
Out[10]:
fit_transform! (generic function with 1 method)

Okay, let's look at the end result of scaling. I'll fit the scalar on the training data, then use it to transform both the training and test sets.

In [11]:
std_scalar = StandardScalar()

n_rows, n_cols = size(X_test)

# what do columns look like before scaling?
println("Column means before scaling:")
for i = 1:n_cols
    @printf("%0.3f ",(mean(X_test[:,i])))
end

X_train = fit_transform!(std_scalar,X_train)
X_test = transform(std_scalar,X_test)

# ... and after scaling?
println("\nColumn means after scaling:")
for i = 1:n_cols
    @printf("%0.3f ",(mean(X_test[:,i])))
end
Column means before scaling:
8.268 0.531 0.262 2.517 0.085 16.114 47.401 0.997 3.318 0.663 10.463
Column means after scaling:
-0.040 0.021 -0.061 -0.020 -0.061 0.030 0.037 -0.032 0.058 0.042 0.051

Great. Rather than having a wild distribution of magnitudes, we've normalized the features to similar ranges. On to the predictions!

Artificial Neural Networks

Artificial neural networks are an attractive model-type because of their ability to model non-linear relationships.

Since I haven't registered my package with Julia's official package listing, it can't be installed with Pkg.add(). However, the Julia package system uses git to distribute libraries, so if you want to grab it you can just use:

Pkg.clone("https://github.com/EricChiang/ANN.jl.git")
In [12]:
using ANN

Neural networks are composed of layers, each of which is a set of linear models. To initialize a neural network, we only have to specify the size of one layer, the so-called "hidden" layer. I'll explain later what that means, but for now we'll say that the size of the hidden layer influences how precisely the model learns the training set.

The larger the hidden layer, the more likely the model is to overfit. The smaller the hidden layer, the more likely the model is to overgeneralize (underfit).

In [13]:
ann = ArtificialNeuralNetwork(12)
Out[13]:
ArtificialNeuralNetwork([],[12],[])

I've designed the neural network module to use syntax similar to scikit-learn's. The first two parameters, epochs and alpha, control the length and rate of training, respectively. lambda scales a penalty which attempts to control overfitting; larger values mean a more generalized model.

In [14]:
fit!(ann,X_train,y_train,epochs=30,alpha=0.1,lambda=1e-5)

Okay, does it work?

In [15]:
y_proba = predict(ann,X_test)

y_pred = Array(Int64,length(y_test))

for i in 1:length(y_test)
    # must translate class index to label
    y_pred[i] = ann.classes[indmax(y_proba[i,:])]
end

println("Prediction accuracy: ",mean(y_pred .== y_test))
Prediction accuracy: 0.5889724310776943

Of course, averages can be deceiving, and as we saw previously, the class distributions are skewed. Let's look at a confusion matrix.

In [16]:
function confusion_matrix(y_true::Array{Int64,1},y_pred::Array{Int64,1})
    # Generate confusion matrix
    classes = sort(unique([unique(y_true),unique(y_pred)]))
    cm = zeros(Int64,length(classes),length(classes))

    for i in 1:length(y_true)
        # translate label to index
        true_class = findfirst(classes,y_true[i])
        pred_class = findfirst(classes,y_pred[i])
        # pred class is the row, true class is the column
        cm[pred_class,true_class] += 1
    end
    cm
end

confusion_matrix(y_test,y_pred)
Out[16]:
6x6 Array{Int64,2}:
 0   0    0   0   0  0
 0   0    0   0   0  0
 4  16  135  67   4  1
 0   4   21  89  37  6
 0   0    2   2  11  0
 0   0    0   0   0  0

The columns represent the true classes; the rows represent the classes predicted by the model. For example, the 67 in the third row and fourth column means 67 wines with a true quality of 6 were predicted to be a 5. In this case, the predictions aren't perfect, but the results are significantly better than random guessing.

How does it work?

A neural network is a combination of linear models. It takes the input features of an observation ($x_1, x_2, \dots, x_n$) and produces class probabilities for that observation. In the diagram below, the output layer creates estimates for three classes ($P(c_1),P(c_2),P(c_3)$).

The real magic comes from producing the intermediate results ($a_1, a_2, \dots, a_m$). Let's consider a single value produced by the "hidden layer" (the light blue ones). $a_1$ is calculated by feeding the input fields ($x_1, x_2, \dots, x_n$) into a linear classifier known as an artificial neuron. The same is true for $a_2$, $a_3$, and all the way up to $a_m$. Each intermediate result is the product of a unique artificial neuron, making each layer of the neural network really just an amalgamation of linear models.
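
In symbols, each hidden value is roughly $a_j = f(w_j \cdot x + b_j)$, where $w_j$ is that neuron's weight vector, $b_j$ is its bias, and $f$ is its activation function: exactly the three fields of the ArtificalNeuron type defined below.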

So what does an artificial neuron look like?

In [17]:
type ArtificalNeuron
    weights::Vector{Float64}
    bias::Float64
    activation_func::Function
end

function activate(an::ArtificalNeuron, x::Vector{Float64})
    # sum of weights times input, then add bias
    pre_activation = an.weights' * x + an.bias
    an.activation_func(pre_activation)
end
Out[17]:
activate (generic function with 1 method)

An artificial neuron looks surprisingly like any other linear model: take an input vector, multiply it by some weights, and add a bias.

In [18]:
function sigm(a)
    1. / (1. + exp(-a))
end

# hard code in some weights and biases
weights = [1.,-3.,1.]
bias = 0.5
an = ArtificalNeuron(weights,bias,sigm)

# What are the outputs of these values?
x = [-2.,1.,0.]
activate(an,x)
Out[18]:
1-element Array{Float64,1}:
 0.0109869

Where artificial neurons differ from a least-squares regression is in the activation function. Though there are many functions used in neural networks, a common choice is the sigmoidal function. It takes the output of the linear model and squashes the value between 0 and 1.
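
In symbols: $\mathrm{sigm}(a) = \frac{1}{1 + e^{-a}}$, which is exactly what the sigm function above computes.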

In [19]:
# Gadfly is pretty wonderful
p = plot(sigm,-5,5,Guide.title("Sigmoidal activation function"))
draw(PNG(14cm,10cm),p)

Because the values approach 0 and 1 so quickly, this activation function allows us to think of an individual neuron as "active" (close to 1) or "inactive" (close to 0).
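
For example, $\mathrm{sigm}(5) \approx 0.993$ while $\mathrm{sigm}(-5) \approx 0.007$, so even moderately sized inputs push a neuron close to fully "active" or fully "inactive".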

The price of complexity

By aggregating many linear models, neural networks are able to capture nonlinear relationships in an attempt to generalize predictions. Neural networks, and strategies involving them such as deep learning, have become a major part of complex signal recognition tasks such as machine vision.

The algorithm excels at systematically evaluating features at a speed humans just can't match. For problems that necessitate high-dimensional, complex feature spaces, neural networks are an immensely valuable tool.

But there is a downside. The same strength at modeling interwoven relationships curtails an important property: the ability to open up the black box and understand what's going on. Often, knowing which features are important is as critical as the actual prediction (why do customers churn vs. will a customer churn). Neural networks aren't magic, and they're not the be-all and end-all.

Julia, on the other hand, is totally magic.