Julia is an ambitious language. It aims to be as fast as C, as easy to use as Python, and as statistically inclined as R (to name a few goals). Read more about the language and why the creators are working so hard on it here: Why We Created Julia.

These claims have led to a few heated discussions (and a few flame wars) around benchmarking along the lines of "Is Julia really faster than [insert your favorite language/package here]?" I don't think I'm the person to add to that particular conversation, but what I will say is this: Julia is fun.

A few weekends ago, I made the decision to *casually* brush up on my neural networks. Why? Well, for starters neural networks are super interesting. Additionally, I was keen to revisit the topic given all the activity around "deep learning" in the Twittersphere.

> "There has been a great deal of hype surrounding neural networks, making them seem magical and mysterious. As we make clear in this section, they are just nonlinear statistical models."
>
> — Hastie, Tibshirani, & Friedman; Elements of Statistical Learning (2008)

Not magic, just lots of interesting (or boring depending on perspective) math.

## Julia

Julia turned out to be the perfect language for digging into the guts of a machine learning algorithm. Plus, I've been hankering to get my hands dirty with this exciting new language for a while.

The remainder of this post outlines a neural network package I made in Julia using only Julia's standard libraries. I've resisted throwing the actual code in this post and focused primarily on showing off some cool Julia stuff and discussing the machine learning. If you're keen to see the code inside the neural network package, you can find it on GitHub.

Of course, first things first.

```
println("Hello, Julia world!")
```

## The data set

I'll be using the UC Irvine Machine Learning Repository's wine dataset. It's a nice tabular, numeric CSV file that saves us the hassle of having to acquire and clean new example data.

Download: http://ieor.berkeley.edu/~aaswani/sp14_ieor290a/homeworks/winequality-red.csv

## Julia DataFrames: Getting started

Julia, like R and Python's pandas, has a data frame library for exactly these kinds of problems. I won't be showing off the library's more advanced features; I'm simply using it to take the headache out of reading CSVs and isolating columns.

Note: To install Julia packages, run `Pkg.add("PACKAGENAME")` within the Julia interpreter.

```
# look familiar?
using DataFrames
train_df = readtable("winequality-red.csv",separator=';');
train_df
```

For plots, I'm using Julia's Gadfly lib, a ggplot2-inspired, d3-powered package chock full of awesome. Let's look at the distribution of `quality` across our wines.

```
using Gadfly
# Get value counts of "quality" classes
_, count = hist(train_df["quality"])
class = sort(unique(train_df["quality"]))
value_counts = DataFrame(count=count, class=class)
value_counts
```

```
# Draw a plot
p = plot(value_counts,
         x="class", y="count",
         Geom.bar(), Guide.title("Class distributions (\"quality\")"))
draw(PNG(14cm,10cm), p)
```

As shown, the dataset comprises six unique classes, with most observations falling into class `5` or class `6`.

## Test / train

It is fairly standard procedure in machine learning to split training data into separate train and test groups. Train/test splits simply exclude a portion of the data (the test set) during training, and predict on the holdout dataset to produce an unbiased evaluation.

So let's create our train/test split! This code will look *very* familiar to those acquainted with pandas or R `data.frame` (for those less familiar, check out some of our other awesome blog posts).

```
# it's like SciPy had a love child with R
y = vector(train_df["quality"])
X = matrix(train_df[[colnames(train_df) .!= "quality"]])

# Generate train/test split
n = length(y)
is_train = shuffle([1:n] .> floor(n * .25))

X_train, X_test = X[is_train,:], X[!is_train,:]
y_train, y_test = y[is_train], y[!is_train]

# print stuff to the user
println("Total number of observations: ", n)
println("Training set size: ", sum(is_train))
println("Test set size: ", sum(!is_train))
```

## Julia Types: Building a standard scalar

Let's consider a toy example: modeling a company's value based on only two variables, `# of employees` and `revenue`. Clearly, these fields will contain numbers. But the magnitude of each will be very different (e.g. the former might be in the tens or hundreds while the latter may be orders of magnitude larger). This can present problems for machine learning algorithms.

Some models handle these scale differences well while others do not. Unfortunately, it's hard to say for certain how well a particular algorithm will deal with these scenarios without an understanding of what's going on under the hood. That said, some cases are easier to reason about than others. k-nearest-neighbors clearly cares if one feature is arbitrarily larger than another, since that feature will have a disproportionate effect on distance. But other cases aren't so easy.
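To see why distance-based methods care, here's a small NumPy sketch (illustrative numbers only, not from the wine data): when revenue is measured in raw dollars, the employee-count feature contributes almost nothing to the Euclidean distance.

```python
import numpy as np

# Two companies: [number of employees, revenue in dollars]
a = np.array([50.0, 1_000_000.0])
b = np.array([500.0, 1_010_000.0])

# Unscaled: the revenue column dominates the distance entirely
raw_dist = np.linalg.norm(a - b)

# Distance contributed by the revenue column alone
revenue_only = abs(a[1] - b[1])

# The 450-employee difference barely registers
print(raw_dist)      # ~10010.1
print(revenue_only)  # 10000.0
```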

Now imagine me coding on a Sunday evening: My new neural network package looks broken; it's failing at a basic machine learning prediction problem. Is there a typo in my code? Is Julia behaving in a way a Python/R user wouldn't expect? Did I fundamentally misunderstand the training process? Three or four frustrating hours later, I found my answer. It turns out that neural networks are, at least partially, scale sensitive.$^*$

Word to the wise: if you're even a little unsure, scale your features.

$^*$ Technical note: I think this has to do with how weights and biases are initialized before backpropagation.

```
# Kinda like a C struct
type StandardScalar
    mean::Vector{Float64}
    std::Vector{Float64}
end

# Helper function to initialize an empty scalar
function StandardScalar()
    StandardScalar(Array(Float64,0), Array(Float64,0))
end
```

The scaling strategy I'll be implementing is fairly basic. For each column, subtract out the mean and divide by the standard deviation. This won't make my classifier immune to outliers or fat tails, but it will go a long way in making the model behave better.

```
# Compute mean and standard deviation of each column
function fit_std_scalar!(std_scalar::StandardScalar, X::Matrix{Float64})
    n_rows, n_cols = size(X)
    std_scalar.std = zeros(n_cols)
    std_scalar.mean = zeros(n_cols)
    # for loops are fast again!
    for i = 1:n_cols
        std_scalar.mean[i] = mean(X[:,i])
        std_scalar.std[i] = std(X[:,i])
    end
end
```
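For readers coming from Python: the same column-wise standardization (fit the mean and standard deviation, then transform) is only a couple of lines in NumPy. This is a sketch of the idea, not the package's actual code:

```python
import numpy as np

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

# Fit: per-column mean and standard deviation
mu = X.mean(axis=0)
sigma = X.std(axis=0, ddof=1)  # ddof=1 matches Julia's std()

# Transform: broadcasting handles the column-wise math
X_scaled = (X - mu) / sigma

print(X_scaled.mean(axis=0))            # ~[0, 0]
print(X_scaled.std(axis=0, ddof=1))     # [1, 1]
```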

Unlike in NumPy or R, when you multiply two arrays in Julia you get matrix multiplication, not element-wise multiplication.

```
array1 = [1,2,3] # These are fixed length vectors (Nx1 arrays), not lists
array2 = [4,5,6]
array1 * array2
```

Why does that code fail? You can't multiply an $N \times 1$ matrix by another $N \times 1$ matrix because the dimensions don't align. Element-wise operations (broadcasters) are specified with their own syntax, a period before the operator: `.*`, `./`, `.^`.

You can also transpose a matrix with a "tick" symbol. This has some interesting properties when combined with a broadcaster.

```
x = [1,2,3]
y = [1,2,3,4]
x .* y' # same behavior as NumPy (though much easier to construct)
```
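For readers curious what the NumPy construction alluded to in the comment looks like, here's a quick sketch. It requires an explicit reshape to get the column-against-row broadcast:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([1, 2, 3, 4])

# Broadcasting a column vector against a row vector yields
# the same 3x4 outer product as Julia's x .* y'
outer = x[:, np.newaxis] * y[np.newaxis, :]
print(outer)
```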

I can use that trick to further vectorize the transformation, making the column-wise operations much more concise (and faster).

```
function transform(std_scalar::StandardScalar, X::Matrix{Float64})
    (X .- std_scalar.mean') ./ std_scalar.std'  # broadcasting fu
end

# fit and transform in one function
function fit_transform!(std_scalar::StandardScalar, X::Matrix{Float64})
    fit_std_scalar!(std_scalar, X)
    transform(std_scalar, X)
end
```

Okay, let's look at the end result of scaling. I'll fit the scalar on the training data then transform the test.

```
std_scalar = StandardScalar()
n_rows, n_cols = size(X_test)

# what do columns look like before scaling?
println("Column means before scaling:")
for i = 1:n_cols
    @printf("%0.3f ", mean(X_test[:,i]))
end

X_train = fit_transform!(std_scalar, X_train)
X_test = transform(std_scalar, X_test)

# ... and after scaling?
println("\nColumn means after scaling:")
for i = 1:n_cols
    @printf("%0.3f ", mean(X_test[:,i]))
end
```

Great. Rather than having a wild distribution of magnitudes, we've normalized the features to similar ranges. On to the predictions!

## Artificial Neural Networks

Artificial neural networks are an attractive model-type because of their ability to model non-linear relationships.

Since I haven't registered my package with the official Julia distribution, it can't be installed with `Pkg.add()`. However, since the Julia package system uses git to distribute libraries, if you want to grab it you can just use:

```
Pkg.clone("https://github.com/EricChiang/ANN.jl.git")
```

```
using ANN
```

Neural networks are built from layers, each of which is a set of linear models. To initialize a neural network, we only have to specify the size of one layer, the so-called "hidden" layer. I'll explain later what that means, but for now we'll say that the size of the hidden layer influences how precisely the model learns the training set.

The larger the hidden layer, the more likely the model is to overfit. The smaller the hidden layer, the more likely the model is to overgeneralize.

```
ann = ArtificialNeuralNetwork(12)
```

I've designed the neural network module to use syntax similar to scikit-learn's. The first two parameters, `epochs` and `alpha`, control the length and rate of training, respectively. `lambda` scales a penalty that attempts to control over-fitting; larger values mean a more generalized model.

```
fit!(ann,X_train,y_train,epochs=30,alpha=0.1,lambda=1e-5)
```
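To make those knobs concrete, here's a rough sketch of how `alpha` and `lambda` typically enter a single gradient-descent update, assuming a standard L2 weight penalty (the package's internals may differ in detail):

```python
import numpy as np

def sgd_step(w, grad, alpha, lam):
    """One gradient step: alpha is the learning rate,
    lam scales an L2 penalty pulling weights toward zero."""
    return w - alpha * (grad + lam * w)

w = np.array([2.0, -1.0])
grad = np.array([0.5, 0.5])

# A larger lambda shrinks the weights harder,
# which is what makes the model more "generalized"
small_pen = sgd_step(w, grad, alpha=0.1, lam=1e-5)
large_pen = sgd_step(w, grad, alpha=0.1, lam=10.0)
print(small_pen, large_pen)
```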

Okay, does it work?

```
y_proba = predict(ann, X_test)
y_pred = Array(Int64, length(y_test))
for i in 1:length(y_test)
    # must translate class index to label
    y_pred[i] = ann.classes[indmax(y_proba[i,:])]
end
println("Prediction accuracy: ", mean(y_pred .== y_test))
```

Of course averages can be deceiving, and as we saw previously, the class distributions are skewed. Let's look at a confusion matrix.

```
function confusion_matrix(y_true::Array{Int64,1}, y_pred::Array{Int64,1})
    # Generate confusion matrix
    classes = sort(unique([unique(y_true), unique(y_pred)]))
    cm = zeros(Int64, length(classes), length(classes))
    for i in 1:length(y_true)
        # translate label to index
        true_class = findfirst(classes, y_true[i])
        pred_class = findfirst(classes, y_pred[i])
        # pred class is the row, true class is the column
        cm[pred_class, true_class] += 1
    end
    cm
end
confusion_matrix(y_test, y_pred)
```

The columns represent the true classes, the rows represent the classes predicted by the model. In this case, the predictions aren't perfect but the results are significantly better than random guessing.
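One handy consequence of this layout: correct predictions sit on the diagonal, so overall accuracy and per-class accuracy fall right out of the matrix. A quick sketch with a made-up 3x3 confusion matrix (not the actual wine results):

```python
import numpy as np

# rows = predicted class, cols = true class (same convention as above)
cm = np.array([[10,  2,  0],
               [ 3, 15,  4],
               [ 0,  1,  8]])

# Correct predictions are on the diagonal
accuracy = np.trace(cm) / cm.sum()

# Per-class recall: diagonal over column sums (column = true class)
recall = np.diag(cm) / cm.sum(axis=0)

print(accuracy)
print(recall)
```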

## How does it work?

A neural network is a combination of linear models. It takes some set of input observations ($x_1, x_2, \dots, x_n$) and produces class probabilities for that observation. In the diagram below, the output layer will create estimates for three classes ($P(c_1),P(c_2),P(c_3)$).

The real magic comes from producing the intermediate results ($a_1,a_2,...,a_n$). Let's consider a single value produced by the "hidden layer" (the light blue ones). $a_1$ is calculated by feeding the input fields ($x_1, x_2, \dots, x_n$) into a linear classifier known as an artificial neuron. The same is true for $a_2$, $a_3$ and all the way up to $a_m$. Each intermediate result is the product of a unique artificial neuron, making each layer of the neural network really just an amalgamation of linear models.

So what does an artificial neuron look like?

```
type ArtificalNeuron
    weights::Vector{Float64}
    bias::Float64
    activation_func::Function
end

function activate(an::ArtificalNeuron, x::Vector{Float64})
    # sum of weights times input, then add bias
    pre_activation = an.weights' * x + an.bias
    an.activation_func(pre_activation)
end
```

An artificial neuron looks surprisingly like any other linear model. Take an input value, multiply it by some weights and add a bias.

```
function sigm(a)
    1. / (1. + exp(-a))
end

# hard code in some weights and biases
weights = [1., -3., 1.]
bias = 0.5
an = ArtificalNeuron(weights, bias, sigm)

# What are the outputs of these values?
x = [-2., 1., 0.]
activate(an, x)
```

Where artificial neurons differ from least-squares regression is in the activation function. Though there are many functions used in neural networks, a common choice is the sigmoidal function. It takes the output of the linear model and squashes the value between 0 and 1.

```
# Gadfly is pretty wonderful
p = plot(sigm,-5,5,Guide.title("Sigmoidal activation function"))
draw(PNG(14cm,10cm),p)
```

Because the values approach 0 and 1 so quickly, this activation function allows us to think of an individual neuron as "active" (close to 1) or "inactive" (close to 0).
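A quick numeric check of that "active/inactive" intuition, using the same function as the Julia `sigm` above (plain Python here for comparison):

```python
import math

def sigm(a):
    return 1.0 / (1.0 + math.exp(-a))

# Near zero the output is ambiguous; a few units away it saturates
print(sigm(0))   # 0.5
print(sigm(5))   # ~0.993 -> effectively "active"
print(sigm(-5))  # ~0.007 -> effectively "inactive"
```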

## The price of complexity

By aggregating many linear models, neural networks are able to consider nonlinear relationships in an attempt to generalize predictions. Neural networks, and strategies built on them such as deep learning, have become a major part of complex signal recognition tasks such as machine vision.

> @cjordansquire afaik all challenges that involve signal data (sound or vision) are now won by deep CNNs with ReLU + dropout.
>
> — Olivier Grisel (@ogrisel) April 5, 2014

The algorithm excels at systematically evaluating features at a speed humans just can't match. For problems that necessitate high-dimensional and complex feature spaces, neural networks are an immensely valuable tool.

But there is a downside. The same strength at modeling interwoven relationships curtails an important property: the ability to open up the black box and understand what's going on. Often, knowing which features are important is as critical as the actual prediction (*why* do customers churn vs. *will* a customer churn). Neural networks aren't magic, and they're not the end-all, be-all.

Julia, on the other hand, is totally magic.