Machine learning—what language should I use?
By Ben Hauser, Typefi VP Engineering
Choosing a programming language for a project is an important decision with significant consequences. The most appropriate language can slash the time and effort required to solve a particular problem.
The original Typefi Engine was written in the programming language Java. It turns out this was actually a terrible decision, as Java has no direct access to the InDesign Object Model. We got around this at the time by writing a C++ InDesign plugin to expose an API for the Java program to use.
After several years of this madness we decided enough was enough and rewrote the Engine from scratch in JavaScript—a language with direct access to InDesign’s Object Model. The size of our program shrank dramatically and progress has been much swifter ever since.
It’s clear to me now that JavaScript is the best language for automating InDesign.
Machine learning
The dramatic rise of machine learning in recent years has me wondering if we can apply these modern techniques here at Typefi.
What if, instead of meticulously hand-crafting Typefi templates, you could just show the computer an existing publication and it would automatically create a matching template for you?
What if you could train a computer to automatically mark up new content to match a corpus of already-marked-up content?
Choosing the right language for this work upfront will have a dramatic effect on the success of such projects.
So I find myself asking the question: Machine learning—what language should I use?
The candidates
My three candidate languages are:
- Octave (MATLAB). This is the language chosen by Andrew Ng for his excellent Machine Learning course at Stanford. Andrew has stated this was a carefully considered decision based on his experience that students learn more quickly in this high-level language.
- Python. This seems to be the most popular choice for machine learning in industry.
- JavaScript. Since the Typefi Engine is already written in JavaScript we could apply our machine learning algorithms directly if we were to go this way.
One of the first surprises you experience when you dig into (the current generation of) machine learning techniques is that, under the hood, they’re largely just applied linear algebra. Nothing fancy or difficult, just good old matrices and vectors from high school mathematics.
I still remember with some fondness my linear algebra textbook from Maths II in my senior year (over 25 years ago?!). It was called Matrices and Vectors and it had a floppy, green cover with yellowish paper inside.
So my titular question has now become: Linear algebra—what language should I use?
The experiment
I implemented a typical machine learning problem in each language.
The linear algebra parts were done using NumPy for Python and math.js for JavaScript. Octave handles matrices natively, so it needs no extra library.
Let’s see how it turned out. We’ll compare three key parts of the solution in each language.
1. Processing training data
Assume the training data has been loaded into the variable data. This code separates the data into two column vectors and counts the number of training examples m.
Octave
X = data(:, 1);
y = data(:, 2);
m = length(y);
Python
X = data[:, 0:1]
y = data[:, 1:2]
m = len(y)
JavaScript
// m has to be computed first, because the ranges below depend on it
var m = math.size(data)[0];
var X = math.subset(data, math.index(math.range(0, m), 0));
var y = math.subset(data, math.index(math.range(0, m), 1));
You can see that Octave and Python look quite similar. The tricky part of the Python solution was to use slice indexing (0:1 instead of 0) to maintain rank-2 arrays. The JavaScript solution is very verbose in comparison.
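To make the rank-2 point concrete, here is a minimal NumPy sketch. The three-row data array is made up for illustration and is not the training set used in the experiment. It shows that an integer index drops an axis while a slice keeps it, and why that matters once vectors are subtracted.

```python
import numpy as np

# Hypothetical training data: each row is (feature, target)
data = np.array([[1.0, 2.0],
                 [2.0, 4.1],
                 [3.0, 5.9]])

flat = data[:, 0]    # integer index drops the axis: rank-1 array
X = data[:, 0:1]     # slice keeps the axis: rank-2 column vector
y = data[:, 1:2]
m = len(y)

print(flat.shape)        # (3,)
print(X.shape)           # (3, 1)

# Mixing the two ranks silently broadcasts to an m-by-m matrix,
# which is almost never what a cost function wants.
print((X - flat).shape)  # (3, 3)
```

With both X and y kept as column vectors, expressions such as h - y stay m-by-1, matching the Octave behaviour.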
2. Cost function
Now let’s examine a typical linear regression cost function in each language.
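For reference, the quantity each snippet computes is the usual squared-error cost for linear regression. It is written here in the standard textbook form, with h standing for the vector of predictions Xθ and m for the number of training examples, matching the variable names in the code:

```latex
J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( h^{(i)} - y^{(i)} \right)^2
          = \frac{1}{2m} \, (X\theta - y)^{\top} (X\theta - y)
```

The second form is the one the code implements: the sum of squares becomes the product of the error vector with its own transpose.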
Octave
function J = computeCost(X, y, theta)
  m = length(y);  % number of training examples (functions have their own scope)
  h = X * theta;
  err = h - y;
  J = 1 / (2 * m) * err' * err;
end
Python
import numpy as np

def computeCost(X, y, theta):
    m = len(y)  # number of training examples
    h = np.dot(X, theta)
    err = h - y
    return 1.0 / (2.0 * m) * np.dot(err.T, err)
JavaScript
function computeCost(X, y, theta) {
  var h = math.multiply(X, theta);
  var err = math.subtract(h, y);
  // The inner product is a 1x1 matrix, so scale it with math.multiply rather than *
  return math.multiply(1 / (2 * m), math.multiply(math.transpose(err), err));
}
The Octave solution is wonderfully concise and elegant.
The Python solution comes close. We use numpy’s array data type as opposed to its matrix data type (as recommended). The only downside of this is that we must resort to the function call dot() to perform matrix multiplication. This pollutes things somewhat and is a bit of a drag.
Once again the JavaScript solution is quite ugly. Every matrix operation requires a function call: multiply(), subtract(), transpose().
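As an aside, in Python 3.5 and later NumPy arrays also support the @ operator for matrix multiplication, which removes most of the dot() noise. The following is a minimal sketch rather than the code used in the experiment. The name compute_cost and the sample data are made up for illustration, and m is computed inside the function instead of being read from the enclosing scope.

```python
import numpy as np

def compute_cost(X, y, theta):
    """Linear regression cost, using @ for matrix multiplication."""
    m = len(y)             # number of training examples
    err = X @ theta - y    # (m, 1) column of residuals
    # err.T @ err is a 1x1 array; .item() unwraps it to a Python float
    return (err.T @ err).item() / (2 * m)

# Illustrative data: y = 2x exactly, so theta = 2 gives zero cost
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([[2.0], [4.0], [6.0]])

print(compute_cost(X, y, np.array([[2.0]])))  # 0.0
print(compute_cost(X, y, np.array([[0.0]])))  # (4 + 16 + 36) / 6
```

Written this way the Python version reads almost line for line like the Octave one.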
3. Gradient descent
Octave
function theta = gradientDescent(X, y, theta, alpha, num_iters)
  m = length(y);  % number of training examples
  for iter = 1:num_iters
    h = X * theta;
    err = h - y;
    theta_change = alpha / m * (X' * err);
    theta = theta - theta_change;
  end
end
Python
def gradientDescent(X, y, theta, alpha, num_iters):
    m = len(y)  # number of training examples
    for i in range(0, num_iters):
        h = np.dot(X, theta)
        err = h - y
        theta_change = alpha / m * np.dot(X.T, err)
        theta = theta - theta_change
    return theta
JavaScript
function gradientDescent(X, y, theta, alpha, num_iters) {
  for (var i = 0; i < num_iters; i++) {
    var h = math.multiply(X, theta);
    var err = math.subtract(h, y);
    var theta_change = math.multiply(alpha / m, math.multiply(math.transpose(X), err));
    theta = math.subtract(theta, theta_change);
  }
  return theta;
}
Very similar results to the cost function. Octave is the most elegant. Python is OK apart from that annoying dot() function call. And JavaScript is a hot mess.
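To see the update rule actually converge, here is a small self-contained Python sketch. Several things in it are assumptions made for illustration and are not taken from the experiment: the synthetic, noise-free data on the line y = 1 + 2x, the column of ones prepended to X so that the first parameter acts as an intercept, and the learning rate and iteration count. It also uses the @ operator (Python 3.5 and later) in place of dot().

```python
import numpy as np

def gradient_descent(X, y, theta, alpha, num_iters):
    """Batch gradient descent: theta <- theta - (alpha / m) * X^T (X theta - y)."""
    m = len(y)  # number of training examples
    for _ in range(num_iters):
        err = X @ theta - y
        theta = theta - (alpha / m) * (X.T @ err)
    return theta

# Synthetic, noise-free data on the line y = 1 + 2x (made up for this sketch)
x = np.linspace(0.0, 1.0, 20).reshape(-1, 1)
y = 1.0 + 2.0 * x

# Assumption: prepend a column of ones so theta[0] acts as the intercept
X = np.hstack([np.ones_like(x), x])

theta = gradient_descent(X, y, np.zeros((2, 1)), alpha=0.5, num_iters=5000)
print(theta.ravel())  # should approach [1, 2]
```

Because the data is noise-free, the recovered parameters should match the intercept and slope used to generate it, which makes this a convenient sanity check when porting the loop between languages.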
Conclusion
Octave has the simplest and cleanest syntax for performing linear algebra. It’s a great choice for learning, studying, and prototyping machine learning problems.
Python is close behind Octave in succinctness. It has other things going for it, however. It’s a mainstream programming language with a huge user base and massive library support—this makes it the go-to choice for machine learning in industry.
JavaScript is a clunky choice for performing linear algebra and machine learning. This hasn’t stopped motivated people from going ahead and doing it anyway, so your mileage may vary.
For us, then, it’s a toss-up between Octave and Python to build, train, and fine-tune our machine learning models. We’ll avoid JavaScript if we can.
At this stage there’s no way around using JavaScript for the Typefi Engine, so if it ever becomes necessary to embed machine learning code directly inside the Engine then we’re faced with a conundrum. I suspect that we’ll have to port our Octave/Python code to JavaScript at the very last minute.
However, once our Engineering team begins work in the machine learning space it’s very possible that we’ll find a better solution. We’re looking forward to giving it a go!
Originally published at https://www.typefi.com/machine-learning-language/