Ian Mobbs • 21 May 2017
Okay, okay, sorry for the clickbait title. I guess a more proper introduction to this article would be “Neural Networks (for personal projects) are Overrated”. Before I took my first Machine Learning course at The University of Texas at Austin, I attempted to learn various ML techniques on my own. I watched Sirajology videos religiously. I tried retraining TensorFlow’s Inception-v3 network. I skimmed through Neural Networks and Deep Learning whenever I had spare time. I was looking for a practical rather than theoretical introduction, which these resources provided very well - but there was an issue. Each of these resources praised neural networks as the pinnacle of machine learning. While they may be a powerful tool, it’s not appropriate to use them for every single problem. Newcomers to machine learning need to learn the basics. Unless you’re working on a massive scale, tried-and-true algorithms (such as those found in scikit-learn) should almost always be used instead.
There’s much more to the world of machine learning than neural networks, and assuming a neural network should be used to solve every problem is a dangerous mindset to have. Here’s an excellent StackOverflow discussion on when to use Genetic Algorithms vs Neural Networks. If you want to read a little more about that, I suggest checking here. In addition to what’s been mentioned, neural networks are (by definition) the tool of choice for deep learning:
Deep learning is a class of machine learning algorithms that use a cascade of many layers of nonlinear processing units for feature extraction and transformation.
For most personal projects though, a neural network isn’t needed. I’ll demonstrate that by analyzing the dataset Homicide Reports, 1980-2014 from the Murder Accountability Project, who hosted their data on Kaggle.
The data contains 638,454 records of homicides across the country and the following information about each homicide:
# Imports
import time
import pandas as pd
import numpy
from sklearn.metrics import accuracy_score

# Inline graphics
%pylab inline

# Read data
df = pd.read_csv('database.csv')
df[:5]
Populating the interactive namespace from numpy and matplotlib
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/IPython/core/interactiveshell.py:2683: DtypeWarning: Columns (16) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
| ||Record ID||Agency Code||Agency Name||Agency Type||City||State||Year||Month||Incident||Crime Type||...||Victim Ethnicity||Perpetrator Sex||Perpetrator Age||Perpetrator Race||Perpetrator Ethnicity||Relationship||Weapon||Victim Count||Perpetrator Count||Record Source|
|0||1||AK00101||Anchorage||Municipal Police||Anchorage||Alaska||1980||January||1||Murder or Manslaughter||...||Unknown||Male||15||Native American/Alaska Native||Unknown||Acquaintance||Blunt Object||0||0||FBI|
|1||2||AK00101||Anchorage||Municipal Police||Anchorage||Alaska||1980||March||1||Murder or Manslaughter||...||Unknown||Male||42||White||Unknown||Acquaintance||Strangulation||0||0||FBI|
|2||3||AK00101||Anchorage||Municipal Police||Anchorage||Alaska||1980||March||2||Murder or Manslaughter||...||Unknown||Unknown||0||Unknown||Unknown||Unknown||Unknown||0||0||FBI|
|3||4||AK00101||Anchorage||Municipal Police||Anchorage||Alaska||1980||April||1||Murder or Manslaughter||...||Unknown||Male||42||White||Unknown||Acquaintance||Strangulation||0||0||FBI|
|4||5||AK00101||Anchorage||Municipal Police||Anchorage||Alaska||1980||April||2||Murder or Manslaughter||...||Unknown||Unknown||0||Unknown||Unknown||Unknown||Unknown||0||1||FBI|
5 rows × 24 columns
In order to demonstrate that other machine learning techniques can be just as powerful as neural networks, without the cost, we’re going to solve a simple classification problem using our data (most of these columns contain categorical data anyway, so it’d be pretty difficult to do any regression - but if anyone has a problem they’d like me to solve, I’d be happy to!).
For any crime, can we predict the race of the perpetrator based on other information?
Once you’ve identified your problem, the folks over at scikit-learn have created an excellent cheatsheet on how to pick an algorithm to use:
Following the chart, we’re going to use an SGD classifier. In order to classify a perpetrator’s race based on other data, I decided to look at the dataframe columns by hand and use the features that I deemed important. This is terrible practice. I did this because, due to the sheer volume of data, training on every feature and then isolating what’s important would’ve been far too difficult for my little computer. The point of this article is to highlight the difference in training times for neural networks and simpler algorithms to reach similar results - not to create a perfect classifier.
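Since the comparison here rests on wall-clock timings, a small helper can cut down on the repeated start/end bookkeeping. This is only a sketch: `timed` and `timings` are hypothetical names that do not appear in the original notebook, and the workload is a stand-in.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results=None):
    """Time the enclosed block, print the elapsed seconds, and optionally record them.

    `timed` and `results` are hypothetical names used only for illustration.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label}: {elapsed:.3f} s")


# Stand-in workload; in the notebook this would wrap fit() or predict()
timings = {}
with timed("Summing a million integers", timings):
    total = sum(range(1_000_000))
```

`time.perf_counter()` is used instead of `time.time()` because it is meant for measuring short durations, but either works for timings on the scale of seconds.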
To clean up and split our data, we do a little bit of preprocessing on our own and let patsy take care of the rest. We change our dataset to only contain homicides where the perpetrator’s race is known and the crime is solved. This is because whether or not a crime is solved has no effect on the perpetrator’s race (although the opposite may be true), and we can’t train on unknown data. After this, we write a formula using the isolated features, create some design matrices, and split our data into training and testing data.
from patsy import dmatrices
from sklearn.model_selection import train_test_split

# Isolate data where perpetrator's race is known and crime is solved
start = time.time()
data = df[(df['Perpetrator Race'] != 'Unknown') & (df['Crime Solved'] == 'Yes')]  # Race known, case solved - Training data
end = time.time()
print("Time taken separating data:", end - start)

# Create patsy formula using different information
geographicInfo = "City + State"
crimeInfo = "Q('Crime Type') + Weapon + Incident"
victimInfo = "Q('Victim Sex') + Q('Victim Age') + Q('Victim Race') + Q('Victim Ethnicity') + Relationship"
formula = "Q('Perpetrator Race') ~ 0 + " + " + ".join([geographicInfo, crimeInfo, victimInfo])

# Split data into design matrices
start = time.time()
_, X = dmatrices(formula, data, return_type='dataframe')
y = data['Perpetrator Race']
end = time.time()
print("Time taken creating design matrices:", end - start)

# Split data into training and testing data
start = time.time()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)
end = time.time()
print("Time taken splitting data:", end - start)
print("Total data size:", len(data))
print("Training data size:", len(X_train))
print("Testing data size:", len(X_test))

# Baseline: share of the most common class (always guessing the majority race)
baseline = data['Perpetrator Race'].value_counts(normalize=True).max()
print("Baseline accuracy:", baseline)
Time taken separating data: 0.21260476112365723
Time taken creating design matrices: 58.753602027893066
Time taken splitting data: 5.484054088592529
Total data size: 442123
Training data size: 296222
Testing data size: 145901
Baseline accuracy: 0.493432822993
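As a rough illustration of what a design matrix does with categorical columns, the sketch below expands one column into 0/1 indicator columns in plain Python. The helper `one_hot` is hypothetical, and the three records only reuse Weapon values from the sample rows shown earlier. patsy's actual column names and its handling of reference levels differ, so this shows the general idea and not patsy's exact output.

```python
def one_hot(rows, column):
    """Expand one categorical column into 0/1 indicator columns.

    `one_hot` is a hypothetical helper for illustration, not part of patsy.
    """
    # Every distinct value in the column becomes its own indicator column
    levels = sorted({row[column] for row in rows})
    return [
        {f"{column}[{level}]": int(row[column] == level) for level in levels}
        for row in rows
    ]


# Toy records reusing Weapon values from the sample rows
records = [
    {"Weapon": "Blunt Object"},
    {"Weapon": "Strangulation"},
    {"Weapon": "Blunt Object"},
]

encoded = one_hot(records, "Weapon")
for row in encoded:
    print(row)
```

A column such as City, with many distinct values, turns into a very wide block of indicator columns under this scheme.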
Creating an SGD classifier with scikit-learn is incredibly easy - as you can see, it takes three lines of code to instantiate, train, and predict with your classifier. Its performance is lacking (but again - our features were picked not to maximize accuracy, but to compare accuracy in tandem with training times). While 82% accuracy is low, it’s roughly 33 percentage points better than our baseline of about 49%. The 17-second training time on 296,000 rows of data is impressive.
from sklearn import linear_model

# Instantiate and train the classifier
start = time.time()
classifier = linear_model.SGDClassifier()
classifier.fit(X_train, y_train)
end = time.time()
print("SGDClassifier Training Time:", end - start)

# Predict on the held-out test data
start = time.time()
predictions = classifier.predict(X_test)
end = time.time()
print("SGDClassifier Prediction Time:", end - start)

# accuracy_score expects the true labels first, then the predictions
print("SGDClassifier Accuracy:", accuracy_score(y_test, predictions))
SGDClassifier Training Time: 17.512683868408203
SGDClassifier Prediction Time: 0.7582650184631348
SGDClassifier Accuracy: 0.827568008444
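To put the accuracy in context, here is a minimal sketch of a majority-class baseline, assuming the baseline figure printed earlier is the share of the most common class (that is, the accuracy of always guessing the majority label). The helper `majority_baseline` and the toy labels are made up for illustration; the two accuracy figures are copied from the outputs above.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)


# Made-up labels, only to show the mechanics
toy_labels = ["A", "A", "A", "B", "B", "C"]
label, baseline = majority_baseline(toy_labels)
print(label, baseline)

# Gap between the reported SGD accuracy and the reported baseline,
# expressed in percentage points
sgd_accuracy = 0.827568008444
reported_baseline = 0.493432822993
gain_points = (sgd_accuracy - reported_baseline) * 100
print(f"{gain_points:.1f} percentage points over baseline")
```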
In order for the neural network (which you can read about here) to train in a somewhat timely manner on my MacBook Pro, I had to stifle its capabilities significantly by adjusting the size of its hidden layers and the number of iterations. Even after these customizations though, training took 387 seconds - around 22 times longer than the SGDClassifier, with only a 4.5 percentage point accuracy increase.
from sklearn.neural_network import MLPClassifier

# Instantiate and train a deliberately small network
start = time.time()
classifier = MLPClassifier(hidden_layer_sizes=(10,), max_iter=100)
classifier.fit(X_train, y_train)
end = time.time()
print("Neural Network Training Time:", end - start)

# Predict on the held-out test data
start = time.time()
predictions = classifier.predict(X_test)
end = time.time()
print("Neural Network Prediction Time:", end - start)
print("Neural Network Accuracy:", accuracy_score(predictions, y_test))
Neural Network Training Time: 387.71303701400757
Neural Network Prediction Time: 1.6371817588806152
Neural Network Accuracy: 0.872084495651
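The trade-off can be worked out directly from the timings and accuracies printed above. The variable names below are my own; the numbers are copied from the two output blocks.

```python
# Figures copied from the SGDClassifier and neural network outputs above
sgd_train_s = 17.512683868408203
mlp_train_s = 387.71303701400757
sgd_acc = 0.827568008444
mlp_acc = 0.872084495651

# How many times longer the neural network took to train
slowdown = mlp_train_s / sgd_train_s

# Accuracy gained by the neural network, in percentage points
acc_gain_points = (mlp_acc - sgd_acc) * 100

# Extra training seconds paid per percentage point of accuracy gained
seconds_per_point = (mlp_train_s - sgd_train_s) / acc_gain_points

print(f"Slowdown: {slowdown:.1f}x")
print(f"Accuracy gain: {acc_gain_points:.2f} percentage points")
print(f"Extra training time per point: {seconds_per_point:.0f} s")
```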
The neural network was slightly more accurate, but for this problem the SGD classifier offered the better trade-off: it trained in a fraction of the time while giving up only about 4.5 percentage points of accuracy. I hope this quick article goes to show that there’s more to Machine Learning than neural networks, and that when solving an ML problem, all options should be considered. If you’re looking for a practical introduction to Machine Learning, the book I used (that I highly recommend) is Müller and Guido’s “Introduction to Machine Learning with Python: A Guide for Data Scientists”.
Andreas Mueller commented on the version of my article hosted on The Practical Dev, and his comment was very insightful - I’d like to share it here:
Cool article and an important point. Thanks for recommending our book! Neural networks are certainly not a cure-all, though it’s tricky business to compare different algorithms on the same data because there are so many hyper parameters. For example increasing the number of iterations or using the sag solver might have improved the linear model, but take longer. Similarly a larger hidden layer (or changing any of the other tuning parameters) might have positive effects for the neural network. I think the main takeaway should be: never try neural networks first. Start with something simple and try complex models later if the gain in accuracy justifies for the added complexity in your application.
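One way to act on that takeaway is to time a simple model and a more complex one on the same split and look at both numbers before committing to either. The sketch below does this on synthetic data from `make_classification`, not on the homicide dataset, so the timings and accuracies it prints say nothing about the results above. The dataset size, `random_state` values, and the `candidates` and `results` names are all assumptions chosen for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Small synthetic problem so the sketch runs in seconds (an assumption, not the article's data)
X, y = make_classification(n_samples=2000, n_features=20, n_informative=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=1)

# Simple model first, then the more complex one
candidates = {
    "SGDClassifier": SGDClassifier(random_state=1),
    "MLPClassifier": MLPClassifier(hidden_layer_sizes=(10,), max_iter=100, random_state=1),
}

results = {}
for name, model in candidates.items():
    start = time.perf_counter()
    model.fit(X_train, y_train)
    train_seconds = time.perf_counter() - start
    accuracy = accuracy_score(y_test, model.predict(X_test))
    results[name] = (train_seconds, accuracy)
    print(f"{name}: {train_seconds:.2f} s to train, accuracy {accuracy:.3f}")
```

With `max_iter=100` the network may stop before converging and emit a `ConvergenceWarning`. As the comment above notes, both models have many hyperparameters, so a single run like this is a starting point and not a verdict.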