Machine learning is a ubiquitous tool in the modern world and a great way to predict future behavior from observed data. However, machine learning isn’t a magic bullet; it is only helpful when fed the correct data and calibrated the correct way. It’s important to know the baseline correctness for a problem, the quality of the data available for solving it, and which machine learning approach to use given the dataset and expected outcomes.
First, a baseline for the problem should be set, so the data scientist knows whether a solution is worthwhile. A machine learning algorithm that guesses a coin flip correctly 30% of the time is a bad algorithm, since always guessing heads would be correct 50% of the time. However, an algorithm that guesses the roll of a six-sided die correctly 30% of the time would be spectacular, since one could only expect about 16.7% success (1 in 6) from guessing. In general, the baseline for a machine learning problem is the highest success rate one could expect by always guessing the most likely outcome. If the algorithm is less successful than that guess, it’s time to find a new algorithm.
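As a rough illustration of the baseline idea, the sketch below computes the success rate of always guessing the most common outcome and compares a model's accuracy against it. The function names (`majority_baseline`, `beats_baseline`) and the sample data are made up for this example; the coin and die outcomes are idealized as perfectly balanced.

```python
from collections import Counter


def majority_baseline(labels):
    """Return the most common label and the accuracy of always guessing it."""
    counts = Counter(labels)
    most_common_label, count = counts.most_common(1)[0]
    return most_common_label, count / len(labels)


def beats_baseline(model_accuracy, labels):
    """True if the model does better than always guessing the top label."""
    _, baseline = majority_baseline(labels)
    return model_accuracy > baseline


# Hypothetical, perfectly balanced outcomes for a fair coin and a fair die.
coin_flips = ["H", "T"] * 50
die_rolls = [1, 2, 3, 4, 5, 6] * 20

print(majority_baseline(coin_flips)[1])  # baseline accuracy for the coin
print(majority_baseline(die_rolls)[1])   # baseline accuracy for the die
print(beats_baseline(0.30, coin_flips))  # 30% on a coin flip: worse than guessing
print(beats_baseline(0.30, die_rolls))   # 30% on a die roll: better than guessing
```

On real data the outcomes are rarely balanced, which is exactly when this check matters most: if 90% of observations share one label, a model needs to beat 90%, not 50%.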
A machine learning algorithm is useless without plentiful, relevant data. No matter how advanced a neural network may be, it won’t be able to predict tomorrow’s stock prices based on today’s NBA scores. The data scientist’s problem, then, is finding out whether the data is good. A good start is to visualize the data and see if there’s any loose trend relating the potential predictors to the predicted value: for each predictor, draw a scatterplot with the predictor as the X value and the predicted value as the Y value. If there’s some underlying relationship between a predictor and the predicted value, it will likely show up in the scatterplot, since the human eye is very good at identifying whether or not a pattern exists. Another approach is to try a few different types of quick regressions (logistic, linear, and exponential) to see if a single variable can achieve a decent R² (or comparable goodness-of-fit) value, meaning it explains some of the variance in the predicted value. Since most machine learning approaches are pretty flexible, the type of relationship isn’t as important as the fact that there is one.
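The quick single-variable regression check can be sketched as follows. This covers only the linear case, fitted by ordinary least squares, and uses small invented data; the predictor names (`hours_studied`, `shoe_size`) and the `r_squared` helper are hypothetical and exist only for illustration.

```python
def r_squared(xs, ys):
    """R² of a one-variable least-squares line fitted to (xs, ys)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual variation left after the fit vs. total variation in ys.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented example data: one plausible predictor and one irrelevant one.
predictors = {
    "hours_studied": [1, 2, 3, 4, 5, 6],
    "shoe_size": [9, 7, 10, 8, 9, 7],
}
exam_score = [52, 55, 61, 64, 70, 73]

for name, values in predictors.items():
    print(f"{name}: R^2 = {r_squared(values, exam_score):.3f}")
```

A predictor whose R² is close to zero under every quick fit is a hint, not proof, that it carries little signal; a relationship that is strongly non-linear can still score poorly here, which is why the scatterplot remains worth a look.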
Once the data has been confirmed as relevant, the appropriate machine learning model needs to be selected. It’s important to account for the number of observations, the number of predictors in each observation, and whether this is a regression problem (predicting a number) or a classification problem (predicting a category). It’s also important to correctly divide the data into training and testing sets: evaluating on data the model has never seen reveals whether it has overfit the training data and become useless on anything else. A simple Google search for the ideal machine learning technique based on the shape, size, and type of data will point the data scientist in the right direction.
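To make the training/testing split concrete, here is a minimal sketch using only the standard library. The `train_test_split` and `accuracy` helpers, the 80/20 split, and the toy data are assumptions for illustration (libraries such as scikit-learn provide their own splitting utilities). The "model" is deliberately silly: it memorizes the training rows, so it is an extreme case of overfitting.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle a copy of rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, rows):
    """Fraction of (feature, label) rows that predict() gets right."""
    return sum(predict(x) == y for x, y in rows) / len(rows)


# Toy data: the label is 1 when the feature is at least 50, otherwise 0.
rows = [(x, int(x >= 50)) for x in range(100)]
train, test = train_test_split(rows)

# An overfit "model": a lookup table of the training rows that falls back
# to guessing 0 for any feature value it has never seen.
memory = dict(train)


def memorizer(x):
    return memory.get(x, 0)


print("train accuracy:", accuracy(memorizer, train))  # perfect by construction
print("test accuracy: ", accuracy(memorizer, test))   # only the held-out rows
```

The training accuracy alone would make the memorizer look flawless; only the held-out rows show how it behaves on data it has not seen, which is the number that should be compared against the baseline.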
When using machine learning, it’s important to understand how to tweak and optimize algorithms, but it’s much more important to be able to recognize whether machine learning is a realistic solution to a problem. The set of problems that machine learning can solve grows every day, but there are still problems it cannot solve. Machine learning is simply one tool in the data scientist’s toolbox, and though it’s a powerful one, it shouldn’t be viewed as the be-all and end-all solution to every data problem.