Machine Learning Cheat Sheet
Machine learning is the practice of having algorithms learn processes from data without being explicitly programmed. As part of artificial intelligence (AI), machine learning accesses data and learns from it by itself. It’s a fascinating field of study that can even be used to predict future events based on past data. Predictive analytics, deep learning, algorithms, and supervised and unsupervised learning are all part of machine learning. Many advancements in AI are due to machine learning algorithms. Examples include the recommendations you see on YouTube, Google and other major sites, which track data such as clicks, likes and interests across frequently visited websites. In this way, the algorithm “learns” what you like and provides recommendations.
However, it’s easy to get lost when reading about machine learning, as it’s a big field that takes time to learn. Even after learning about it, there is more to learn, as machine learning develops over time due to how fast technology progresses, especially in recent decades. The following machine learning cheat sheet may prove helpful in learning the basics or refreshing your memory on certain terms.
With datascienceacademy.io, you can learn even more about machine learning techniques to advance in your data science and AI/ML career.
The following includes definitions of common machine learning terms.
Accuracy
Accuracy is the percentage of correct predictions made by a model.
Algorithm
A function, method or series of commands used to create a machine learning model. Examples of an algorithm are neural networks, linear regression, support vector machines and decision trees.
Attribute
A quality describing an observation (e.g. color, size, weight). Attributes are column headers in Excel terms.
Bias metric
What is the average difference between your predictions and the correct value for that observation?
- Low bias could mean every prediction is correct. It could additionally mean part of your predictions are above their actual values and part are below, in equal proportion, which results in a low average difference.
- High bias (with low variance) suggests your model may be underfitting and you’re using the wrong architecture for the job.
Bias term
Bias terms allow models to represent patterns that do not pass through the origin. For example, if all of the features were 0, would the output also be zero? Or is it probable that there is some base value upon which the features have an effect? Bias terms typically accompany weights and are attached to filters or neurons.
Categorical Variables
Variables with a discrete set of possible values. Can be ordinal (order matters) or nominal (order doesn’t matter).
Classification
Predicting a categorical output.
- Binary classification predicts one of two possible outcomes (whether the email is spam or not)
- Multi-class classification predicts one of the multiple possible outcomes (Is this photo depicting a human, cat or dog?)
Classification Threshold
The lowest probability value where we’re comfortable stating a positive classification. For example, if the predicted probability of being diabetic is > 50%, return True, otherwise, return False.
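A minimal sketch of applying a classification threshold in Python. The function name `classify` and the example probabilities are illustrative assumptions, not part of any library.

```python
def classify(probabilities, threshold=0.5):
    """Return True for each probability strictly above the threshold."""
    return [p > threshold for p in probabilities]


# Hypothetical predicted probabilities of being diabetic.
predicted = [0.10, 0.50, 0.51, 0.90]

default_labels = classify(predicted)               # threshold of 50%
strict_labels = classify(predicted, threshold=0.8)  # a more cautious threshold

print(default_labels)
print(strict_labels)
```

Raising the threshold makes the model more reluctant to return a positive classification, which usually trades recall for precision.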
Clustering
Unsupervised grouping of data into buckets.
Confusion Matrix
The confusion matrix is a table that describes the performance of a classification model by grouping predictions into four categories.
- True Positives: we correctly predicted they do have diabetes
- True Negatives: we correctly predicted they don’t have diabetes
- False Positives: we incorrectly predicted they do have diabetes (Type I error)
- False Negatives: we incorrectly predicted they don’t have diabetes (Type II error)
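The four categories can be counted directly from labels and predictions. The sketch below assumes boolean labels (True means "has diabetes"); the function name `confusion_counts` and the sample data are hypothetical.

```python
def confusion_counts(actual, predicted):
    """Count TP, TN, FP and FN for boolean labels and predictions."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, guess in zip(actual, predicted):
        if guess and truth:
            counts["TP"] += 1
        elif not guess and not truth:
            counts["TN"] += 1
        elif guess and not truth:
            counts["FP"] += 1  # Type I error
        else:
            counts["FN"] += 1  # Type II error
    return counts


# Hypothetical ground truth and model predictions.
actual = [True, True, False, False, True, False]
predicted = [True, False, False, True, True, False]

counts = confusion_counts(actual, predicted)
print(counts)
```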
Continuous Variables
Variables with a range of possible values that are defined by a number scale (lifespan, sales, etc.).
Convergence
A state reached during the training of a model when the loss changes very little between each iteration.
Deduction
A top-down approach to answering questions or solving problems. It’s a logic practice that begins with a theory and tests the theory with observations to form a conclusion. We might suspect something is true, so we test it to see if it’s true or not.
Deep Learning
Deep learning grew out of a machine learning algorithm called the perceptron, or multi-layer perceptron, and is rising in popularity because of its success in fields ranging from computer vision and signal processing to medical diagnosis and self-driving cars. Many of the algorithms behind deep learning are decades old, but more data and cheap computing power now make them powerful enough to reach very high accuracy on some tasks. These models are commonly known as artificial neural networks, though deep learning is more than just a traditional artificial neural network.
Dimension
Dimension means something different in machine learning and data science than it does in physics. The dimension of data is the number of features in your data set. For example, in an object detection application, the flattened image size and color channels (e.g. 28*28*3) give the number of features of each input. In house price prediction, house size could be the only feature, so we call it one-dimensional data.
Epoch
An epoch is one complete pass of the algorithm over the entire data set. The number of epochs is the number of times the algorithm looks at the entire data set.
Extrapolation
Extrapolation is making forecasts outside the range of a dataset (e.g. my cat meows, so all cats must meow). We often run into trouble in machine learning when we extrapolate outside of our training data range.
False Positive Rate
Defined as
$$FPR = 1 - \text{Specificity} = \frac{\text{FalsePositives}}{\text{FalsePositives} + \text{TrueNegatives}}$$
The false positive rate creates the x-axis of the ROC curve.
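The relationship between the false positive rate and specificity can be checked numerically. This is a small sketch with made-up counts; the function names are illustrative.

```python
def false_positive_rate(false_positives, true_negatives):
    """FPR = FP / (FP + TN)."""
    return false_positives / (false_positives + true_negatives)


def specificity(true_negatives, false_positives):
    """Specificity = TN / (TN + FP)."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical counts: 100 truly negative observations, 5 flagged as positive.
fpr = false_positive_rate(false_positives=5, true_negatives=95)
spec = specificity(true_negatives=95, false_positives=5)

print(fpr, 1 - spec)  # the two values agree up to floating-point rounding
```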
Feature
For a dataset, a feature symbolizes a value and attribute combination. Color is an attribute. “Color is blue” is a feature. In Excel terms, features are similar to cells. The term feature has various other meanings in different contexts.
Feature Selection
Feature selection is the process of selecting relevant features from a data-set for creating a Machine Learning model.
Feature Vector
A list of features defining an observation with multiple attributes. We call this a row in Excel.
Gradient Accumulation
A technique that splits the batch of samples used for training a neural network into several mini-batches that are run consecutively. The gradients from each mini-batch are accumulated, and the weights are updated only once all mini-batches have been processed. This allows the use of large batch sizes that would otherwise require more GPU memory than is available.
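A minimal sketch of gradient accumulation, assuming a one-weight linear model `y = w * x` trained with mean squared error. The model, data and names are illustrative assumptions; real frameworks do the same bookkeeping on tensors. The point is that the accumulated mini-batch gradients match the full-batch gradient.

```python
def grad_sum(w, xs, ys):
    """Sum of per-sample gradients of (w*x - y)**2 with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys))


# Hypothetical data set generated by y = 2 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 0.5
n = len(xs)

# Gradient of the mean squared error computed on the full batch at once.
full_grad = grad_sum(w, xs, ys) / n

# The same gradient, accumulated over mini-batches of size 2.
mini_batch_size = 2
accumulated = 0.0
for start in range(0, n, mini_batch_size):
    xb = xs[start:start + mini_batch_size]
    yb = ys[start:start + mini_batch_size]
    accumulated += grad_sum(w, xb, yb) / n  # scale by the full batch size

# A single weight update happens only after all mini-batches are processed.
learning_rate = 0.01
w_updated = w - learning_rate * accumulated

print(full_grad, accumulated, w_updated)
```

Only one mini-batch needs to be held in memory at a time, which is why the technique helps when the full batch does not fit on the GPU.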
Hyperparameters
Hyperparameters are high-level properties of a model that are set before training rather than learned, such as how fast it can learn (the learning rate) or how complex the model is. The tree depth in a decision tree and the number of hidden layers within neural networks are hyperparameter examples.
Induction
A bottom-up approach to answering questions or solving problems. Induction is a logical method that moves from observations to theory. If we observe that something holds in the cases we have seen, we infer that the hypothesis is generally true.
Instance
A row, data point or sample within a dataset. It’s also another name for observation.
Label
The Label is the answer part of observation in supervised learning. For example, in a dataset used to classify trees into different species, the features might include the heights and width of trees, while the label would be the tree species.
Learning Rate
The size of the update steps to take during optimization loops, such as gradient descent. With a high learning rate, we cover more ground in each step, but we risk overshooting the lowest point, as the slope of the hill changes constantly. With a low learning rate, we can move with confidence in the direction of the negative gradient, as we are recalculating it often. A low learning rate is more precise, but calculating the gradient takes time, so it will take a while to get to the bottom.
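The effect of the learning rate can be seen on a toy problem. The sketch below assumes the function f(x) = x**2 (gradient 2*x, minimum at 0); the function name and the chosen rates are illustrative.

```python
def gradient_descent(start, learning_rate, steps):
    """Minimize f(x) = x**2, whose gradient is 2*x."""
    x = start
    for _ in range(steps):
        x -= learning_rate * 2 * x  # step in the direction of the negative gradient
    return x


slow = gradient_descent(start=10.0, learning_rate=0.001, steps=20)
good = gradient_descent(start=10.0, learning_rate=0.1, steps=20)
too_high = gradient_descent(start=10.0, learning_rate=1.1, steps=20)

print(slow)      # barely moved from the starting point
print(good)      # close to the minimum at 0
print(too_high)  # overshoots on every step and moves away from the minimum
```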
Loss
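Loss measures the difference between the true value (from the dataset) and the predicted value (from the ML model). In its simplest form, for a single example:
Loss = true_value - predicted_value
In practice, a loss function such as squared error is applied to this difference so that positive and negative errors do not cancel out.
A lower loss means it’s a better model (unless the model has overfitted to the training data). The loss is calculated on training and validation, and its interpretation is how the model is performing for these two sets. Loss is not a percentage, contrary to accuracy. It is a summation of the errors made for each example in the training or validation sets.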
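As one concrete example, here is a sketch of mean squared error, a commonly used loss. The choice of this particular loss is an assumption for illustration (the text above does not name one), and the values are made up.

```python
def mean_squared_error(true_values, predicted_values):
    """Average of the squared differences between true and predicted values."""
    errors = [t - p for t, p in zip(true_values, predicted_values)]
    return sum(e ** 2 for e in errors) / len(errors)


# Hypothetical targets and model predictions.
true_values = [3.0, 5.0, 2.0]
predicted_values = [2.0, 5.0, 4.0]

loss = mean_squared_error(true_values, predicted_values)
perfect = mean_squared_error(true_values, true_values)

print(loss, perfect)
```

Squaring the errors keeps an overestimate and an underestimate from cancelling each other out.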
Machine Learning
The study of algorithms that learn processes without being explicitly programmed to do so. It is a field in which an algorithm can even predict future events based on observing past data.
Model
Models are data structures that store a representation of a dataset (weights and biases). Models are created and learned when you train an algorithm on a dataset.
Neural Networks
Neural networks are mathematical algorithms inspired by the brain’s architecture. They are designed to recognize relationships and patterns in data.
Normalization
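Rescaling of values, such as input features, to a common range (for example 0 to 1) so that no single feature dominates and optimization runs faster. The term is sometimes used loosely for the restriction of weight values in regression to prevent overfitting, which is more precisely called regularization.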
Noise
Any unrelated information or randomness in a dataset that obscures the underlying pattern.
Null Accuracy
Baseline accuracy that is achieved by always predicting the most frequent class (a class has a high frequency, so it’s chosen for every prediction).
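A short sketch comparing a model's accuracy with the null accuracy. The labels, predictions and function names are hypothetical.

```python
from collections import Counter


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


def null_accuracy(actual):
    """Accuracy obtained by always predicting the most frequent class."""
    most_common_count = Counter(actual).most_common(1)[0][1]
    return most_common_count / len(actual)


# Hypothetical imbalanced data set: 8 "no" labels and 2 "yes" labels.
actual = ["no", "no", "no", "yes", "no", "yes", "no", "no", "no", "no"]
predicted = ["no", "no", "yes", "yes", "no", "no", "no", "no", "no", "no"]

model_acc = accuracy(actual, predicted)
baseline_acc = null_accuracy(actual)

print(model_acc, baseline_acc)
```

Here the model's accuracy looks respectable, but it is no better than the baseline of always answering "no".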
Observation
A row, data point or sample in a dataset. Observation is another term for instance.
Outlier
An observation that differs considerably from other observations in the dataset.
Overfitting
Overfitting happens when your model learns the training data too well and incorporates details and noise particular to your dataset. A model is overfitting when it performs great on the training/validation set, but badly on the test set.
Parameters
Parameters are values learned from the training data by training a machine learning model or classifier. They are adjusted using optimization algorithms and are unique to each experiment.
Examples of parameters include:
- weights in an artificial neural network
- support vectors in a support vector machine
- coefficients in a logistic or linear regression
Precision
In binary classification (yes or no), precision measures the model’s performance at classifying positive observations (i.e. “Yes”). It answers the question: If a positive value is predicted, how often is the prediction correct? We could game this metric by only returning positive for the one observation we are most confident is true.
$$P = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalsePositives}}$$
Recall
Recall is sometimes called sensitivity. In binary classification (yes or no), recall finds out how “sensitive” the classifier is at finding positive instances. For all the true observations within the sample, how many did we actually find? We could game this metric by always categorizing observations as positive.
$$R = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalseNegatives}}$$
Recall vs Precision
Suppose we are analyzing 100 brain scans, 10 of which show a tumor, and trying to predict whether each person has a tumor (true) or not (false). We feed the scans into our model and the model starts guessing.
- Precision is the % of true guesses that were right. If we guess that only one image out of the 100 is true and that image is actually true, then our precision is 100%. Our results aren’t helpful though, since we missed the other 9 brain tumors. We were very precise when we tried, but we didn’t try hard enough.
- Recall, or sensitivity, is the % of actual positives that we found. In the same example, there are 10 images with brain tumors and we correctly guessed only one of them. Precision is 100%, but recall is 10%. Perfect recall requires that we catch all 10 tumors.
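The brain scan example can be worked through in a few lines. The counts follow the scenario of 100 scans with 10 tumors and one correct positive guess; the function names are illustrative.

```python
def precision(true_positives, false_positives):
    """Of everything predicted positive, how much was actually positive?"""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Of everything actually positive, how much did we find?"""
    return true_positives / (true_positives + false_negatives)


# One scan flagged, and it really is a tumor; the other 9 tumors are missed.
tp, fp, fn = 1, 0, 9

p = precision(tp, fp)
r = recall(tp, fn)

print(p, r)
```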
Regression
Predicting a continuous output, such as price or sales.
Regularization
Regularization is a method used to prevent the common overfitting problem. This is accomplished by adding a complexity term to the loss function that gives a larger loss for more complex models.
Reinforcement Learning
Training a model to maximize a reward through trial and error.
ROC (Receiver Operating Characteristic) Curve
A plot of the true positive rate against the false positive rate at all classification thresholds. The ROC curve evaluates the performance of a classification model at various classification thresholds. The area under the ROC curve can be viewed as the probability that the model ranks a randomly chosen positive observation above a randomly chosen negative observation.
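The probabilistic reading of the area under the ROC curve suggests a direct way to compute it: compare every positive observation with every negative one and count how often the positive gets the higher score. The sketch below assumes boolean labels and counts ties as half a win; the function name and data are hypothetical.

```python
def auc_by_ranking(labels, scores):
    """Fraction of (positive, negative) pairs where the positive scores higher."""
    positives = [s for label, s in zip(labels, scores) if label]
    negatives = [s for label, s in zip(labels, scores) if not label]
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted scores.
labels = [True, True, False, True, False, False]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

auc = auc_by_ranking(labels, scores)
print(auc)
```

A value of 1.0 would mean every positive is ranked above every negative, while 0.5 corresponds to random ranking.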
Segmentation
It is the method of separating a data set into many distinct sets. This separation is done such that the members of the same set are similar to each other and different from the members of other sets.
Specificity
In the context of binary classification (Yes/No), specificity measures the model’s performance at classifying negative observations (i.e. “No”). When the correct label is negative, how often is the prediction correct?
$$S = \frac{\text{TrueNegatives}}{\text{TrueNegatives} + \text{FalsePositives}}$$
Supervised Learning
Training a model using a labeled dataset.
Test Set
A group of observations utilized at the end of model training and validation to find the predictive power of the model. How will the model react to unseen data?
Training Set
A group of observations used to create machine learning models.
Transfer Learning
A machine learning technique where a model created for one task is reused as the starting point for a model on a second task. In transfer learning, you take the pre-trained weights of an already trained model (for example, one that has been trained with millions of images belonging to thousands of classes on several high-power GPUs for several days) and use the features it has learned to predict new classes.
True Positive Rate
Another term for recall.
$$TPR = \frac{\text{TruePositives}}{\text{TruePositives} + \text{FalseNegatives}}$$
The true positive rate creates the y-axis of the ROC curve.
Type 1 Error
These are false positives. Consider a company optimizing hiring practices to reduce false positives in job offers. A type 1 error occurs when a candidate seems good and the company hires them, but they turn out to be a poor fit.
Type 2 Error
These are false negatives. The candidate was great, but the company passed on them.
Underfitting
Underfitting occurs when your model over-generalizes and fails to incorporate relevant variations in your data that would give your model more predictive power. You can tell a model is underfitting when it performs poorly on both training and test sets.
Universal Approximation Theorem
A neural network with a single hidden layer can approximate any continuous function, but only for inputs in a particular range. If you train a network on inputs between -10 and 10, then it will work well for inputs in that same range, but it won’t generalize to other inputs without retraining the model or adding more hidden neurons.
Unsupervised Learning
Training a model to search for patterns in an unlabeled dataset (e.g. clustering).
Validation Set
A group of observations utilized during model training to form feedback on how well the current parameters generalize beyond the training set. If training error decreases but validation error increases, your model is probably overfitting and you should stop training.
Variance
How closely packed are the predictions for a certain observation relative to each other?
- Low variance suggests your model is internally consistent, with predictions varying little from each other after every iteration.
- High variance suggests your model may be overfitting and reading too deeply into the noise found in every training set.
Linear Regression
Linear regression is a supervised machine learning algorithm that is important to understand. There are two types: simple regression and multivariable regression.
Polynomial
Polynomial regression is a modified form of linear regression where the current features are mapped to a polynomial form. The problem remains a linear regression problem, but the input vector is now mapped to a higher dimensional vector that acts as a pseudo-input vector.
$$x = (x_0, x_1) \rightarrow x' = (x_0, x_0^2, x_1, x_1^2, x_0 x_1)$$
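The feature mapping for a two-feature input can be written as a small function. This is a sketch of the degree-2 mapping only; the function name is an assumption.

```python
def polynomial_features(x0, x1):
    """Map (x0, x1) to the pseudo-input (x0, x0^2, x1, x1^2, x0*x1)."""
    return (x0, x0 ** 2, x1, x1 ** 2, x0 * x1)


# Hypothetical two-feature observation.
mapped = polynomial_features(2.0, 3.0)
print(mapped)
```

A linear regression fitted on the five mapped values is still linear in its weights, even though it is quadratic in the original two features.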
Lasso
Lasso regression attempts to minimize the ordinary least squares error much like vanilla regression, but it also adds an extra term: the $L_1$ norm of the weights multiplied by a hyperparameter $\alpha$. This decreases model complexity and prevents overfitting (as discussed earlier).
$$l = \sum_{i=1}^{n} (y_i - \tilde{y}_i)^2 + \alpha \sum_{j=1}^{p} |w_j|$$
Ridge
Ridge regression is like lasso regression, but the regularization term uses the squared $L_2$ norm of the weights instead.
$$l = \sum_{i=1}^{n} (y_i - \tilde{y}_i)^2 + \alpha \sum_{j=1}^{p} w_j^2$$
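The two regularized losses differ only in the penalty term, which the following sketch makes explicit. It evaluates the losses for given predictions and weights rather than fitting a model; the function names and numbers are illustrative assumptions.

```python
def squared_error(y_true, y_pred):
    """Sum of squared differences between targets and predictions."""
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred))


def lasso_loss(y_true, y_pred, weights, alpha):
    """Squared error plus alpha times the L1 norm of the weights."""
    return squared_error(y_true, y_pred) + alpha * sum(abs(w) for w in weights)


def ridge_loss(y_true, y_pred, weights, alpha):
    """Squared error plus alpha times the squared L2 norm of the weights."""
    return squared_error(y_true, y_pred) + alpha * sum(w ** 2 for w in weights)


# Hypothetical targets, predictions and model weights.
y_true = [1.0, 2.0, 3.0]
y_pred = [1.5, 2.0, 2.0]
weights = [0.5, -2.0]
alpha = 0.1

lasso = lasso_loss(y_true, y_pred, weights, alpha)
ridge = ridge_loss(y_true, y_pred, weights, alpha)

print(lasso, ridge)
```

Because the ridge penalty squares each weight, it punishes the large weight (-2.0) more heavily than the lasso penalty does.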
Neural Networks
Neural networks are a class of machine learning algorithms that model complex patterns in a dataset using many hidden layers and non-linear activation functions. A neural network takes an input, sends it through multiple layers of hidden neurons (mini-functions with special coefficients that must be learned) and outputs a prediction representing the combined input of all the neurons.
Neural networks are trained using optimization techniques, such as gradient descent. After each training cycle, an error metric is computed based on the difference between prediction and target. The derivatives of this error metric are calculated and propagated back through the network using a method called backpropagation. Each neuron’s coefficients (weights) are then adjusted relative to how much they contributed to the total error. This process is repeated until the network error drops below an acceptable threshold.
Neurons
A neuron uses a collection of weighted inputs, applies an activation function and returns an output.
Inputs to a neuron can either be features from a training set or outputs from a previous layer’s neurons. Weights are applied to the inputs as they travel along synapses to reach the neuron. The neuron then applies an activation function to the “sum of weighted inputs” from each inbound synapse and passes the result on to all the neurons in the next layer.
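A single neuron can be sketched in a few lines. The sigmoid activation is an assumption chosen for illustration (the text does not name a specific activation function), and the bias argument is the additional constant added to the weighted sum before the activation is applied.

```python
import math


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias=0.0):
    """Apply the activation function to the sum of weighted inputs plus bias."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights)) + bias
    return sigmoid(weighted_sum)


# Hypothetical inputs and weights: 1.0 * 0.5 + 2.0 * (-0.25) = 0.0
output = neuron(inputs=[1.0, 2.0], weights=[0.5, -0.25])

# With all inputs at zero, only the bias determines the output.
biased_output = neuron(inputs=[0.0, 0.0], weights=[1.0, 1.0], bias=2.0)

print(output, biased_output)
```

In a full network, the value returned here would be passed on as an input to every neuron in the next layer.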
Synapse
Synapses are like roads in a neural network. They connect inputs to neurons, neurons to neurons, and neurons to outputs. To get from one neuron to another, a signal has to travel along the synapse, paying the “toll” (weight) along the way. Each connection between two neurons has its own synapse with its own weight attached to it.
Weights
Weights are values that control the strength of the connection between two neurons. Inputs are usually multiplied by weights, and that defines how much influence the input will have on the output. In other words, when the inputs are transferred between neurons, the weights are applied to the inputs along with an additional value (the bias).
Bias
Bias terms are additional constants attached to neurons and added to the weighted input before the activation function is applied. Bias terms help models represent patterns that do not necessarily pass through the origin.
Connect with our experts if you want to know which Data Science Training is best suitable for your career to become a Machine Learning expert.
We hope you found this machine learning cheat sheet useful. Machine learning is an exciting field that is revolutionizing the world. Many companies like Google, YouTube, Netflix and Apple use aspects of machine learning, like algorithms, in their everyday practices in order to automate their processes through artificial intelligence. Machine learning is important to the future of technology development.