Machine Learning (ML)#
Machine Learning is a branch of Artificial Intelligence that focuses on models and algorithms that let computers learn from data and improve from previous experience without being explicitly programmed. There are many types of machine learning.
Note
Some methods fit into multiple categories or can be adapted to be used for other categories. For the sake of brevity, these cases are not always mentioned here.
Note
I’m not sure how Neural Networks, Deep Learning, AutoEncoders, DenseNets, etc. fit into these categories. They may span multiple categories, or perhaps these categories are better suited to more traditional ML techniques.
Categories#
Supervised Learning - Use labeled data.
Classification - Predict categorical (discrete) values.
Regression - Predict continuous numerical values.
Classification|Regression - Some models can perform either Classification or Regression.
Ensemble Learning - Combine multiple models of either type into one better model.
Bagging (Bootstrap Aggregating) Method - Train models independently on different subsets of the data, then combine their predictions.
Boosting Method - Train models sequentially, each model focusing on errors of prior models, then do weighted combination of their predictions.
Stacking (Stacked Generalization) Method - Train multiple different models (often of different types), then use their predictions as inputs to a final “meta-model”.
Unsupervised Learning - Use unlabeled data.
Clustering - Group data into clusters based on similarity.
Centroid-Based (Partitioning) Clustering - cluster around centroids of points; the number of clusters is chosen in advance.
Distribution-Based Clustering - cluster by mixture of probability distributions.
Connectivity-Based (Hierarchical) Clustering - cluster with tree-like nested groupings by connections between points.
Density-Based Clustering - clusters as contiguous regions of high data density separated by areas of lower density.
Dimensionality Reduction - Simplify datasets by reducing features while keeping important information (often used to select features for other models).
Association Rule Mining - Discover rules where the presence of one item in a dataset makes the presence of another more likely.
Reinforcement Learning - Agent learns by interacting with environment via trial and error and receiving reward feedback.
Model-Based Methods - interact with a simulated model of the environment, helping the agent plan actions by simulating potential results.
Model-Free Methods - interact with the actual environment, learning directly from experience.
Forecasting Models - Use past data to predict future trends (often time series problems).
Semi-Supervised Learning - Use some labeled data with more unlabeled data.
Self-Supervised Learning - Generates its own labels from unlabeled data.
Supervised Learning#
Classification#
KNN (K-Nearest Neighbors) - simple, looks at closest data points (neighbors) to make predictions based on similarity
Logistic Regression - passes a weighted sum of the inputs through a sigmoid curve to get a probability between 0 and 1, then classifies by thresholding it (e.g. at 0.5). Despite “Regression” being in the name, it’s for Classification
Single-Layer Perceptron - the simplest neural network: a single layer of weights feeding one output neuron, giving a linear binary classifier
SGD (Stochastic Gradient Descent) Classifier - adjusts model parameters in the direction of steepest descent (against the loss function’s gradient), using one sample or a small batch at a time
Naive Bayes (Gaussian, Multinomial, Bernoulli, Complement) - predicts the most probable category for a data point using Bayes’ theorem, naively assuming the features are independent
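To make this concrete, here’s a minimal scikit-learn sketch that fits a few of the classifiers above on the built-in iris dataset and scores them on held-out data (the dataset and hyperparameters are arbitrary illustrations, not recommendations):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "KNN": KNeighborsClassifier(n_neighbors=5),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gaussian Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_train, y_train)               # learn from labeled data
    print(name, model.score(X_test, y_test))  # accuracy on held-out data
```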
Regression#
Linear Regression - fit a straight line to the data with the Least Squares Method
Multiple Linear Regression - Extends Linear Regression to use multiple input variables
Polynomial Regression - a polynomial curve fit.
Lasso Regression (L1 Regularization) - regularized linear regression that avoids overfitting by penalizing the absolute value of large coefficients
Ridge Regression (L2 Regularization) - regularized linear regression that avoids overfitting by penalizing the square of large coefficients
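A quick sketch of plain vs. regularized linear regression on synthetic data (the coefficients, noise level, and alpha values are made up for illustration). Note how Lasso’s L1 penalty tends to zero out the unhelpful coefficients, while Ridge’s L2 penalty only shrinks them:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# only features 0 and 3 actually matter in this fabricated data
y = X @ np.array([3.0, 0.0, 0.0, 1.5, 0.0]) + rng.normal(scale=0.1, size=100)

for model in (LinearRegression(), Ridge(alpha=1.0), Lasso(alpha=0.1)):
    model.fit(X, y)
    print(type(model).__name__, np.round(model.coef_, 2))
```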
Classification|Regression#
SVM (Support Vector Machine)/SVR (Support Vector Regression) - use for Classification by finding the hyperplane that best separates classes of data (SVM), or for Regression by finding a hyperplane that fits the data within a margin of tolerance (SVR). Can be Linear or Non-Linear depending on the Kernel you select.
Multi-Layer Perceptron - the classic feed-forward neural network: layers of neurons with non-linear activations, trained by backpropagation.
Decision Trees (Introduction) (Classification) (Regression) - hierarchical tree structure that works like a flow chart, splitting data into branches based on feature values. Often used as building blocks for Ensemble methods. CART (Classification and Regression Trees), closely related to ID3 (Iterative Dichotomiser 3), is a specific algorithm for building decision trees that can be used for both classification and regression.
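A minimal sketch of one model family handling both tasks, using decision trees (the datasets and max_depth are arbitrary):

```python
from sklearn.datasets import load_iris, load_diabetes
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

X_c, y_c = load_iris(return_X_y=True)      # categorical target -> classification
X_r, y_r = load_diabetes(return_X_y=True)  # continuous target -> regression

clf = DecisionTreeClassifier(max_depth=3).fit(X_c, y_c)
reg = DecisionTreeRegressor(max_depth=3).fit(X_r, y_r)
print(clf.predict(X_c[:1]), reg.predict(X_r[:1]))
```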
Ensemble Learning#
Bagging#
Random Forest (Classification) (Regression) (Hyperparameter Tuning) - create many decision trees, train each on random parts of data, combine results via voting (for classification) or averaging (for regression)
Random Subspace Method - train on random subsets of input features to enhance diversity and improve generalization while reducing overfitting.
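Bagging in practice is often just a Random Forest; a minimal sketch (n_estimators and the dataset are arbitrary choices). Each tree trains on a bootstrap sample of the rows plus a random subset of the features, and the forest votes:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=100, random_state=0)
print(cross_val_score(forest, X, y, cv=5).mean())  # average accuracy over 5 folds
```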
Boosting#
AdaBoost (Adaptive Boosting) - assigns weights to data points so later learners focus on the challenging examples, then combines the weak classifiers with weighted voting
GBM (Gradient Boosting Machines) - sequentially build decision trees, each tree correcting errors of previous ones
XGBoost (Extreme Gradient Boosting) - adds optimizations like regularization and parallel processing for robustness and efficiency
CatBoost (Categorical Boosting) - handles categorical features natively without extensive preprocessing
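A small boosting sketch with scikit-learn’s built-in implementations (hyperparameters are illustrative only); each new tree focuses on what the previous ones got wrong:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
for model in (AdaBoostClassifier(n_estimators=50),
              GradientBoostingClassifier(n_estimators=50, learning_rate=0.1)):
    print(type(model).__name__, cross_val_score(model, X, y, cv=5).mean())
```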
Stacking#
Stacks methods discussed above like K-Nearest Neighbors, Perceptron and Logistic Regression
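For example, a minimal stacking sketch (the choice of base models and meta-model is arbitrary); the base models’ predictions become the features for the final meta-model:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
stack = StackingClassifier(
    estimators=[("knn", KNeighborsClassifier()), ("nb", GaussianNB())],
    final_estimator=LogisticRegression(max_iter=1000),  # the "meta-model"
)
print(cross_val_score(stack, X, y, cv=5).mean())
```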
Unsupervised Learning#
Clustering#
Centroid-Based#
K-Means Clustering - groups data into K clusters based on how close the points are to each other. Iteratively assigns points to the nearest centroid, then recalculates the centroids from the assigned points. Can use the Elbow Method to choose a good value for K (see the sketch after this list)
KMeans++ Clustering - improves K-Means by choosing initial cluster centers intelligently instead of randomly
K-Medoids Clustering - similar to K-Means, but uses actual data points (medoids) as the centers, making it more robust to outliers
FCM (Fuzzy C-Means Clustering) - similar to K-means but uses Fuzzy Clustering, allowing each data point to belong to multiple clusters with varying degrees of membership
K-Mode Clustering - works on categorical data, unlike K-Means which is for numerical data
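A minimal K-Means sketch on synthetic blobs (k-means++ initialization is scikit-learn’s default; n_clusters=3 just matches how the toy data was generated):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
km = KMeans(n_clusters=3, init="k-means++", n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # the learned centroids
print(km.inertia_)          # within-cluster sum of squares, plotted vs. K in the Elbow Method
```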
Distribution-Based#
GMM (Gaussian Mixture Models) - fits data as a weighted mixture of Gaussian distributions and assigns data points based on likelihood
DPMMs (Dirichlet Process Mixture Models) - extension of Gaussian Mixture Models that can automatically decide the number of clusters based on the data
EM (Expectation-Maximization) Algorithm - estimates unknown parameters by alternating between the E-Step (Expectation Step: calculate the expected values of the missing/hidden variables given the current parameters) and the M-Step (Maximization Step: update the parameters to maximize the log-likelihood, i.e. how well the model explains the data)
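For example, scikit-learn’s GaussianMixture is fit with exactly this EM procedure; a minimal sketch (n_components=3 is an assumption about the data):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)  # EM under the hood
print(gmm.predict(X[:5]))        # hard cluster assignments
print(gmm.predict_proba(X[:5]))  # soft assignments: a probability per cluster
```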
Connectivity-Based#
Hierarchical Clustering - create clusters by building a tree step-by-step, merging or splitting groups
Agglomerative Clustering - (Bottom-up) start with each point as a cluster and iteratively merge the closest ones
Divisive Clustering - (Top-down) starts with one cluster and splits iteratively into smaller clusters
Spectral Clustering - groups data by analyzing connections between points using graphs
AP (Affinity Propagation) - identify data clusters by sending messages between data points, calculates optimal number of clusters automatically
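A minimal bottom-up (agglomerative) sketch: each point starts as its own cluster and the closest pairs are merged until n_clusters remain (n_clusters and the linkage rule are arbitrary choices):

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=50, centers=3, random_state=0)
agg = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
print(agg.labels_)  # cluster id per point
```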
Density-Based#
Mean-Shift Clustering - discovers clusters by moving points towards crowded areas
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) - Groups points with sufficient neighbors, labels sparse points as noise
OPTICS (Ordering Points To Identify the Clustering Structure) - extends DBSCAN to handle clusters of varying density
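A minimal DBSCAN sketch on the classic two-moons shape, which centroid-based methods handle poorly (eps and min_samples are guesses that generally need tuning per dataset):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=0)
db = DBSCAN(eps=0.3, min_samples=5).fit(X)
print(set(db.labels_))  # cluster ids; -1 marks points labeled as noise
```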
Dimensionality Reduction#
PCA (Principal Component Analysis) - Reduces dimensions by transforming data into uncorrelated principal components.
NMF (Non-negative Matrix Factorization) - Breaks data into non-negative parts to simplify representation.
Isomap - Captures global data structure by preserving distances along a manifold.
LLE (Locally Linear Embedding) - Reduces dimensions while preserving the relationships between nearby points.
LDA (Linear Discriminant Analysis) - Reduces dimensions while maximizing class separability for classification tasks.
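A minimal PCA sketch: project the 4-dimensional iris measurements down to 2 uncorrelated components while keeping as much variance as possible (n_components=2 is an arbitrary choice):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)
print(X_2d.shape)                      # (150, 2)
print(pca.explained_variance_ratio_)   # variance kept by each component
```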
Association Rule Mining#
Apriori Algorithm (Implementation) - Finds patterns by exploring frequent item combinations step-by-step.
FP-Growth (Frequent Pattern-Growth) - An Efficient Alternative to Apriori. It quickly identifies frequent patterns without generating candidate sets.
ECLAT (Equivalence Class Clustering and Bottom-Up Lattice Traversal) - Uses intersections of itemsets to efficiently find frequent patterns.
Efficient Tree-based Algorithms - scale to large datasets by organizing data in tree structures.
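A toy, from-scratch sketch of the core Apriori idea (the baskets and min_support are made up): count item pairs that appear together often enough. Real implementations generalize this to itemsets of any size and derive rules from them:

```python
from collections import Counter
from itertools import combinations

transactions = [            # fabricated shopping baskets
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]
min_support = 0.5
pair_counts = Counter(
    pair for t in transactions for pair in combinations(sorted(t), 2)
)
for pair, count in pair_counts.items():
    support = count / len(transactions)  # fraction of baskets containing the pair
    if support >= min_support:
        print(pair, support)
```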
Reinforcement Learning#
Model-Based#
MDPs (Markov Decision Processes) - a framework for step-by-step decisions where the results of actions are uncertain: states, actions, transition probabilities, and rewards. Exact solvers like value iteration sweep over every state and action, which only scales to small problems (see the sketch after this list).
Monte Carlo Tree Search - designed to solve problems with huge decision spaces, like the board game Go with \(10^{170}\) possible board states, by building a search tree iteratively/randomly instead of exploring all possible moves.
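A tiny value-iteration sketch for a fabricated 2-state MDP, showing the “evaluate everything” flavor of exact model-based methods: because the transition probabilities and rewards are known, we can sweep over every state and action (all names and numbers below are invented):

```python
GAMMA = 0.9  # discount factor for future rewards
# transitions[state][action] = list of (probability, next_state, reward)
transitions = {
    "s0": {"stay": [(1.0, "s0", 0.0)],
           "go":   [(0.8, "s1", 1.0), (0.2, "s0", 0.0)]},
    "s1": {"stay": [(1.0, "s1", 2.0)],
           "go":   [(1.0, "s0", 0.0)]},
}
V = {s: 0.0 for s in transitions}  # value of each state, initially 0
for _ in range(100):  # repeat Bellman backups until (roughly) converged
    V = {
        s: max(
            sum(p * (r + GAMMA * V[s2]) for p, s2, r in outcomes)
            for outcomes in actions.values()
        )
        for s, actions in transitions.items()
    }
print(V)  # estimated long-term value of s0 and s1
```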
Model-Free#
Q-Learning - makes trial-and-error guesses, building and updating a Q-table which stores Q-values: estimates of how good it is to take a specific action in a given state (see the tabular sketch after this list)
Deep Q-Learning - regular Q-Learning is good for small problems, but struggles on complex ones (like images) since the Q-table gets huge and computationally expensive. Deep Q-Learning fixes this by using a neural network to estimate the Q-values instead of a Q-table
SARSA (State-Action-Reward-State-Action) - like Q-Learning, but on-policy: it updates Q-values using the action the agent actually takes next (hence the name), learning an optimal policy by exploring, receiving feedback, and updating behavior for long-term rewards.
REINFORCE Algorithm - instead of estimating how good each action is, just tries actions and adjusts the chances of those actions based on the total reward afterwards
Actor-Critic Algorithm - combines an Actor (which selects actions via a Policy Gradient) and Critic (which evaluates the Actor via a Value Function), both of which learn (like your Loss function is getting smarter alongside your model)
A3C (Asynchronous Advantage Actor-Critic) - uses multiple agents which learn in parallel, each interacting with their own private environments, then contribute their updates to a shared global model.
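A from-scratch tabular Q-Learning sketch on a fabricated 1-D corridor (move left or right, reward 1 for reaching the rightmost cell; all constants are arbitrary, and epsilon is kept high so this toy explores quickly):

```python
import random

N_STATES, ACTIONS = 5, (-1, +1)        # cells 0..4, goal at cell 4
alpha, gamma, epsilon = 0.1, 0.9, 0.3  # learning rate, discount, exploration rate
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for _ in range(1000):  # episodes of trial and error
    s = 0
    while s != N_STATES - 1:
        # epsilon-greedy: mostly exploit the Q-table, sometimes explore
        a = random.choice(ACTIONS) if random.random() < epsilon \
            else max(ACTIONS, key=lambda act: Q[(s, act)])
        s2 = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s2 == N_STATES - 1 else 0.0
        # Q-Learning update: nudge towards reward + best future value
        Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, b)] for b in ACTIONS) - Q[(s, a)])
        s = s2

# the learned policy should be "move right" (+1) everywhere
print({s: max(ACTIONS, key=lambda act: Q[(s, act)]) for s in range(N_STATES - 1)})
```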
Forecasting Models#
ARIMA (Auto-Regressive Integrated Moving Average) - combines Autoregression (AR), Differencing (I), and Moving Averages (MA) to capture patterns and predict future values based on historical data. Not great with seasonal data.
SARIMA (Seasonal ARIMA) - extension of ARIMA designed for time series data with seasonal patterns.
Exponential Smoothing - assumes future patterns will be similar to more recent past data, and focuses on learning the average demand level over time. Simple and accurate for short-term forecasts, not great for long-term forecasts. Comes in Simple, Double, and Holt-Winters (triple) variants (a from-scratch sketch follows this list).
RNNs (Recurrent Neural Networks) (Tensorflow Example) - neural networks where information can be passed backwards as well as forwards. They have many uses beyond forecasting, such as text generation
LSTM (Long Short-Term Memory) - use a memory mechanism to overcome the vanishing gradient problem
GRU (Gated Recurrent Unit) - a more efficient variant of the LSTM that combines the input/forget gates and streamlines the output mechanism
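A from-scratch sketch of Simple Exponential Smoothing (the demand numbers and alpha are invented): the forecast is a running weighted average in which recent observations count more.

```python
def simple_exp_smoothing(series, alpha=0.3):
    """Return the flat next-period forecast for a series of observations."""
    level = series[0]
    for x in series[1:]:
        level = alpha * x + (1 - alpha) * level  # blend new data into the level
    return level

demand = [12, 15, 14, 16, 18, 17, 19]  # fabricated demand history
print(simple_exp_smoothing(demand))
```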
Semi-Supervised Learning#
Self-Training - The model is first trained on labeled data. It then predicts labels for unlabeled data, adding high-confidence predictions to the labeled set iteratively to refine the model. Includes Pseudo Labelling (see the sketch after this list).
Co-Training - Two or more models are trained on different feature subsets of the data (like one model looks at the body of an email, another looks at the subject and sender, etc). Each model labels unlabeled data for the other, enabling them to learn from complementary views.
Multi-View Training - A variation of co-training where models train on different data representations (e.g., images and text) to predict the same output.
Graph-Based Models (Label Propagation) - Data is represented as a graph with nodes (data points) and edges (similarities). Labels are propagated from labeled nodes to unlabeled ones based on graph connectivity.
GAN (Generative Adversarial Network) (PyTorch Example) - create new, realistic data by learning from existing examples (creates good synthetic data)
Few-Shot Learning - a meta-learning process where you train the model to learn quickly from new and unseen data, so you don’t have to train it with a bunch of data initially; it adapts from just a handful of labeled examples when you use it later.
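A minimal Self-Training sketch with scikit-learn: hide most of the labels, mark unlabeled points with -1, and let the wrapper iteratively pseudo-label the confident ones (the base model and the 70% mask are arbitrary):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.7] = -1  # -1 means "unlabeled"

model = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
model.fit(X, y_partial)  # trains on the labels plus confident pseudo-labels
print(model.score(X, y))  # scored against the true labels
```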
Self-Supervised Learning#
Haven’t found specific examples for this yet; most links are to research papers.