Prof. Dr. Ingo Claßen
Common Machine Learning Algorithms - DSML

Remarks

The following slides have been generated by Google Gemini

They are intended to give a rough overview

The formulas presented are just for reference

Ordinary Least Squares (OLS)

  • Mechanism: Fundamental linear method that finds parameters ($\hat\beta$) minimizing the Residual Sum of Squares (RSS), also known as the L2 loss:
    $$\min_{\beta} ||Y - X\beta||^2_2$$
  • Statistical Property: When Classical Linear Regression Model (CLRM) assumptions are met, OLS produces the Best Linear Unbiased Estimator (BLUE).
  • Limitations: Highly sensitive to outliers and prone to high variance in noisy or multicollinear data environments. Fails when the number of features ($p$) exceeds the number of samples ($n$).

When to Use OLS:

  • When the data relationship is believed to be truly linear.
  • When the primary goal is statistical inference (clear coefficient interpretability and hypothesis testing).
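For reference, a minimal OLS sketch on synthetic data, assuming NumPy and scikit-learn are available (the library choice is an assumption of this example, not part of the slides):

```python
# Minimal OLS sketch on synthetic data (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # n = 100 samples, p = 3 features
beta_true = np.array([1.5, -2.0, 0.5])        # "true" coefficients for the simulation
y = X @ beta_true + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)            # minimizes ||y - X*beta||_2^2
print(ols.coef_, ols.intercept_)              # beta-hat should be close to beta_true
```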

Regularized Linear Models: Lasso (L1)

  • Mechanism: Modifies OLS by adding an L1 penalty (sum of absolute coefficients):
    $$\min_{\beta} ||Y - X\beta||^2_2 + \lambda \sum_{i=1}^p |\beta_i|$$
  • Key Feature: Feature Selection (Sparsity): The L1 penalty forces the coefficients of irrelevant or redundant features to become exactly zero.
  • Multicollinearity Issue: Lasso tends to arbitrarily select only one feature from a group of highly correlated predictors and zero out the rest.

When to Use Lasso:

  • For high-dimensional datasets where many features are expected to be irrelevant.
  • When model simplicity and interpretability through automatic feature selection are paramount.
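A minimal Lasso sketch (scikit-learn assumed; its `alpha` parameter plays the role of $\lambda$ above), showing how coefficients of irrelevant features are driven to exactly zero:

```python
# Minimal Lasso sketch: only 3 of 20 features carry signal (illustrative only).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.0, 0.5]              # the only relevant features
y = X @ beta_true + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)            # L1 penalty induces sparsity
print(np.flatnonzero(lasso.coef_))            # indices of the features the model kept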

Regularized Linear Models: Elastic Net

  • Mechanism: Combines the penalties of both L1 (Lasso) and L2 (Ridge) regularization, balancing feature selection with coefficient stability.
  • Dual Hyperparameters: Two hyperparameters control the overall penalty strength ($\lambda$) and the L1/L2 mix ($\alpha$).
  • Advantage: Grouping Effect: By incorporating the L2 penalty, Elastic Net addresses Lasso's instability with correlated features. It tends to include (or exclude) all highly correlated features together with similar weights.

When to Use Elastic Net:

  • As the robust, general-purpose default regularization technique for linear models.
  • When the dataset is both high-dimensional and known (or suspected) to contain significant multicollinearity.
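A minimal Elastic Net sketch (scikit-learn assumed). Note the naming mismatch: scikit-learn's `alpha` is the overall strength ($\lambda$ above) and `l1_ratio` is the L1/L2 mix ($\alpha$ above). The example builds a block of highly correlated features to illustrate the grouping effect:

```python
# Minimal Elastic Net sketch with correlated features (illustrative only).
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))                 # latent signal
X = np.hstack([z + 0.05 * rng.normal(size=(200, 1)) for _ in range(5)])  # 5 correlated copies
y = z.ravel() + rng.normal(scale=0.1, size=200)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)                             # correlated features tend to get similar weights
```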

Support Vector Machines (SVM)

  • Mechanism (Classification): Constructs the maximum-margin hyperplane that optimally separates data classes. The decision boundary is defined exclusively by the Support Vectors (data points closest to the margin).
  • Non-Linearity: The Kernel Trick: Allows SVM to implicitly map non-linearly separable input data into a higher-dimensional feature space where a linear boundary can be found.
  • Computational Profile: Training complexity is high, often $O(n^2 p + n^3)$, where $n$ is the number of samples and $p$ the number of features, limiting scalability for very large datasets.

When to Use SVM:

  • For complex classification tasks in high-dimensional feature spaces.
  • When the dataset size is moderate (typically $N < 50,000$).
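A minimal SVM sketch with an RBF kernel (scikit-learn assumed), fitted on data that is not linearly separable in the input space:

```python
# Minimal kernel SVM sketch (illustrative only).
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)   # not linearly separable
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)       # kernel trick: implicit feature map
print(clf.n_support_)          # support vectors per class define the decision boundary
print(clf.score(X, y))         # training accuracy
```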

Nearest Neighbors (k-NN)

  • Mechanism: A simple, non-parametric, instance-based algorithm. Prediction is based on the majority class (or average value) of the $k$ training examples closest to the new data point.
  • Lazy Learning: There is no explicit training phase; the entire training dataset is stored until prediction time.
  • Limitations: Prediction speed is slow for large datasets ($O(N)$ complexity). Performance degrades significantly with high dimensionality (Curse of Dimensionality).

When to Use k-NN:

  • For small to moderate datasets where simplicity and local focus are valued.
  • When the training data is constantly updated and the model must reflect new observations immediately (no retraining step).
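
A minimal k-NN sketch (scikit-learn assumed); features are standardized first because predictions depend directly on distances in feature space:

```python
# Minimal k-NN sketch (illustrative only).
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X, y)                                 # "lazy" learning: fit() mostly just stores the data
print(knn.predict(X[:3]))                     # majority vote among the 5 nearest stored points
```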

Naive Bayes (NB)

  • Mechanism: A highly efficient probabilistic classifier based on Bayes' theorem.
  • Core Assumption: Features are assumed to be conditionally independent of each other, given the class label ("Naive").
  • Variants: Gaussian (continuous features), Multinomial (discrete counts/frequencies, ideal for text), Bernoulli (binary features).

When to Use Naive Bayes:

  • As the foundational method for text classification (spam filtering, sentiment analysis).
  • When computational speed and efficiency in high-dimensional discrete spaces are critical.
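A minimal Multinomial Naive Bayes sketch for text classification (scikit-learn assumed; the tiny corpus and spam/ham labels are made up for illustration):

```python
# Minimal text-classification sketch with Multinomial Naive Bayes (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["free money now", "meeting at noon", "win a free prize", "project status update"]
labels = [1, 0, 1, 0]                         # 1 = spam, 0 = ham (hypothetical labels)

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)                       # learns per-class word counts under the independence assumption
print(model.predict(["free prize meeting"]))
```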

Decision Trees (DT)

  • Mechanism: Non-parametric, white-box model that partitions the feature space into if-then rules based on maximizing purity (e.g., Gini Impurity or Information Gain/Entropy).
  • Key Challenge: Overfitting: A single deep tree is highly prone to overfitting (high variance) and structural instability.
  • Regularization (Pruning): Must be controlled using constraints (Pre-pruning: max_depth, min_samples_leaf) or post-pruning techniques (Cost-Complexity Pruning, Reduced Error Pruning).

When to Use Decision Trees:

  • When maximum interpretability and clear, traceable logic are non-negotiable requirements (e.g., regulated environments).
  • As the base learner for more robust ensemble methods (RF, GB).
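A minimal decision-tree sketch with pre-pruning (scikit-learn assumed), printing the learned if-then rules:

```python
# Minimal pre-pruned decision tree sketch (illustrative only).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini",      # split on Gini impurity
                              max_depth=3,           # pre-pruning constraints
                              min_samples_leaf=5,
                              random_state=0).fit(X, y)
print(export_text(tree))                             # traceable white-box rules
```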

Ensemble Method: Random Forests (RF)

  • Mechanism: Bagging (Bootstrap Aggregating): Trains multiple deep decision trees independently on bootstrapped data samples and uses a random subset of features for each split. Predictions are aggregated (voting/averaging).
  • Bias-Variance Focus: Primarily targets Variance Reduction. The randomization and averaging decorrelate the errors of individual high-variance trees, leading to superior generalization.
  • Practical Advantages: Robust to noisy features, handles missing data well, and is highly parallelizable (fast training).

When to Use Random Forests:

  • As a robust, high-performance default/baseline model for most supervised tasks.
  • When stability, fast training, and effective handling of noisy or missing data are required.
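A minimal random forest sketch on synthetic data (scikit-learn assumed); bagging plus per-split feature subsampling is what decorrelates the individual trees:

```python
# Minimal random forest sketch (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
rf = RandomForestClassifier(n_estimators=300,        # many deep trees trained independently
                            max_features="sqrt",     # random feature subset per split
                            n_jobs=-1,               # trees can be trained in parallel
                            random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())        # averaging the trees reduces variance
```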

Ensemble Method: Gradient Boosting (GB)

  • Mechanism: Sequential Correction: Builds models iteratively. Each new, typically shallow decision tree is constructed to correct the systematic errors (pseudo-residuals) made by the combined ensemble of all previous trees.
  • Bias-Variance Focus: Aggressively targets Bias Reduction. This high focus on training error makes it prone to overfitting if not heavily regularized.
  • Regularization: Requires careful tuning of a small learning rate (shrinkage) and often subsampling to control variance.
  • Modern Variants: XGBoost (advanced regularization, missing data handling), LightGBM (speed/efficiency for very large $N$), CatBoost (excels at categorical features).

When to Use Gradient Boosting:

  • When the absolute highest predictive accuracy is the primary goal on structured/tabular data.
  • When resources permit extensive hyperparameter tuning for maximum performance gains.
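A minimal gradient boosting sketch using scikit-learn's histogram-based implementation (an assumption of this example; XGBoost, LightGBM, and CatBoost expose similar knobs under different names):

```python
# Minimal gradient boosting sketch (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
gb = HistGradientBoostingClassifier(
    learning_rate=0.05,          # shrinkage: small sequential correction steps
    max_iter=500,                # number of boosting rounds (trees)
    max_depth=3,                 # shallow trees as weak learners
    l2_regularization=1.0,       # explicit regularization against overfitting
    random_state=0,
)
print(cross_val_score(gb, X, y, cv=5).mean())
```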

Comparison Table: Linear and Regularized Models

| Algorithm | Primary Mechanism | Focus/Benefit | Multicollinearity Handling | When to Use |
|---|---|---|---|---|
| Ordinary Least Squares (OLS) | Minimizes RSS (L2 Loss) | BLUE Estimates, Maximum Interpretability | Poor, highly sensitive to collinearity | Data is small, linear assumptions hold, interpretation of coefficients is critical. |
| Lasso Regression | L1 Regularization (Absolute Sum) | Feature Selection (induces sparsity) | Poor, arbitrarily selects one correlated feature | High-dimensional data where few features are relevant; seeking a sparse model. |
| Elastic Net | L1 + L2 Combination | Stability and Feature Grouping | Excellent, groups correlated features | Datasets with high dimensions and significant multicollinearity. |

Comparison Table: Other Models and Ensembles

| Algorithm | Core Principle | Primary Error Focus | Key Trade-off | Applicability Context |
|---|---|---|---|---|
| Support Vector Machines (SVM) | Maximum Margin Hyperplane (Kernel Trick) | Reduction of Error / Optimization | High Training Complexity ($O(n^2)$) vs. High Accuracy | Medium-sized, high-dimensional data; superior classification when clear margin separation exists. |
| Nearest Neighbors (k-NN) | Classification based on $k$ closest neighbors | Local Data Structure is Predictive | Zero Training Cost vs. Slow Prediction Time ($O(N)$) | Simple, low-latency problems where updating the model is frequent; small datasets. |
| Naive Bayes (NB) | Computes posterior probability via Bayes' Theorem | Speed and Efficiency | Strong Independence Assumption vs. Extremely Fast Performance | Text classification, spam filtering, and high-dimensional categorical/discrete data. |
| Decision Tree (DT) | Recursive Feature Space Partitioning | High Variance / Overfitting | High Interpretability (White-Box) vs. Low Predictive Stability | When model interpretability is the paramount constraint; use as base learner for ensembles. |
| Random Forest (RF) | Bagging (Parallel Ensemble) | Reduction of Variance | High Stability/Robustness vs. Reduced Interpretability; Fast Training | Default, robust choice for high-performance and stable results. |
| Gradient Boosting (GB) | Boosting (Sequential Ensemble) | Reduction of Bias | Maximum Accuracy vs. High Computational Cost and Tuning Sensitivity | When the absolute highest predictive accuracy is required from structured data. |

Strategic Selection Framework

The choice of algorithm depends on balancing core constraints: accuracy, interpretability, data size, and computational feasibility.

  1. Interpretability Required (White-Box):
    • Choose: OLS (for linear inference) or Decision Trees (for traceable, non-linear rules).
  2. High-Dimensional/Noisy Data:
    • Choose: Elastic Net (for balanced regularization and stability).
  3. Maximum Accuracy & Robustness:
    • Default Baseline: Random Forests (stable, fast, handles noise well).
    • Highest Ceiling: Gradient Boosting (if resources allow for extensive tuning on clean, structured data).
  4. Computational Constraints:
    • Avoid: General SVM solvers or k-NN inference for extremely large sample sizes ($N$) due to high computational complexity.
    • Prefer: Naive Bayes, Linear Models, or modern, scalable boosting methods (e.g., LightGBM) for massive datasets.