Prof. Dr. Ingo Claßen
Common Machine Learning Algorithms - DSML

Remarks

The following slides have been generated by Google Gemini

They are intended to give a rough overview

The formulas presented are just for reference

Ordinary Least Squares (OLS)

  • Mechanism: Fundamental linear method that finds parameters ($\hat\beta$) minimizing the Residual Sum of Squares (RSS), also known as the L2 loss:
    $$\min_{\beta} ||Y - X\beta||^2_2$$
  • Statistical Property: When Classical Linear Regression Model (CLRM) assumptions are met, OLS produces the Best Linear Unbiased Estimator (BLUE).
  • Limitations: Highly sensitive to outliers and prone to high variance in noisy or multicollinear data environments. Fails when the number of features ($p$) exceeds the number of samples ($n$).

When to Use OLS:

  • When the data relationship is believed to be truly linear.
  • When the primary goal is statistical inference (clear coefficient interpretability and hypothesis testing).
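For reference, a minimal OLS sketch on synthetic data, assuming NumPy and scikit-learn are available (the library choice is an assumption of this example, not part of the slides):

```python
# Minimal OLS sketch on synthetic data (illustrative only).
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                 # n = 100 samples, p = 3 features
beta_true = np.array([1.5, -2.0, 0.5])        # "true" coefficients for the simulation
y = X @ beta_true + rng.normal(scale=0.1, size=100)

ols = LinearRegression().fit(X, y)            # minimizes ||y - X*beta||_2^2
print(ols.coef_, ols.intercept_)              # beta-hat should be close to beta_true
```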

Regularized Linear Models: Lasso (L1)

  • Mechanism: Modifies OLS by adding an L1 penalty (sum of absolute coefficients):
    $$\min_{\beta} ||Y - X\beta||^2_2 + \lambda \sum_{i=1}^p |\beta_i|$$
  • Key Feature: Feature Selection (Sparsity): The L1 penalty forces the coefficients of irrelevant or redundant features to become exactly zero.
  • Multicollinearity Issue: Lasso tends to arbitrarily select only one feature from a group of highly correlated predictors and zero out the rest.

When to Use Lasso:

  • For high-dimensional datasets where many features are expected to be irrelevant.
  • When model simplicity and interpretability through automatic feature selection are paramount.
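A minimal Lasso sketch (scikit-learn assumed; its `alpha` parameter plays the role of $\lambda$ above), showing how coefficients of irrelevant features are driven to exactly zero:

```python
# Minimal Lasso sketch: only 3 of 20 features carry signal (illustrative only).
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))
beta_true = np.zeros(20)
beta_true[:3] = [2.0, -1.0, 0.5]              # the only relevant features
y = X @ beta_true + rng.normal(scale=0.1, size=100)

lasso = Lasso(alpha=0.1).fit(X, y)            # L1 penalty induces sparsity
print(np.flatnonzero(lasso.coef_))            # indices of the features the model kept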

Regularized Linear Models: Elastic Net

  • Mechanism: Combines the penalties of both L1 (Lasso) and L2 (Ridge) regularization, balancing feature selection with coefficient stability.
  • Dual Hyperparameters: Two hyperparameters control the overall penalty strength ($\lambda$) and the L1/L2 mix ($\alpha$).
  • Advantage: Grouping Effect: By incorporating the L2 penalty, Elastic Net addresses Lasso's instability with correlated features. It tends to include (or exclude) all highly correlated features together with similar weights.

When to Use Elastic Net:

  • As the robust, general-purpose default regularization technique for linear models.
  • When the dataset is both high-dimensional and known (or suspected) to contain significant multicollinearity.
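A minimal Elastic Net sketch (scikit-learn assumed). Note the naming mismatch: scikit-learn's `alpha` is the overall strength ($\lambda$ above) and `l1_ratio` is the L1/L2 mix ($\alpha$ above). The example builds a block of highly correlated features to illustrate the grouping effect:

```python
# Minimal Elastic Net sketch with correlated features (illustrative only).
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(0)
z = rng.normal(size=(200, 1))                 # latent signal
X = np.hstack([z + 0.05 * rng.normal(size=(200, 1)) for _ in range(5)])  # 5 correlated copies
y = z.ravel() + rng.normal(scale=0.1, size=200)

enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)                             # correlated features tend to get similar weights
```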

Support Vector Machines (SVM)

  • Mechanism (Classification): Constructs the maximum-margin hyperplane that optimally separates data classes. The decision boundary is defined exclusively by the Support Vectors (data points closest to the margin).
  • Non-Linearity: The Kernel Trick: Allows SVM to implicitly map non-linearly separable input data into a higher-dimensional feature space where a linear boundary can be found.
  • Computational Profile: Training complexity is high, often $O(n^2 p + n^3)$, where $n$ is the number of samples and $p$ the number of features, limiting scalability for very large datasets.

When to Use SVM:

  • For complex classification tasks in high-dimensional feature spaces.
  • When the dataset size is moderate (typically $N < 50,000$).
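A minimal SVM sketch with an RBF kernel (scikit-learn assumed), fitted on data that is not linearly separable in the input space:

```python
# Minimal kernel SVM sketch (illustrative only).
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=500, noise=0.2, random_state=0)   # not linearly separable
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)       # kernel trick: implicit feature map
print(clf.n_support_)          # support vectors per class define the decision boundary
print(clf.score(X, y))         # training accuracy
```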

Nearest Neighbors (k-NN)

  • Mechanism: A simple, non-parametric, instance-based algorithm. Prediction is based on the majority class (or average value) of the $k$ training examples closest to the new data point.
  • Lazy Learning: There is no explicit training phase; the entire training dataset is stored until prediction time.
  • Limitations: Prediction speed is slow for large datasets ($O(N)$ complexity). Performance degrades significantly with high dimensionality (Curse of Dimensionality).

When to Use k-NN:

  • For small to moderate datasets where simplicity and local focus are valued.
  • When the training data is constantly updated and the model must reflect new observations immediately (no retraining step).
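
A minimal k-NN sketch (scikit-learn assumed); features are standardized first because predictions depend directly on distances in feature space:

```python
# Minimal k-NN sketch (illustrative only).
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X, y)                                 # "lazy" learning: fit() mostly just stores the data
print(knn.predict(X[:3]))                     # majority vote among the 5 nearest stored points
```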

Naive Bayes (NB)

  • Mechanism: A highly efficient probabilistic classifier based on Bayes' theorem.
  • Core Assumption: Features are assumed to be conditionally independent of each other, given the class label ("Naive").
  • Variants: Gaussian (continuous features), Multinomial (discrete counts/frequencies, ideal for text), Bernoulli (binary features).

When to Use Naive Bayes:

  • As the foundational method for text classification (spam filtering, sentiment analysis).
  • When computational speed and efficiency in high-dimensional discrete spaces are critical.
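A minimal Multinomial Naive Bayes sketch for text classification (scikit-learn assumed; the tiny corpus and spam/ham labels are made up for illustration):

```python
# Minimal text-classification sketch with Multinomial Naive Bayes (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["free money now", "meeting at noon", "win a free prize", "project status update"]
labels = [1, 0, 1, 0]                         # 1 = spam, 0 = ham (hypothetical labels)

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(docs, labels)                       # learns per-class word counts under the independence assumption
print(model.predict(["free prize meeting"]))
```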

Decision Trees (DT)

  • Mechanism: Non-parametric, white-box model that partitions the feature space into if-then rules based on maximizing purity (e.g., Gini Impurity or Information Gain/Entropy).
  • Key Challenge: Overfitting: A single deep tree is highly prone to overfitting (high variance) and structural instability.
  • Regularization (Pruning): Must be controlled using constraints (Pre-pruning: max_depth, min_samples_leaf) or post-pruning techniques (Cost-Complexity Pruning, Reduced Error Pruning).

When to Use Decision Trees:

  • When maximum interpretability and clear, traceable logic are non-negotiable requirements (e.g., regulated environments).
  • As the base learner for more robust ensemble methods (RF, GB).
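A minimal decision-tree sketch with pre-pruning (scikit-learn assumed), printing the learned if-then rules:

```python
# Minimal pre-pruned decision tree sketch (illustrative only).
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini",      # split on Gini impurity
                              max_depth=3,           # pre-pruning constraints
                              min_samples_leaf=5,
                              random_state=0).fit(X, y)
print(export_text(tree))                             # traceable white-box rules
```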

Ensemble Method: Random Forests (RF)

  • Mechanism: Bagging (Bootstrap Aggregating): Trains multiple deep decision trees independently on bootstrapped data samples and uses a random subset of features for each split. Predictions are aggregated (voting/averaging).
  • Bias-Variance Focus: Primarily targets Variance Reduction. The randomization and averaging decorrelate the errors of individual high-variance trees, leading to superior generalization.
  • Practical Advantages: Robust to noisy features, handles missing data well, and is highly parallelizable (fast training).

When to Use Random Forests:

  • As a robust, high-performance default/baseline model for most supervised tasks.
  • When stability, fast training, and effective handling of noisy or missing data are required.
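A minimal random forest sketch on synthetic data (scikit-learn assumed); bagging plus per-split feature subsampling is what decorrelates the individual trees:

```python
# Minimal random forest sketch (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, n_informative=5, random_state=0)
rf = RandomForestClassifier(n_estimators=300,        # many deep trees trained independently
                            max_features="sqrt",     # random feature subset per split
                            n_jobs=-1,               # trees can be trained in parallel
                            random_state=0)
print(cross_val_score(rf, X, y, cv=5).mean())        # averaging the trees reduces variance
```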

Ensemble Method: Gradient Boosting (GB)

  • Mechanism: Sequential Correction: Builds models iteratively. Each new, typically shallow decision tree is constructed to correct the systematic errors (pseudo-residuals) made by the combined ensemble of all previous trees.
  • Bias-Variance Focus: Aggressively targets Bias Reduction. This high focus on training error makes it prone to overfitting if not heavily regularized.
  • Regularization: Requires careful tuning of a small learning rate (shrinkage) and often subsampling to control variance.
  • Modern Variants: XGBoost (advanced regularization, missing data handling), LightGBM (speed/efficiency for very large $N$), CatBoost (excels at categorical features).

When to Use Gradient Boosting:

  • When the absolute highest predictive accuracy is the primary goal on structured/tabular data.
  • When resources permit extensive hyperparameter tuning for maximum performance gains.
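A minimal gradient boosting sketch using scikit-learn's histogram-based implementation (an assumption of this example; XGBoost, LightGBM, and CatBoost expose similar knobs under different names):

```python
# Minimal gradient boosting sketch (illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=2000, n_features=20, n_informative=5, random_state=0)
gb = HistGradientBoostingClassifier(
    learning_rate=0.05,          # shrinkage: small sequential correction steps
    max_iter=500,                # number of boosting rounds (trees)
    max_depth=3,                 # shallow trees as weak learners
    l2_regularization=1.0,       # explicit regularization against overfitting
    random_state=0,
)
print(cross_val_score(gb, X, y, cv=5).mean())
```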

Comparison Table: Linear and Regularized Models

| Algorithm | Primary Mechanism | Focus/Benefit | Multicollinearity Handling | When to Use |
|---|---|---|---|---|
| Ordinary Least Squares (OLS) | Minimizes RSS (L2 Loss) | BLUE Estimates, Maximum Interpretability | Poor, highly sensitive to collinearity | Data is small, linear assumptions hold, interpretation of coefficients is critical. |
| Lasso Regression | L1 Regularization (Absolute Sum) | Feature Selection (induces sparsity) | Poor, arbitrarily selects one correlated feature | High-dimensional data where few features are relevant; seeking a sparse model. |
| Elastic Net | L1 + L2 Combination | Stability and Feature Grouping | Excellent, groups correlated features | Datasets with high dimensions and significant multicollinearity. |

Comparison Table: Other Models and Ensembles

| Algorithm | Core Principle | Primary Error Focus | Key Trade-off | Applicability Context |
|---|---|---|---|---|
| Support Vector Machines (SVM) | Maximum Margin Hyperplane (Kernel Trick) | Reduction of Error / Optimization | High Training Complexity ($O(n^2)$) vs. High Accuracy | Medium-sized, high-dimensional data; superior classification when clear margin separation exists. |
| Nearest Neighbors (k-NN) | Classification based on $k$ closest neighbors | Local Data Structure is Predictive | Zero Training Cost vs. Slow Prediction Time ($O(N)$) | Simple, low-latency problems where updating the model is frequent; small datasets. |
| Naive Bayes (NB) | Computes posterior probability via Bayes' Theorem | Speed and Efficiency | Strong Independence Assumption vs. Extremely Fast Performance | Text classification, spam filtering, and high-dimensional categorical/discrete data. |
| Decision Tree (DT) | Recursive Feature Space Partitioning | High Variance / Overfitting | High Interpretability (White-Box) vs. Low Predictive Stability | When model interpretability is the paramount constraint; use as base learner for ensembles. |
| Random Forest (RF) | Bagging (Parallel Ensemble) | Reduction of Variance | High Stability/Robustness vs. Reduced Interpretability; Fast Training | Default, robust choice for high-performance and stable results. |
| Gradient Boosting (GB) | Boosting (Sequential Ensemble) | Reduction of Bias | Maximum Accuracy vs. High Computational Cost and Tuning Sensitivity | When the absolute highest predictive accuracy is required from structured data. |

Strategic Selection Framework

The choice of algorithm depends on balancing core constraints: accuracy, interpretability, data size, and computational feasibility.

  1. Interpretability Required (White-Box):
    • Choose: OLS (for linear inference) or Decision Trees (for traceable, non-linear rules).
  2. High-Dimensional/Noisy Data:
    • Choose: Elastic Net (for balanced regularization and stability).
  3. Maximum Accuracy & Robustness:
    • Default Baseline: Random Forests (stable, fast, handles noise well).
    • Highest Ceiling: Gradient Boosting (if resources allow for extensive tuning on clean, structured data).
  4. Computational Constraints:
    • Avoid: General SVM solvers or k-NN inference for extremely large sample sizes ($N$) due to high computational complexity.
    • Prefer: Naive Bayes, Linear Models, or modern, scalable boosting methods (e.g., LightGBM) for massive datasets.