Choosing the Right Algorithm Based on Data Characteristics

“Algorithms aren’t magic. They’re tools that need the right kind of data to work effectively.”

For each algorithm, we’ll outline:

  1. What the data needs to look like for it to work well
  2. What kind of data will break it or mislead it

Decision Tree

Data Characteristics it Loves:

  • Data where decision rules are clear (e.g., “If X > threshold, predict Y”).
  • Mix of categorical and numerical variables.
  • Small to medium-sized datasets.
  • Features with low noise in splits.

What Breaks It:

  • Datasets with many subtle interactions between features.
  • Very noisy data (leads to overfitting).
  • Continuous features with no obvious splitting points.

Checklist:
Rule-based patterns in data
Features you want to visualize as logic flows
* Some tolerance for small datasets

Random Forest

Data Characteristics it Loves:

  • Large, messy datasets with both categorical and numerical features.
  • High-dimensional feature spaces.
  • Data with non-linear relationships.
  • Datasets with moderate noise.
  • Datasets where feature importance matters.

What Breaks It:

  • Requires a lot of computation for large datasets.
  • Hard to use if model explainability is critical.
  • Doesn’t perform well on extremely sparse datasets (e.g., text data without embeddings).

Checklist:
Plenty of data (thousands to millions of rows)
Both categorical & numeric features
* Performance > Explainability

k-Nearest Neighbors (k-NN)

Data Characteristics it Loves:

  • Small datasets where relationships are localized (spatial, time proximity).
  • Numerical features where distance makes sense.
  • Low-dimensional data or data preprocessed with PCA or dimensionality reduction.
  • Few irrelevant features.

What Breaks It:

  • High-dimensional datasets (curse of dimensionality).
  • Data with irrelevant or noisy features (pollutes neighbor selection).
  • Large datasets (slow predictions).

Checklist:
Dataset fits in memory
Features can be compared using distance metrics
* Task has natural “neighborhoods” (e.g., location or clusters in time)

Universal Criteria to Always Consider

Regardless of algorithm, make students reflect on:

Criterion Why It Matters
Data Cleanliness Missing data or outliers will skew most algorithms.
Feature Relevance Irrelevant features can degrade accuracy, especially for k-NN.
Dimensionality High dimensions → bad for k-NN, challenging for visualizations.
Size of Dataset Logistic Regression handles small datasets well; Random Forest loves big data.
Interpretability Needs Decision Trees & Logistic Regression are explainable; Random Forest is not.
Class Balance (for classifiers) Imbalanced classes can fool accuracy metrics.

Logistic Regression — Data Characteristics Breakdown

1. Features have a Linear Relationship with the Target Outcome

  • What this means:
    The model assumes that the log-odds of the outcome (probability of being in a class) is a linear combination of the input features.

  • Why it matters:
    Logistic regression can only draw straight lines (or planes) to separate classes in feature space.

  • If the actual data distribution requires curved or complex boundaries, logistic regression will struggle.

  • Example: If incidents that occur after 5 PM are sometimes hazardous and sometimes not (depending on weather), a straight-line model won't capture the interaction.

  • Bottom Line:
    Logistic Regression shines when a weighted sum of features can clearly separate the classes.

2. The Dataset Isn’t Too Noisy or High-Dimensional

  • What this means:

  • Noise \= random, irrelevant fluctuations in feature values (like typos, sensor glitches, or chaotic patterns).

  • High-dimensionality \= having too many features, many of which may be irrelevant.

  • Why it matters:

  • Logistic Regression looks for global patterns across the entire dataset.

  • Noise can pull the learned weights in unhelpful directions (overfitting to random fluctuations).
  • In high-dimensional spaces, unless many of the features are actually relevant, the model can get overwhelmed (too many coefficients to estimate).
  • Logistic Regression doesn’t have built-in feature selection—it treats every feature as potentially meaningful.

  • Bottom Line:
    It performs best when most of the features carry signal and aren’t drowning in random noise or irrelevant dimensions.

3. Features are Independent of Each Other (Low Multicollinearity)

  • What this means:
    Multicollinearity happens when two or more input features are highly correlated (e.g., latitude and longitude for a small area).

  • Why it matters:

  • Logistic Regression struggles with highly correlated features because it becomes ambiguous which feature gets credit for a prediction.

  • The model can still make accurate predictions, but the coefficients become unstable — they might flip signs or get huge magnitudes.
  • It also hurts interpretability: you can’t trust that a large coefficient means a strong effect if multicollinearity exists.

  • Bottom Line:

Logistic Regression assumes that each feature brings independent information to the table. If not, regularization (like Ridge or Lasso) is needed to dampen instability.

4. Small to Medium-Sized Datasets (Thousands to Tens of Thousands of Rows)

  • What this means:
    Logistic Regression doesn’t need massive datasets to work effectively.

  • Why it matters:

  • With fewer samples, simpler models (like Logistic Regression) tend to generalize better.

  • More complex models (like deep trees or neural networks) can easily overfit small data.
  • Logistic Regression is computationally lightweight — you can train it in seconds, even with several thousand rows.

  • Bottom Line:
    It’s a go-to model when you don’t have big data, or when you need a quick baseline model.


Why Logistic Regression “Breaks” Outside These Conditions

Problem What Happens
Non-linear relationships Logistic regression fails to capture important interactions.
Too much noise The model overfits random patterns → poor generalization.
High multicollinearity Coefficients become unstable → unreliable interpretations.
Too many irrelevant features The model gets “distracted” → low accuracy.

Decision Trees — Data Characteristics Breakdown

1. Data Where Decision Rules are Clear

  • What this means:
    The data should be separable by simple if-then rules. Example: “If issue \= ‘Flood’ → send to Water Dept.”

  • Why it matters:

  • Decision Trees create hard splits in feature space.

  • They’re excellent when your data can be partitioned into distinct, meaningful groups by thresholds or categories.

  • Scenario Fit:
    “Are incidents near schools and between 7-9 AM likely to be flagged as high-priority?” → A decision tree captures this easily.

2. Mix of Categorical and Numerical Variables

  • What this means:
    The dataset has features like “Issue Type” (categorical) and “Time of Day” (numerical).

  • Why it matters:

  • Decision Trees can handle both types of data natively.

  • You don’t need to one-hot encode or scale features.
  • Categorical splits are done via subsets (“Is issue in {Crash, Hazard}?”), while numeric splits are thresholds.

  • Scenario Fit:
    Routing based on location names and incident times in the traffic dataset works naturally.

3. Low-Noise Data in Splits

  • What this means:
    For each decision split, the result should lead to purity (nodes with mostly one class).

  • Why it matters:

  • If data is noisy (e.g., lots of edge cases where patterns are inconsistent), trees will create deep, complex branches that overfit.

  • Clean, structured data leads to shallow, generalizable trees.

  • Scenario Fit:
    A decision tree can easily distinguish between "Active Incidents" and "Archived Incidents" if status codes are clean and consistent.

What Breaks Decision Trees:

Problem Impact
Noisy or inconsistent splits Tree becomes too deep, memorizes noise (overfitting).
Subtle feature interactions Tree struggles to model interactions unless the tree becomes huge.
Small data perturbations Can lead to different splits — unstable outcomes.

Clustering Algorithms — Deep Dive Breakdown

K-Means Clustering

Origin & Intuition:

  • One of the oldest and most popular clustering algorithms.
  • K-Means partitions data into K groups (clusters) by minimizing the distance between data points and their cluster’s centroid.
  • Unsupervised Learning: No labels required.

How it Works:

  1. Choose K (number of clusters).
  2. Randomly assign K initial centroids.
  3. Assign each data point to the nearest centroid.
  4. Recalculate the centroid of each cluster.
  5. Repeat steps 3–4 until convergence (no change).

Loves Data That Is:

  • Globular and well-separated clusters.

  • Numeric features where Euclidean distance is meaningful.

  • Low to moderate dimensionality.

  • Data without too many outliers.

Breaks When:

  • Clusters have irregular shapes (e.g., elongated or nested).

  • The number of clusters K is hard to define upfront.

  • Sensitive to outliers — they can drag centroids away.

  • Features are on different scales → distance measures get skewed.

Metaphor: Assigning Students to Dorm Rooms Based on Proximity

  • You have K dorm rooms (clusters) and want students to share rooms based on how “close” they are in interests.

  • You shuffle people around, aiming to minimize walking distance to their room’s “vibe center” (centroid).

  • If someone’s interests are unique (outlier), they can distort the whole arrangement.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Origin & Intuition

  • Developed in 1996, DBSCAN clusters data based on density rather than distance to centroids.

  • Identifies dense regions as clusters and treats low-density regions as noise (outliers).

How it Works:

  1. Define eps (radius for neighborhood) and minPts (minimum points for a dense region).

  2. For each point:

  3. If at least minPts are within eps, it’s a core point.

  4. If a point is within eps of a core point, it’s a border point.

  5. Points that aren’t close to any core point → noise.

  6. Expand clusters from core points until no more points can be added.

Loves Data That’s:

  • Clusters of arbitrary shape (not just globular).

  • Data with noise or outliers — it can filter them out naturally.

  • Spatial data or datasets where “closeness” is density-based.

  • You don’t need to specify the number of clusters upfront.

Breaks When:

  • Clusters have varying densities (DBSCAN struggles to adapt a single eps value).

  • High-dimensional data — distances become meaningless (curse of dimensionality).

  • Sensitive to choice of eps and minPts — poor parameters lead to over-clustering or missing clusters.

Metaphor: Finding Crowds at a Party

  • Imagine you’re at a large party.

  • DBSCAN walks around asking, “Are there at least minPts people within eps steps of me?”

  • If yes, that’s a crowd (cluster). If someone is standing alone far away, they’re considered an outlier.

  • Works for parties where crowds are irregular (e.g., people gathering in random nooks).

Hierarchical Clustering

Origin & Intuition:

  • A family of clustering methods that build a hierarchy of clusters.

  • Does not require specifying the number of clusters upfront.

  • Produces a dendrogram (tree diagram) showing how clusters merge or split.

Types:

  1. Agglomerative (Bottom-Up):

  2. Each data point starts as its own cluster.

  3. Iteratively merge the closest clusters until all points are in one cluster.

  4. Divisive (Top-Down):

  5. Start with all data points in one cluster.

  6. Recursively split into smaller clusters.

Linkage Methods (to define “closeness” between clusters):

  • Single Linkage (nearest neighbor)

  • Complete Linkage (furthest neighbor)

  • Average Linkage (mean distance)

Loves Data That’s:

  • Data where nested clusters or sub-groupings are meaningful.

  • Small to medium datasets (hierarchical clustering is computationally expensive).

  • Scenarios where you want to explore the cluster structure at multiple levels.

  • Data with both categorical and numerical features (when using appropriate distance metrics).

Breaks When:

  • Large datasets → becomes computationally infeasible.

  • Sensitive to noisy data and outliers (can cause early incorrect merges).

  • Once a merge/split is done, it’s irreversible — bad early decisions propagate up the hierarchy.

Metaphor: Organizing Books into Categories

  • Start by piling all books together.

  • First, split fiction from non-fiction.

  • Then within fiction, group by genre.

  • Within genres, group by author.

  • The dendrogram is like a taxonomy tree — zoom in/out to see clusters at different levels.

Summary Table: Clustering Algorithm Fit Explained

Algorithm Loves Data With Struggles With
K-Means Globular, well-separated clusters, no outliers Irregular shapes, noisy data, wrong choice of K
DBSCAN Arbitrary-shaped clusters, density-based separation Varying densities, high dimensions, sensitive parameters
Hierarchical Nested subgroupings, exploratory cluster discovery Large datasets, noise sensitivity, irreversible merges

Random Forest — Data Characteristics Breakdown

1. Large, Messy Datasets with Categorical & Numerical Features

  • What this means:
    You have lots of data with mixed types, and not all of it is clean.

  • Why it matters:

  • Random Forest’s ensemble approach averages out noise.

  • If individual trees overfit to quirks, the forest smooths out those quirks.

  • Works great with datasets where patterns exist, but not cleanly.

  • Scenario Fit:
    Predicting incident type from mixed categorical (issue) and numerical (response time) data where there’s occasional messiness.

2. High-Dimensional Feature Spaces

  • What this means:
    You have a lot of features (possibly redundant), but you don’t want to manually select the most important ones.

  • Why it matters:

  • Each tree in a Random Forest sees a random subset of features.

  • This randomness helps decorrelate trees and reduce overfitting, even when many features are weak predictors.

  • Scenario Fit:
    Use all available data — time, location, weather, incident type — and let the Random Forest figure out which features matter.

3. Data with Non-Linear Relationships

  • What this means:
    Patterns that are not linearly separable (e.g., interaction between time and location affects response urgency).

  • Why it matters:

  • Random Forest can learn complex patterns by combining diverse decision trees.

  • It can model feature interactions implicitly.

  • Scenario Fit:
    Identifying incident severity that depends on time + location + type interactions.

What Breaks Random Forest:

Problem Impact
Need for full explainability Forests are harder to interpret than single trees.
Huge datasets (very large N) Training/inference can be slower due to many trees.
Extremely sparse data Struggles without feature engineering to densify input signals.

k-Nearest Neighbors (k-NN) — Data Characteristics Breakdown

1. Small Datasets with Local Relationships

  • What this means:
    The dataset isn’t too large (a few thousand samples) and spatial or time proximity matters.

  • Why it matters:

  • k-NN doesn’t “learn” — it stores all data points and compares new inputs to them.

  • The assumption is “things close together behave similarly.”

  • Works best when proximity is meaningful.

  • Scenario Fit:
    Predicting incident category by looking at what typically happens on that specific street at that specific time.

2. Numeric Features where Distance Makes Sense

  • What this means:
    Features should be on a common scale, and the notion of distance should reflect similarity.

  • Why it matters:

  • If features are on vastly different scales (e.g., “Time in seconds” vs “Location ID”), distance calculations will be skewed.

  • k-NN assumes Euclidean (or other) distance captures similarity.

  • Scenario Fit:
    Time of day, response time, and geospatial coordinates are good candidates — they have meaningful distances.

3. Low-Dimensional, Noise-Free Data

  • What this means:
    The number of features should be relatively small, and those features should be relevant.

  • Why it matters:

  • In high-dimensional space, all points tend to look equidistant (curse of dimensionality).

  • Irrelevant/noisy features dilute the influence of the nearest neighbors.

  • Scenario Fit:
    Simplified datasets (e.g., “Location”, “Issue Type”, “Hour of Day”) where neighborhoods in feature space correspond to similar incidents.

What Breaks k-NN:

Problem Impact
Large datasets Prediction becomes slow (search over all training points).
High-dimensional feature space Nearest neighbors become meaningless (curse of dimensionality).
Irrelevant/Noisy Features Wrong neighbors get selected → poor predictions.