AI/ML Basics — Supervised vs Unsupervised Learning (Simple Guide + Code)

1) What is Machine Learning?

Machine Learning (ML) helps computers learn patterns from data so they can:

  • predict outcomes (e.g., house price)

  • classify things (e.g., spam vs not spam)

  • group similar items (e.g., customer segments)


2) Supervised vs Unsupervised Learning

✅ Supervised Learning (Labeled Data)

What

You train a model using:

  • input features X

  • known output labels/targets y

Example:

  • X = [size, bedrooms]

  • y = house_price

Goal

Learn a mapping:

X → y

Common problems

  • Regression: predict a number (price, demand, temperature)

  • Classification: predict a category (spam/ham, fraud/not fraud)
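
A minimal sketch of this labeled-data setup (the numbers are invented for illustration; it only needs the scikit-learn install from section 7):

from sklearn.linear_model import LinearRegression
import numpy as np

# Labeled data: features X = [size, bedrooms], targets y = house_price (thousands)
X = np.array([[1000, 2], [1500, 3], [2000, 4]])
y = np.array([200, 300, 400])

model = LinearRegression()
model.fit(X, y)                        # learn the mapping X -> y
print(model.predict([[1200, 2]]))      # predict the price of an unseen house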


✅ Unsupervised Learning (Unlabeled Data)

What

You only have X, but no labels y.

Example:

  • customer data: spending, visits, age
    (no “segment label” provided)

Goal

Discover structure:

  • clusters (groups)

  • similarity

  • hidden patterns

Common problems

  • Clustering (K-Means, Hierarchical)

  • Dimensionality reduction (PCA)
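
PCA is listed here but does not get its own code section below, so here is a minimal sketch (it reuses the Iris dataset from section 3.2; nothing beyond scikit-learn is assumed):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 150 samples, 4 features, no labels used
pca = PCA(n_components=2)             # keep the 2 directions of highest variance
X_2d = pca.fit_transform(X)           # project the 4-D data down to 2-D

print("Original shape:", X.shape)     # (150, 4)
print("Reduced shape:", X_2d.shape)   # (150, 2)
print("Explained variance ratio:", pca.explained_variance_ratio_)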


3) Supervised Learning Algorithms (with Simple Code)

3.1 Linear Regression (Regression)

Use case

Predict a continuous value:

  • house price

  • sales forecast

Code (Simple)

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Sample data: X = [area], y = price (in thousands)
X = np.array([[500], [800], [1000], [1200], [1500], [1800]])
y = np.array([150, 220, 280, 330, 400, 480])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("Predictions:", pred)
print("MSE:", mean_squared_error(y_test, pred))
print("Slope (m):", model.coef_[0], "Intercept (b):", model.intercept_)

3.2 Logistic Regression (Classification)

Use case

Predict a category:

  • spam vs not spam

  • pass/fail

  • fraud/not fraud

Code (Iris dataset)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

iris = load_iris()
X = iris.data
y = (iris.target == 0).astype(int)  # binary: setosa (1) vs others (0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))

3.3 Random Forest (Classification + Regression)

What

Random Forest is an ensemble of many decision trees, each trained on a random subset of the rows and features.
Averaging their predictions reduces overfitting compared with a single tree, and it works well in practice on tabular data.

A) Random Forest Classifier

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))

B) Random Forest Regressor

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([3, 5, 7, 9, 11, 13])  # y = 2x + 1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("Predictions:", pred)
print("MAE:", mean_absolute_error(y_test, pred))

4) Unsupervised Learning Algorithms (with Simple Code)

4.1 K-Means Clustering

What

K-Means groups points into K clusters by assigning each point to the nearest cluster center, then repeatedly moving each center to the mean of its assigned points.

Use cases

  • customer segmentation

  • grouping similar products

  • anomaly detection (rough)

Code

from sklearn.cluster import KMeans
import numpy as np

# Example: customer data (spend, visits)
X = np.array([
    [100, 1], [120, 2], [130, 2],   # group 1
    [700, 8], [650, 7], [800, 9],   # group 2
    [300, 4], [320, 4], [280, 3],   # group 3
])

# n_init set explicitly: its default changed across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster labels:", labels)
print("Centers:", kmeans.cluster_centers_)

Interpretation

  • Each row gets a cluster label (0/1/2)

  • Points with the same label belong to the same group
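
A fitted K-Means model can also assign new points to the learned groups via predict. A minimal, self-contained sketch (the customer data condenses the example above; the new customer's numbers are made up):

from sklearn.cluster import KMeans
import numpy as np

# Condensed version of the customer data above
X = np.array([[100, 1], [120, 2], [700, 8], [650, 7], [300, 4], [320, 4]])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

new_customer = np.array([[710, 8]])   # [spend, visits] for an unseen customer
print("Assigned cluster:", kmeans.predict(new_customer))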


4.2 Hierarchical Clustering (Agglomerative)

What

Builds clusters by progressively merging the closest groups:

  • start with each point as its own cluster

  • repeatedly merge the two closest clusters until the desired count remains

Use cases

  • when you want a “cluster tree” (dendrogram concept)

  • small/medium datasets

Code

from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([
    [1, 1], [2, 1], [2, 2],   # one tight group
    [8, 8], [9, 8], [8, 9],   # another tight group
])

model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print("Cluster labels:", labels)

Note

  • "ward" works best with Euclidean distance

  • linkage options: ward, complete, average, single
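
To see the linkage options side by side, a quick sketch that fits the same toy data with each one (on groups this well separated they all find the same two clusters; the differences show up on noisier data):

from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 1], [2, 1], [2, 2], [8, 8], [9, 8], [8, 9]])

# Fit the same data with each linkage strategy and compare the labels
for linkage in ["ward", "complete", "average", "single"]:
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    print(f"{linkage:>8}: {labels}")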


5) When to Use Which Algorithm? (Simple Decision)

Supervised

✅ Linear Regression → numeric prediction, linear relationship
✅ Logistic Regression → simple classification, interpretable
✅ Random Forest → strong baseline for most tabular problems

Unsupervised

✅ K-Means → fast clustering when you know K
✅ Hierarchical → good when you want the cluster-tree structure and the dataset is small enough to afford it


6) Interview-Friendly Summary (One Paragraph)

Supervised learning uses labeled data (X, y) to learn a mapping and is used for regression and classification (e.g., Linear Regression, Logistic Regression, Random Forest). Unsupervised learning uses only features X to find hidden patterns, mainly clustering (e.g., K-Means, Hierarchical). Linear regression predicts numbers, logistic regression predicts classes, random forests provide robust performance by combining many trees, and clustering algorithms group similar points without labels.


7) Quick Setup (Run These Examples)

pip install scikit-learn numpy
