AI/ML Basics — Supervised vs Unsupervised Learning (Simple Guide + Code)

1) What is Machine Learning?

Machine Learning (ML) helps computers learn patterns from data so they can:

  • predict outcomes (e.g., house price)

  • classify things (e.g., spam vs not spam)

  • group similar items (e.g., customer segments)


2) Supervised vs Unsupervised Learning

✅ Supervised Learning (Labeled Data)

What

You train a model using:

  • input features X

  • known output labels/targets y

Example:

  • X = [size, bedrooms]

  • y = house_price

Goal

Learn a mapping:

X → y

Common problems

  • Regression: predict a number (price, demand, temperature)

  • Classification: predict a category (spam/ham, fraud/not fraud)
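
A minimal sketch of this labeled-data setup (the numbers are invented for illustration; it only needs the scikit-learn install from section 7):

from sklearn.linear_model import LinearRegression
import numpy as np

# Labeled data: features X = [size, bedrooms], targets y = house_price (thousands)
X = np.array([[1000, 2], [1500, 3], [2000, 4]])
y = np.array([200, 300, 400])

model = LinearRegression()
model.fit(X, y)                        # learn the mapping X -> y
print(model.predict([[1200, 2]]))      # predict the price of an unseen house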


✅ Unsupervised Learning (Unlabeled Data)

What

You only have X, but no labels y.

Example:

  • customer data: spending, visits, age
    (no “segment label” provided)

Goal

Discover structure:

  • clusters (groups)

  • similarity

  • hidden patterns

Common problems

  • Clustering (K-Means, Hierarchical)

  • Dimensionality reduction (PCA)
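
PCA is listed here but does not get its own code section below, so here is a minimal sketch (it reuses the Iris dataset from section 3.2; nothing beyond scikit-learn is assumed):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data                  # 150 samples, 4 features, no labels used
pca = PCA(n_components=2)             # keep the 2 directions of highest variance
X_2d = pca.fit_transform(X)           # project the 4-D data down to 2-D

print("Original shape:", X.shape)     # (150, 4)
print("Reduced shape:", X_2d.shape)   # (150, 2)
print("Explained variance ratio:", pca.explained_variance_ratio_)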


3) Supervised Learning Algorithms (with Simple Code)

3.1 Linear Regression (Regression)

Use case

Predict a continuous value:

  • house price

  • sales forecast

Code (Simple)

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import numpy as np

# Sample data: X = [area], y = price (in thousands)
X = np.array([[500], [800], [1000], [1200], [1500], [1800]])
y = np.array([150, 220, 280, 330, 400, 480])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = LinearRegression()
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("Predictions:", pred)
print("MSE:", mean_squared_error(y_test, pred))
print("Slope (m):", model.coef_[0], "Intercept (b):", model.intercept_)

3.2 Logistic Regression (Classification)

Use case

Predict a category:

  • spam vs not spam

  • pass/fail

  • fraud/not fraud

Code (Iris dataset)

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

iris = load_iris()
X = iris.data
y = (iris.target == 0).astype(int)  # binary: setosa (1) vs others (0)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))
print(classification_report(y_test, pred))

3.3 Random Forest (Classification + Regression)

What

Random Forest is an ensemble of many decision trees, each trained on a random subset of the rows and features.
Averaging their predictions reduces overfitting compared with a single tree, and it works well in practice on tabular data.

A) Random Forest Classifier

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, pred))

B) Random Forest Regressor

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import numpy as np

X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([3, 5, 7, 9, 11, 13])  # y = 2x + 1

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

pred = model.predict(X_test)
print("Predictions:", pred)
print("MAE:", mean_absolute_error(y_test, pred))

4) Unsupervised Learning Algorithms (with Simple Code)

4.1 K-Means Clustering

What

K-Means groups points into K clusters by assigning each point to the nearest cluster center, then repeatedly moving each center to the mean of its assigned points.

Use cases

  • customer segmentation

  • grouping similar products

  • anomaly detection (rough)

Code

from sklearn.cluster import KMeans
import numpy as np

# Example: customer data (spend, visits)
X = np.array([
    [100, 1], [120, 2], [130, 2],   # group 1
    [700, 8], [650, 7], [800, 9],   # group 2
    [300, 4], [320, 4], [280, 3],   # group 3
])

# n_init set explicitly: its default changed across scikit-learn versions
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Cluster labels:", labels)
print("Centers:", kmeans.cluster_centers_)

Interpretation

  • Each row gets a cluster label (0/1/2)

  • Points with the same label belong to the same group
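
A fitted K-Means model can also assign new points to the learned groups via predict. A minimal, self-contained sketch (the customer data condenses the example above; the new customer's numbers are made up):

from sklearn.cluster import KMeans
import numpy as np

# Condensed version of the customer data above
X = np.array([[100, 1], [120, 2], [700, 8], [650, 7], [300, 4], [320, 4]])
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

new_customer = np.array([[710, 8]])   # [spend, visits] for an unseen customer
print("Assigned cluster:", kmeans.predict(new_customer))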


4.2 Hierarchical Clustering (Agglomerative)

What

Builds clusters by progressively merging the closest groups:

  • start with each point as its own cluster

  • repeatedly merge the two closest clusters until the desired count remains

Use cases

  • when you want a “cluster tree” (dendrogram concept)

  • small/medium datasets

Code

from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([
    [1, 1], [2, 1], [2, 2],   # one tight group
    [8, 8], [9, 8], [8, 9],   # another tight group
])

model = AgglomerativeClustering(n_clusters=2, linkage="ward")
labels = model.fit_predict(X)
print("Cluster labels:", labels)

Note

  • "ward" works best with Euclidean distance

  • linkage options: ward, complete, average, single
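
To see the linkage options side by side, a quick sketch that fits the same toy data with each one (on groups this well separated they all find the same two clusters; the differences show up on noisier data):

from sklearn.cluster import AgglomerativeClustering
import numpy as np

X = np.array([[1, 1], [2, 1], [2, 2], [8, 8], [9, 8], [8, 9]])

# Fit the same data with each linkage strategy and compare the labels
for linkage in ["ward", "complete", "average", "single"]:
    labels = AgglomerativeClustering(n_clusters=2, linkage=linkage).fit_predict(X)
    print(f"{linkage:>8}: {labels}")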


5) When to Use Which Algorithm? (Simple Decision)

Supervised

✅ Linear Regression → numeric prediction, linear relationship
✅ Logistic Regression → simple classification, interpretable
✅ Random Forest → strong baseline for most tabular problems

Unsupervised

✅ K-Means → fast clustering when you know K
✅ Hierarchical → good when you want the cluster-tree structure and the dataset is small enough to afford it


6) Interview-Friendly Summary (One Paragraph)

Supervised learning uses labeled data (X, y) to learn a mapping and is used for regression and classification (e.g., Linear Regression, Logistic Regression, Random Forest). Unsupervised learning uses only features X to find hidden patterns, mainly clustering (e.g., K-Means, Hierarchical). Linear regression predicts numbers, logistic regression predicts classes, random forests provide robust performance by combining many trees, and clustering algorithms group similar points without labels.


7) Quick Setup (Run These Examples)

pip install scikit-learn numpy
