Why Do Metrics Matter?
Metrics are how you measure how well an AI model is performing.
Simple analogy:
- A student takes an exam: the grade is the success metric
- An AI model makes predictions: Accuracy, Precision, and Recall are the quality metrics
Without metrics, you can’t tell if your model is working! 📈
Core Metrics
Accuracy
The percentage of correct answers out of all predictions.
def calculate_accuracy(correct, total):
    """Accuracy as a percentage: correct / total * 100."""
    if total == 0:
        return 0.0
    return (correct / total) * 100

# Example
correct = 85  # 85 correct predictions
total = 100   # out of 100 total
accuracy = calculate_accuracy(correct, total)
print(f"Accuracy: {accuracy:.0f}%")  # Accuracy: 85%
When to use:
- ✅ Classes are balanced (50/50)
- ❌ Classes are imbalanced (95/5): always predicting the majority class already scores 95%
Confusion Matrix
The foundation of all classification metrics!
| | Reality: Positive | Reality: Negative |
|---|---|---|
| Model: Positive | TP (True Positive) | FP (False Positive) |
| Model: Negative | FN (False Negative) | TN (True Negative) |
- TP (True Positive) — correctly identified positives
- TN (True Negative) — correctly identified negatives
- FP (False Positive) — false alarm
- FN (False Negative) — missed positives
Example: spam detector
def create_confusion_matrix(predictions, actuals):
    """Build a confusion matrix."""
    tp = sum(1 for p, a in zip(predictions, actuals)
             if p == 'spam' and a == 'spam')
    tn = sum(1 for p, a in zip(predictions, actuals)
             if p == 'not_spam' and a == 'not_spam')
    fp = sum(1 for p, a in zip(predictions, actuals)
             if p == 'spam' and a == 'not_spam')
    fn = sum(1 for p, a in zip(predictions, actuals)
             if p == 'not_spam' and a == 'spam')
    return {
        "TP": tp,  # Spam correctly identified
        "TN": tn,  # Non-spam correctly identified
        "FP": fp,  # Legit email flagged as spam
        "FN": fn   # Spam that got through
    }

# Test
predictions = ['spam', 'not_spam', 'spam', 'spam', 'not_spam']
actuals = ['spam', 'not_spam', 'not_spam', 'spam', 'not_spam']
matrix = create_confusion_matrix(predictions, actuals)
print(matrix)
# {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 0}
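The four counts can also be collected in a single pass over the data. The sketch below is one possible alternative rather than the article's own method; the helper name `confusion_counts` and its `positive` parameter are assumptions made for illustration, and any label other than `positive` is treated as negative.

```python
from collections import Counter

def confusion_counts(predictions, actuals, positive="spam"):
    """Count TP, TN, FP, FN in one pass over the paired labels."""
    counts = Counter()
    for pred, actual in zip(predictions, actuals):
        if pred == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    # Return all four keys, even when a count is zero
    return {key: counts[key] for key in ("TP", "TN", "FP", "FN")}

predictions = ['spam', 'not_spam', 'spam', 'spam', 'not_spam']
actuals = ['spam', 'not_spam', 'not_spam', 'spam', 'not_spam']
print(confusion_counts(predictions, actuals))
# {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 0}
```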
Precision
Of all “positive” predictions, how many were actually correct?
def calculate_precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    if tp + fp == 0:
        return 0
    return tp / (tp + fp)

# Example
tp = 80  # 80 correctly identified spam emails
fp = 20  # 20 legit emails flagged as spam
precision = calculate_precision(tp, fp)
print(f"Precision: {precision:.2%}")  # Precision: 80.00%
When does Precision matter?
- 🚨 Medical diagnosis — false positives are costly
- 📧 Spam filter — avoid deleting legitimate emails
- ⚖️ Legal decisions — false accusations are expensive
Precision answers: Can we trust positive predictions?
Recall
Of all real positives, how many did we find?
def calculate_recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    if tp + fn == 0:
        return 0
    return tp / (tp + fn)

# Example
tp = 80  # 80 spam emails found
fn = 20  # 20 spam emails missed
recall = calculate_recall(tp, fn)
print(f"Recall: {recall:.2%}")  # Recall: 80.00%
When does Recall matter?
- 🔍 Disease screening — can’t miss a diagnosis
- 🛡️ Fraud detection — need to catch every case
- 🔐 Security — better safe than sorry
Recall answers: Are we finding all positive cases?
Precision vs Recall
The Trade-off
def demonstrate_tradeoff():
    """Illustrate the Precision/Recall trade-off."""
    # Strict model (few FP, but many FN)
    strict_model = {
        "TP": 60, "FP": 5, "FN": 40, "TN": 95,
        "precision": 0.92,  # 60 / (60 + 5), high
        "recall": 0.60      # 60 / (60 + 40), low
    }
    # Lenient model (few FN, but many FP)
    lenient_model = {
        "TP": 95, "FP": 30, "FN": 5, "TN": 70,
        "precision": 0.76,  # 95 / (95 + 30), low
        "recall": 0.95      # 95 / (95 + 5), high
    }
    return strict_model, lenient_model

strict, lenient = demonstrate_tradeoff()
print("Strict model:")
print(f" Precision: {strict['precision']:.2%}")
print(f" Recall: {strict['recall']:.2%}")
print("\nLenient model:")
print(f" Precision: {lenient['precision']:.2%}")
print(f" Recall: {lenient['recall']:.2%}")
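In practice you can rarely maximize both at the same time: a stricter model gains Precision but loses Recall, and a more lenient one does the opposite.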
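The trade-off becomes visible when one set of model scores is cut at different thresholds. Below is a minimal sketch; the function `precision_recall_at` and the scores and labels are made up for illustration and do not come from a real model.

```python
def precision_recall_at(scores, actuals, threshold):
    """Return (precision, recall) when scores >= threshold count as positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, a in zip(preds, actuals) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(preds, actuals) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(preds, actuals) if p == 0 and a == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical scores and labels, for illustration only
scores = [0.95, 0.85, 0.70, 0.60, 0.45, 0.30, 0.20, 0.10]
actuals = [1, 1, 0, 1, 1, 0, 0, 0]

for threshold in (0.25, 0.5, 0.75):
    p, r = precision_recall_at(scores, actuals, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.25: precision=0.67, recall=1.00
# threshold=0.5: precision=0.75, recall=0.75
# threshold=0.75: precision=1.00, recall=0.50
```

On this toy data, raising the threshold moves Precision from 0.67 to 1.00 while Recall falls from 1.00 to 0.50.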
F1-Score (Harmonic Mean)
A balance between Precision and Recall.
def calculate_f1_score(precision, recall):
    """F1 = 2 * (Precision * Recall) / (Precision + Recall)."""
    if precision + recall == 0:
        return 0
    return 2 * (precision * recall) / (precision + recall)

# Example
precision = 0.80  # 80%
recall = 0.75     # 75%
f1 = calculate_f1_score(precision, recall)
print(f"F1-Score: {f1:.2%}")  # 77.42%
When to use F1?
- ✅ You need a balance between Precision and Recall
- ✅ Classes are imbalanced
- ✅ Both error types (FP and FN) matter
F1 = 1: perfect (Precision = Recall = 100%)
F1 = 0: the model finds no true positives (TP = 0)
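Why a harmonic mean rather than a plain average? The harmonic mean is pulled toward the smaller of the two values, so a model cannot hide poor Recall behind perfect Precision. A small sketch, with made-up numbers and helper names chosen for illustration:

```python
def arithmetic_mean(precision, recall):
    """Plain average of the two values."""
    return (precision + recall) / 2

def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A model that flags almost nothing: very precise, but it misses most positives
precision, recall = 1.0, 0.1
print(f"Arithmetic mean: {arithmetic_mean(precision, recall):.2f}")  # 0.55
print(f"F1 (harmonic):   {f1_score(precision, recall):.2f}")         # 0.18
```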
Full Example: Model Evaluation
class ModelEvaluator:
    """Class for evaluating a binary (0/1) AI model."""

    def __init__(self):
        self.tp = 0
        self.tn = 0
        self.fp = 0
        self.fn = 0

    def evaluate(self, predictions, actuals):
        """Add a batch of predictions to the running counts."""
        for pred, actual in zip(predictions, actuals):
            if pred == 1 and actual == 1:
                self.tp += 1
            elif pred == 0 and actual == 0:
                self.tn += 1
            elif pred == 1 and actual == 0:
                self.fp += 1
            elif pred == 0 and actual == 1:
                self.fn += 1
            else:
                # Reject unexpected labels instead of silently counting them as FN
                raise ValueError(f"Labels must be 0 or 1, got {pred!r} and {actual!r}")

    def get_accuracy(self):
        """Accuracy = (TP + TN) / total."""
        total = self.tp + self.tn + self.fp + self.fn
        if total == 0:
            return 0
        return (self.tp + self.tn) / total

    def get_precision(self):
        """Precision = TP / (TP + FP)."""
        if self.tp + self.fp == 0:
            return 0
        return self.tp / (self.tp + self.fp)

    def get_recall(self):
        """Recall = TP / (TP + FN)."""
        if self.tp + self.fn == 0:
            return 0
        return self.tp / (self.tp + self.fn)

    def get_f1_score(self):
        """F1-Score: harmonic mean of Precision and Recall."""
        precision = self.get_precision()
        recall = self.get_recall()
        if precision + recall == 0:
            return 0
        return 2 * (precision * recall) / (precision + recall)

    def get_report(self):
        """Full report."""
        return {
            "confusion_matrix": {
                "TP": self.tp,
                "TN": self.tn,
                "FP": self.fp,
                "FN": self.fn
            },
            "metrics": {
                "accuracy": f"{self.get_accuracy():.2%}",
                "precision": f"{self.get_precision():.2%}",
                "recall": f"{self.get_recall():.2%}",
                "f1_score": f"{self.get_f1_score():.2%}"
            }
        }
# Usage
evaluator = ModelEvaluator()

# Test data
predictions = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
actuals = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
evaluator.evaluate(predictions, actuals)

# Report
report = evaluator.get_report()
print("📊 Model report:")
print("\nConfusion matrix:")
for key, value in report["confusion_matrix"].items():
    print(f" {key}: {value}")  # TP: 4, TN: 3, FP: 2, FN: 1
print("\nMetrics:")
for key, value in report["metrics"].items():
    print(f" {key}: {value}")
# accuracy: 70.00%, precision: 66.67%, recall: 80.00%, f1_score: 72.73%
Additional Metrics
Specificity
Of all negatives, how many did we correctly identify?
def calculate_specificity(tn, fp):
    """Specificity = TN / (TN + FP)."""
    if tn + fp == 0:
        return 0
    return tn / (tn + fp)

# Example: non-spam correctly identified
tn = 90  # 90 legit emails correctly classified
fp = 10  # 10 legit emails flagged as spam
specificity = calculate_specificity(tn, fp)
print(f"Specificity: {specificity:.2%}")  # Specificity: 90.00%
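Specificity is the complement of the False Positive Rate (FPR), the quantity plotted on the horizontal axis of a ROC curve: both are fractions of the same TN + FP negatives, so they always sum to 1. A short sketch with illustrative helper names:

```python
def specificity(tn, fp):
    """Share of actual negatives that were correctly rejected."""
    return tn / (tn + fp) if tn + fp else 0.0

def false_positive_rate(tn, fp):
    """Share of actual negatives that were wrongly flagged as positive."""
    return fp / (tn + fp) if tn + fp else 0.0

tn, fp = 90, 10
print(f"Specificity: {specificity(tn, fp):.2%}")          # 90.00%
print(f"FPR:         {false_positive_rate(tn, fp):.2%}")  # 10.00%
```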
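ROC AUC (Area Under the Curve)
How well does the model rank positives above negatives across all thresholds?
def calculate_roc_auc_simple(tpr, fpr):
    """Rough AUC estimate from a single (FPR, TPR) operating point.

    This is the area under the ROC curve drawn through (0, 0), (fpr, tpr)
    and (1, 1). A real ROC AUC sweeps every threshold over the model's scores.
    """
    # True Positive Rate vs False Positive Rate
    auc = (1 - fpr + tpr) / 2
    return auc

tpr = 0.85  # True Positive Rate (Recall)
fpr = 0.10  # False Positive Rate (1 - Specificity)
auc = calculate_roc_auc_simple(tpr, fpr)
print(f"AUC: {auc:.2%}")  # AUC: 87.50%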
AUC = 1.0 — perfect model
AUC = 0.5 — random guessing
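A fuller calculation works from the model's scores rather than from a single operating point. One standard way to read AUC is as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative. The sketch below computes exactly that by comparing every positive/negative pair; the function name `roc_auc_from_scores` and the data are assumptions for illustration, and the pairwise loop is only practical for small datasets.

```python
def roc_auc_from_scores(scores, actuals):
    """AUC as the share of positive/negative pairs ranked correctly.

    A tie between a positive and a negative score counts as half a win.
    """
    positives = [s for s, a in zip(scores, actuals) if a == 1]
    negatives = [s for s, a in zip(scores, actuals) if a == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical scores and labels, for illustration only
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
actuals = [1, 1, 0, 1, 0, 0]
print(f"AUC: {roc_auc_from_scores(scores, actuals):.3f}")  # AUC: 0.889
```

Here 8 of the 9 positive/negative pairs are ordered correctly, which gives 8/9 ≈ 0.889.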
Choosing the Right Metric
Decision table:
| Task | Primary metric | Why |
|---|---|---|
| Spam filter | Precision | Avoid deleting legit emails |
| Disease detection | Recall | Can’t miss a diagnosis |
| Product recommendations | F1-Score | Balance precision and coverage |
| Face recognition | Accuracy | Classes are balanced |
| Fraud detection | Recall | Catch every case |
Code for metric selection:
def recommend_metric(task_type):
    """Recommend a metric for a given task."""
    recommendations = {
        "spam_filter": {
            "primary": "Precision",
            "reason": "Can't delete legit emails"
        },
        "disease_detection": {
            "primary": "Recall",
            "reason": "Can't miss a disease"
        },
        "fraud_detection": {
            "primary": "Recall",
            "reason": "Catch all fraud cases"
        },
        "balanced_classification": {
            "primary": "F1-Score",
            "reason": "Balance Precision and Recall"
        }
    }
    return recommendations.get(task_type, {"primary": "Accuracy", "reason": "Default"})

# Examples
print(recommend_metric("spam_filter"))
print(recommend_metric("disease_detection"))
Improving Metrics
Ways to boost quality:
1. More data
def improve_with_data(current_f1, data_increase_percent):
    """Toy estimate of the F1 gain from adding more data.

    Illustrative only: real gains depend on the data and the model.
    """
    # Simplified formula: +0.001 F1 per 1% more data, capped at 0.99
    improvement = data_increase_percent * 0.001
    new_f1 = min(0.99, current_f1 + improvement)
    return new_f1

f1 = 0.75
new_f1 = improve_with_data(f1, 50)  # +50% data
print(f"F1: {f1:.2%} → {new_f1:.2%}")  # F1: 75.00% → 80.00%
2. Class balancing
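import random

def balance_classes(dataset, seed=42):
    """Equalize class sizes by randomly undersampling the larger class."""
    positive = [d for d in dataset if d["label"] == 1]
    negative = [d for d in dataset if d["label"] == 0]
    # Use the smaller class size
    min_size = min(len(positive), len(negative))
    # Sample at random so the kept examples do not depend on dataset order
    rng = random.Random(seed)
    balanced = rng.sample(positive, min_size) + rng.sample(negative, min_size)
    return balanced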
3. Threshold tuning
def adjust_threshold(scores, actuals, target_precision=0.90):
    """Find the lowest candidate threshold that reaches the target precision."""
    thresholds = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    for threshold in thresholds:
        predictions = [1 if score >= threshold else 0 for score in scores]
        tp = sum(1 for p, a in zip(predictions, actuals) if p == 1 and a == 1)
        fp = sum(1 for p, a in zip(predictions, actuals) if p == 1 and a == 0)
        precision = tp / (tp + fp) if tp + fp > 0 else 0
        if precision >= target_precision:
            return threshold
    return 0.5  # Fallback when no candidate reaches the target

# Example
scores = [0.9, 0.7, 0.4, 0.8, 0.3]
actuals = [1, 1, 0, 1, 0]
best_threshold = adjust_threshold(scores, actuals, target_precision=0.9)
print(f"Best threshold: {best_threshold}")  # Best threshold: 0.5
Common Mistakes
❌ Mistake 1: Using only Accuracy
# BAD: imbalanced classes (95% negative, 5% positive)
# A model that always predicts "negative" → Accuracy = 95%!
# But Recall = 0% (it never finds any positives)
# ✅ GOOD: look at F1-Score
evaluator = ModelEvaluator()
# ... evaluate ...
print(f"F1-Score: {evaluator.get_f1_score():.2%}")
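The accuracy trap is easy to reproduce. The sketch below scores a "model" that always predicts the negative class on made-up 95/5 data; the helper `classification_summary` is an assumption introduced for this illustration.

```python
def classification_summary(predictions, actuals):
    """Return accuracy, recall and F1 for binary 0/1 labels."""
    tp = sum(1 for p, a in zip(predictions, actuals) if p == 1 and a == 1)
    tn = sum(1 for p, a in zip(predictions, actuals) if p == 0 and a == 0)
    fp = sum(1 for p, a in zip(predictions, actuals) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predictions, actuals) if p == 0 and a == 1)
    accuracy = (tp + tn) / len(actuals) if actuals else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "recall": recall, "f1": f1}

# Made-up imbalanced data: 95 negatives, 5 positives
actuals = [0] * 95 + [1] * 5
always_negative = [0] * 100  # a "model" that never predicts the positive class

summary = classification_summary(always_negative, actuals)
print(f"Accuracy: {summary['accuracy']:.2%}")  # 95.00%
print(f"Recall:   {summary['recall']:.2%}")    # 0.00%
print(f"F1-Score: {summary['f1']:.2%}")        # 0.00%
```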
❌ Mistake 2: Ignoring context
# BAD: using the same metric for every task
metric = "Accuracy"

# ✅ GOOD: match the metric to the task
task = "medical"  # example value
if task == "medical":
    metric = "Recall"  # Recall matters!
elif task == "spam":
    metric = "Precision"  # Precision matters!
print(f"{task}: {metric}")  # medical: Recall
❌ Mistake 3: Not testing on new data
# BAD: testing on training data
train_model(train_data)
test_on_same_data(train_data)  # Inflated score: overfitting goes unnoticed
# ✅ GOOD: separate test set
train_model(train_data)
test_on_new_data(test_data) # Honest evaluation
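Holding out a test set takes only a few lines. Below is a minimal sketch of a shuffled split in plain Python; the function name `train_test_split`, its parameters, and the toy dataset are assumptions for illustration.

```python
import random

def train_test_split(dataset, test_fraction=0.2, seed=42):
    """Shuffle a copy of the dataset and split off a held-out test set."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    items = list(dataset)
    random.Random(seed).shuffle(items)  # fixed seed makes the split repeatable
    test_size = max(1, int(len(items) * test_fraction))
    return items[test_size:], items[:test_size]

# Hypothetical dataset of 10 labeled examples
dataset = [{"text": f"email {i}", "label": i % 2} for i in range(10)]
train_data, test_data = train_test_split(dataset)
print(len(train_data), len(test_data))  # 8 2
```

The model is trained only on `train_data`, and metrics are reported only on `test_data`, which it has never seen.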
Summary
Core metrics:
metrics = {
    "Accuracy": "(TP + TN) / (TP + TN + FP + FN)",
    "Precision": "TP / (TP + FP)",
    "Recall": "TP / (TP + FN)",
    "F1-Score": "2 * P * R / (P + R)"
}
Confusion Matrix:
| | Predicted: Pos | Predicted: Neg |
|---|---|---|
| Actual: Pos | TP | FN |
| Actual: Neg | FP | TN |
When to use each:
- Accuracy — balanced classes
- Precision — avoid FP (false alarms)
- Recall — avoid FN (missed cases)
- F1-Score — balance Precision and Recall
What’s Next?
Now you know AI model metrics! 🎉
Next topics:
- Monetization — API pricing, subscriptions
- Investment — funding rounds
- Competition — market analysis
Build a model with a high F1-Score! 📊🚀