AI Model Metrics — Measure Your Success! 📊

Author: 04e5cc8b-58ac-4bdc-bdee-661bbb
Published: 03.04.2026
Reading time: 8 min
Level: Beginner

Why Do Metrics Matter?

Metrics are how you measure how well an AI model is performing.

Simple analogy:

A student takes an exam:
  Grade = success metric

An AI model makes predictions:
  Accuracy, Precision, Recall = quality metrics

Without metrics, you can’t tell if your model is working! 📈


Core Metrics

1. Accuracy

The percentage of correct answers out of all predictions.

def calculate_accuracy(correct, total):
    """Accuracy as a percentage: correct / total * 100."""
    if total == 0:
        return 0
    return (correct / total) * 100

# Example
correct = 85  # 85 correct
total = 100   # out of 100 total
accuracy = calculate_accuracy(correct, total)
print(f"Accuracy: {accuracy}%")  # Accuracy: 85.0%

When to use:
- ✅ Classes are balanced (50/50)
- ❌ Classes are imbalanced (95/5)
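
To see why imbalanced classes are a problem, here is a minimal sketch. The 95/5 dataset and the "always predict negative" model are made up for illustration; they are not taken from a real project.

```python
# Hypothetical imbalanced dataset: 95 negatives (0) and 5 positives (1)
actuals = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class
predictions = [0] * len(actuals)

correct = sum(1 for p, a in zip(predictions, actuals) if p == a)
accuracy = correct / len(actuals)

# Count how many of the real positives were actually found
found_positives = sum(1 for p, a in zip(predictions, actuals) if p == 1 and a == 1)

print(f"Accuracy: {accuracy:.2%}")            # Accuracy: 95.00%
print(f"Positives found: {found_positives}")  # Positives found: 0
```

The accuracy looks great, yet the model never finds a single positive case.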


Confusion Matrix

The foundation of all classification metrics!

                    Reality
                Positive  Negative
Model Positive    TP        FP       (True/False Positive)
Model Negative    FN        TN       (False/True Negative)

  • TP (True Positive) — correctly identified positives
  • TN (True Negative) — correctly identified negatives
  • FP (False Positive) — false alarm
  • FN (False Negative) — missed positives

Example: spam detector

def create_confusion_matrix(predictions, actuals):
    """Build a confusion matrix."""
    tp = sum(1 for p, a in zip(predictions, actuals)
             if p == 'spam' and a == 'spam')
    tn = sum(1 for p, a in zip(predictions, actuals)
             if p == 'not_spam' and a == 'not_spam')
    fp = sum(1 for p, a in zip(predictions, actuals)
             if p == 'spam' and a == 'not_spam')
    fn = sum(1 for p, a in zip(predictions, actuals)
             if p == 'not_spam' and a == 'spam')

    return {
        "TP": tp,  # Spam correctly identified
        "TN": tn,  # Non-spam correctly identified
        "FP": fp,  # Legit email flagged as spam
        "FN": fn   # Spam that got through
    }

# Test
predictions = ['spam', 'not_spam', 'spam', 'spam', 'not_spam']
actuals = ['spam', 'not_spam', 'not_spam', 'spam', 'not_spam']

matrix = create_confusion_matrix(predictions, actuals)
print(matrix)
# {'TP': 2, 'TN': 2, 'FP': 1, 'FN': 0}
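
Once you have the four counts, accuracy can be read straight off the matrix. A small sketch, assuming a helper named `accuracy_from_matrix` (a hypothetical name, not part of the original example) and the counts printed above:

```python
# Counts copied from the spam example output
matrix = {"TP": 2, "TN": 2, "FP": 1, "FN": 0}

def accuracy_from_matrix(m):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    total = m["TP"] + m["TN"] + m["FP"] + m["FN"]
    if total == 0:
        return 0.0
    return (m["TP"] + m["TN"]) / total

print(f"Accuracy: {accuracy_from_matrix(matrix):.2%}")  # Accuracy: 80.00%
```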

Precision

Of all “positive” predictions, how many were actually correct?

def calculate_precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    if tp + fp == 0:
        return 0
    return tp / (tp + fp)

# Example
tp = 80  # 80 correctly identified spam emails
fp = 20  # 20 legit emails flagged as spam

precision = calculate_precision(tp, fp)
print(f"Precision: {precision:.2%}")  # Precision: 80.00%

When does Precision matter?

  • 🚨 Medical diagnosis — false positives are costly
  • 📧 Spam filter — avoid deleting legitimate emails
  • ⚖️ Legal decisions — false accusations are expensive

Precision answers: Can we trust positive predictions?


Recall

Of all real positives, how many did we find?

def calculate_recall(tp, fn):
    """Recall = TP / (TP + FN)."""
    if tp + fn == 0:
        return 0
    return tp / (tp + fn)

# Example
tp = 80  # 80 spam emails found
fn = 20  # 20 spam emails missed

recall = calculate_recall(tp, fn)
print(f"Recall: {recall:.2%}")  # Recall: 80.00%

When does Recall matter?

  • 🔍 Disease screening — can’t miss a diagnosis
  • 🛡️ Fraud detection — need to catch every case
  • 🔐 Security — better safe than sorry

Recall answers: Are we finding all positive cases?
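
Precision and Recall can give very different answers for the same predictions. A quick sketch using the counts from the spam detector example (TP = 2, FP = 1, FN = 0):

```python
# Counts from the spam detector example:
# 2 spam emails caught, 1 legit email flagged, 0 spam emails missed
tp, fp, fn = 2, 1, 0

precision = tp / (tp + fp) if tp + fp > 0 else 0.0
recall = tp / (tp + fn) if tp + fn > 0 else 0.0

print(f"Precision: {precision:.2%}")  # Precision: 66.67%
print(f"Recall: {recall:.2%}")        # Recall: 100.00%
```

The detector caught every spam email (perfect Recall), but one in three of its "spam" verdicts was wrong (lower Precision).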


Precision vs Recall

The Trade-off

def demonstrate_tradeoff():
    """Illustrate the Precision/Recall trade-off."""

    # Strict model (few FP, but many FN)
    strict_model = {
        "TP": 60, "FP": 5, "FN": 40, "TN": 95,
        "precision": 0.92,  # High!
        "recall": 0.60      # Low!
    }

    # Lenient model (few FN, but many FP)
    lenient_model = {
        "TP": 95, "FP": 30, "FN": 5, "TN": 70,
        "precision": 0.76,  # Low!
        "recall": 0.95      # High!
    }

    return strict_model, lenient_model

strict, lenient = demonstrate_tradeoff()

print("Strict model:")
print(f"  Precision: {strict['precision']:.2%}")
print(f"  Recall: {strict['recall']:.2%}")

print("\nLenient model:")
print(f"  Precision: {lenient['precision']:.2%}")
print(f"  Recall: {lenient['recall']:.2%}")

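You usually can’t maximize both at the same time! For a given model, making it stricter raises Precision but lowers Recall, and making it more lenient does the opposite.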
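
The trade-off shows up as soon as you move the decision threshold on a model's scores. The sketch below uses made-up scores and labels, and `precision_recall_at` is a hypothetical helper name introduced only for this illustration.

```python
def precision_recall_at(scores, actuals, threshold):
    """Return (precision, recall) when scores >= threshold count as positive."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, a in zip(predictions, actuals) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predictions, actuals) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predictions, actuals) if p == 0 and a == 1)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    return precision, recall

# Made-up scores and labels, for illustration only
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25]
actuals = [1, 1, 1, 0, 1, 0, 1, 0]

for threshold in (0.3, 0.5, 0.7, 0.9):
    p, r = precision_recall_at(scores, actuals, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
# threshold=0.3: precision=0.71, recall=1.00
# threshold=0.5: precision=0.80, recall=0.80
# threshold=0.7: precision=1.00, recall=0.60
# threshold=0.9: precision=1.00, recall=0.20
```

On this toy data, a higher threshold means fewer false alarms but more missed positives. Real data is noisier, so the curve is rarely this smooth.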


F1-Score (Harmonic Mean)

A balance between Precision and Recall.

def calculate_f1_score(precision, recall):
    """F1 = 2 * (Precision * Recall) / (Precision + Recall)."""
    if precision + recall == 0:
        return 0
    return 2 * (precision * recall) / (precision + recall)

# Example
precision = 0.80  # 80%
recall = 0.75     # 75%

f1 = calculate_f1_score(precision, recall)
print(f"F1-Score: {f1:.2%}")  # 77.42%

When to use F1?

  • ✅ You need a balance between Precision and Recall
  • ✅ Classes are imbalanced
  • ✅ Both error types (FP and FN) matter

F1 = 1 — perfect (Precision = Recall = 100%)
F1 = 0 — model finds nothing
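
Why a harmonic mean and not a plain average? A short sketch with made-up numbers: a model with perfect Precision but very low Recall looks decent on an ordinary average, while F1 exposes the weak side.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Perfect precision, very low recall (illustrative values)
precision, recall = 1.00, 0.10

arithmetic_mean = (precision + recall) / 2
print(f"Arithmetic mean: {arithmetic_mean:.2%}")  # Arithmetic mean: 55.00%
print(f"F1-Score: {f1(precision, recall):.2%}")   # F1-Score: 18.18%
```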


Full Example: Model Evaluation

class ModelEvaluator:
    """Class for evaluating an AI model."""

    def __init__(self):
        self.tp = 0
        self.tn = 0
        self.fp = 0
        self.fn = 0

    def evaluate(self, predictions, actuals):
        """Evaluate predictions."""
        for pred, actual in zip(predictions, actuals):
            if pred == 1 and actual == 1:
                self.tp += 1
            elif pred == 0 and actual == 0:
                self.tn += 1
            elif pred == 1 and actual == 0:
                self.fp += 1
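            elif pred == 0 and actual == 1:
                self.fn += 1
            else:
                # Anything other than 0/1 would silently be counted as FN
                raise ValueError(f"Labels must be 0 or 1, got {pred!r} and {actual!r}")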

    def get_accuracy(self):
        """Accuracy."""
        total = self.tp + self.tn + self.fp + self.fn
        if total == 0:
            return 0
        return (self.tp + self.tn) / total

    def get_precision(self):
        """Precision."""
        if self.tp + self.fp == 0:
            return 0
        return self.tp / (self.tp + self.fp)

    def get_recall(self):
        """Recall."""
        if self.tp + self.fn == 0:
            return 0
        return self.tp / (self.tp + self.fn)

    def get_f1_score(self):
        """F1-Score."""
        precision = self.get_precision()
        recall = self.get_recall()

        if precision + recall == 0:
            return 0
        return 2 * (precision * recall) / (precision + recall)

    def get_report(self):
        """Full report."""
        return {
            "confusion_matrix": {
                "TP": self.tp,
                "TN": self.tn,
                "FP": self.fp,
                "FN": self.fn
            },
            "metrics": {
                "accuracy": f"{self.get_accuracy():.2%}",
                "precision": f"{self.get_precision():.2%}",
                "recall": f"{self.get_recall():.2%}",
                "f1_score": f"{self.get_f1_score():.2%}"
            }
        }

# Usage
evaluator = ModelEvaluator()

# Test data
predictions = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
actuals = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

evaluator.evaluate(predictions, actuals)

# Report
report = evaluator.get_report()
print("📊 Model report:")
print("\nConfusion matrix:")
for key, value in report["confusion_matrix"].items():
    print(f"  {key}: {value}")

print("\nMetrics:")
for key, value in report["metrics"].items():
    print(f"  {key}: {value}")

Additional Metrics

Specificity

Of all negatives, how many did we correctly identify?

def calculate_specificity(tn, fp):
    """Specificity = TN / (TN + FP)."""
    if tn + fp == 0:
        return 0
    return tn / (tn + fp)

# Example: non-spam correctly identified
tn = 90  # 90 legit emails correctly classified
fp = 10  # 10 legit emails flagged as spam

specificity = calculate_specificity(tn, fp)
print(f"Specificity: {specificity:.2%}")  # Specificity: 90.00%
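
Specificity is closely tied to the False Positive Rate used for ROC curves: the two always add up to 1. A small sketch with the same counts as the example above:

```python
import math

tn, fp = 90, 10  # same counts as the specificity example

specificity = tn / (tn + fp)
false_positive_rate = fp / (fp + tn)

print(f"Specificity: {specificity:.2%}")                  # Specificity: 90.00%
print(f"False Positive Rate: {false_positive_rate:.2%}")  # False Positive Rate: 10.00%

# The two sum to 1 (up to floating-point rounding)
assert math.isclose(specificity + false_positive_rate, 1.0)
```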

ROC AUC (Area Under the Curve)

def calculate_roc_auc_simple(tpr, fpr):
    """Single-point AUC approximation (a real ROC AUC sweeps every threshold)."""
    # Area under the line (0,0) -> (fpr, tpr) -> (1,1);
    # this equals balanced accuracy, (TPR + TNR) / 2
    auc = (1 - fpr + tpr) / 2
    return auc

tpr = 0.85  # True Positive Rate (Recall)
fpr = 0.10  # False Positive Rate

auc = calculate_roc_auc_simple(tpr, fpr)
print(f"AUC: {auc:.2%}")  # AUC: 87.50%

AUC = 1.0 — perfect model
AUC = 0.5 — random guessing
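
For a fuller picture than a single point, AUC can be computed from raw scores: it equals the share of (positive, negative) pairs where the positive gets the higher score. The sketch below is a simple pairwise version on made-up data; `roc_auc_pairwise` is a hypothetical helper name, and the double loop is fine for small lists but slow for large ones.

```python
def roc_auc_pairwise(scores, actuals):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count as half."""
    positives = [s for s, a in zip(scores, actuals) if a == 1]
    negatives = [s for s, a in zip(scores, actuals) if a == 0]
    if not positives or not negatives:
        return None  # AUC is undefined without both classes
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Made-up scores and labels
scores = [0.9, 0.8, 0.6, 0.4, 0.3]
actuals = [1, 0, 1, 1, 0]
print(f"AUC: {roc_auc_pairwise(scores, actuals):.2%}")  # AUC: 66.67%
```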


Choosing the Right Metric

Decision table:

Task                     Primary metric  Why
Spam filter              Precision       Avoid deleting legit emails
Disease detection        Recall          Can’t miss a diagnosis
Product recommendations  F1-Score        Balance precision and coverage
Face recognition         Accuracy        Classes are balanced
Fraud detection          Recall          Catch every case

Code for metric selection:

def recommend_metric(task_type):
    """Recommend a metric for a given task."""
    recommendations = {
        "spam_filter": {
            "primary": "Precision",
            "reason": "Can't delete legit emails"
        },
        "disease_detection": {
            "primary": "Recall",
            "reason": "Can't miss a disease"
        },
        "fraud_detection": {
            "primary": "Recall",
            "reason": "Catch all fraud cases"
        },
        "balanced_classification": {
            "primary": "F1-Score",
            "reason": "Balance Precision and Recall"
        }
    }

    return recommendations.get(task_type, {"primary": "Accuracy", "reason": "Default"})

# Examples
print(recommend_metric("spam_filter"))
print(recommend_metric("disease_detection"))

Improving Metrics

Ways to boost quality:

1. More data

def improve_with_data(current_f1, data_increase_percent):
    """Toy illustration only: assumes each +1% of data adds 0.001 to F1."""
    # Made-up linear rule; real gains are not linear and not guaranteed
    improvement = data_increase_percent * 0.001
    new_f1 = min(0.99, current_f1 + improvement)
    return new_f1

f1 = 0.75
new_f1 = improve_with_data(f1, 50)  # +50% data
print(f"F1: {f1:.2%} → {new_f1:.2%}")

2. Class balancing

def balance_classes(dataset):
    """Equalize class sizes by undersampling the larger class."""
    positive = [d for d in dataset if d["label"] == 1]
    negative = [d for d in dataset if d["label"] == 0]

    # Use the smaller class size
    min_size = min(len(positive), len(negative))

    balanced = positive[:min_size] + negative[:min_size]
    return balanced
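
The version above keeps the first `min_size` items of each class, which can bias the result if the dataset is sorted. A possible variation, sketched under the assumption that random undersampling is acceptable for your data (`balance_classes_random` and the `seed` parameter are hypothetical names):

```python
import random

def balance_classes_random(dataset, seed=42):
    """Undersample the larger class at random so both classes have equal size."""
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    positive = [d for d in dataset if d["label"] == 1]
    negative = [d for d in dataset if d["label"] == 0]
    min_size = min(len(positive), len(negative))
    balanced = rng.sample(positive, min_size) + rng.sample(negative, min_size)
    rng.shuffle(balanced)  # avoid all positives coming first
    return balanced

# Made-up dataset: 3 positives and 9 negatives
dataset = [{"id": i, "label": 1 if i < 3 else 0} for i in range(12)]
balanced = balance_classes_random(dataset)

print(len(balanced))                      # 6
print(sum(d["label"] for d in balanced))  # 3
```

Keep in mind that undersampling throws data away; balance only the training set and leave the test set with its real class proportions.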

3. Threshold tuning

def adjust_threshold(scores, actuals, target_precision=0.90):
    """Find the threshold for a target precision."""
    thresholds = [0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

    for threshold in thresholds:
        predictions = [1 if score >= threshold else 0 for score in scores]

        tp = sum(1 for p, a in zip(predictions, actuals) if p == 1 and a == 1)
        fp = sum(1 for p, a in zip(predictions, actuals) if p == 1 and a == 0)

        precision = tp / (tp + fp) if tp + fp > 0 else 0

        if precision >= target_precision:
            return threshold

    return 0.5  # Fallback: no threshold reached the target precision

# Example
scores = [0.9, 0.7, 0.4, 0.8, 0.3]
actuals = [1, 1, 0, 1, 0]
best_threshold = adjust_threshold(scores, actuals, target_precision=0.9)
print(f"Best threshold: {best_threshold}")

Common Mistakes

❌ Mistake 1: Using only Accuracy

# BAD: imbalanced classes (95% negative, 5% positive)
# A model that always predicts "negative" → Accuracy = 95%!
# But Recall = 0% (it never finds any positives)

# ✅ GOOD: look at F1-Score
evaluator = ModelEvaluator()
# ... evaluate ...
print(f"F1-Score: {evaluator.get_f1_score():.2%}")

❌ Mistake 2: Ignoring context

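# BAD: using the same metric for every task
metric = "accuracy"

# ✅ GOOD: match the metric to the task
task = "medical"  # example value
if task == "medical":
    metric = "recall"     # Recall matters!
elif task == "spam":
    metric = "precision"  # Precision matters!
else:
    metric = "f1_score"   # No clear priority: balance both
print(f"Metric for {task}: {metric}")  # Metric for medical: recall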

❌ Mistake 3: Not testing on new data

# BAD: testing on training data (train_model and test_on_* are placeholder names)
train_model(train_data)
test_on_same_data(train_data)  # Score is inflated and hides overfitting!

# ✅ GOOD: separate test set
train_model(train_data)
test_on_new_data(test_data)  # Honest evaluation
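
How do you get a separate test set in the first place? One common approach is to shuffle the data once and hold part of it back. A minimal sketch, assuming an 80/20 split; `train_test_split_simple`, `test_ratio`, and `seed` are hypothetical names used only here:

```python
import random

def train_test_split_simple(data, test_ratio=0.2, seed=42):
    """Shuffle a copy of the data and split it into train and test parts."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(data)      # copy, so the original order is untouched
    rng.shuffle(shuffled)
    test_size = int(len(shuffled) * test_ratio)
    return shuffled[test_size:], shuffled[:test_size]

data = list(range(10))  # stand-in for real labeled examples
train_data, test_data = train_test_split_simple(data)

print(len(train_data), len(test_data))   # 8 2
print(set(train_data) & set(test_data))  # set()
```

The empty intersection confirms that no example ends up in both sets.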

Summary

Core metrics:

metrics = {
    "Accuracy": "(TP + TN) / (TP + TN + FP + FN)",
    "Precision": "TP / (TP + FP)",
    "Recall": "TP / (TP + FN)",
    "F1-Score": "2 * P * R / (P + R)"
}

Confusion Matrix:

         Predicted
         Pos   Neg
Actual
Pos      TP    FN
Neg      FP    TN

When to use each:

  • Accuracy — balanced classes
  • Precision — avoid FP (false alarms)
  • Recall — avoid FN (missed cases)
  • F1-Score — balance Precision and Recall

What’s Next?

Now you know AI model metrics! 🎉

Next topics:
- Monetization — API pricing, subscriptions
- Investment — funding rounds
- Competition — market analysis

Build a model with a high F1-Score! 📊🚀
