AGI Strategy - Day 7
Day 7 Detailed Plan: Baseline Analysis
Context from prior days: You have a complete, verified baseline results file from Day 6. Today is the first pure analysis day — no generation, no pipeline work. The goal is to characterise your baseline thoroughly enough that you have a clear, documented picture of model behaviour without any safety mitigation. This becomes the reference point for every comparison in Weeks 2 and 3, so the quality of this analysis directly affects the quality of your final writeup.
Session Structure (2 hours)
Block 1 — Set up your analysis environment (15 min)
Create a dedicated analysis script or notebook for Day 7. A Jupyter notebook is reasonable here given the exploratory nature of the work and the need to produce figures — just ensure you also maintain a clean .py version for reproducibility.
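If you go the notebook route, one low-effort way to keep a script copy in step (assuming the notebook is named day7_analysis.ipynb — substitute your actual filename) is to export it at the end of each session:
jupyter nbconvert --to script day7_analysis.ipynb --output day7_analysis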
Install any missing dependencies:
pip install matplotlib seaborn pandas scipy
Load your baseline data into a pandas DataFrame at the outset, as it will make grouping and aggregation significantly more concise than working with raw lists of dicts:
import json
import pandas as pd
with open("results/baseline_llama31_8b_2026-03-23.jsonl") as f:
    records = [json.loads(line) for line in f]
# Flatten toxicity_scores dict into top-level columns
for r in records:
    if r["toxicity_scores"]:
        for k, v in r["toxicity_scores"].items():
            r[k] = v
    else:
        for k in ["toxicity", "severe_toxicity", "obscene", "threat", "insult", "identity_attack"]:
            r[k] = None
df = pd.DataFrame(records)
df_scored = df.dropna(subset=["toxicity"]).copy()
print(f"Total records: {len(df)}")
print(f"Scored records: {len(df_scored)}")
print(f"Null rate: {1 - len(df_scored)/len(df):.2%}")
Document the null rate immediately. If it exceeds 2%, note it prominently — it will need to be addressed in your methodology section.
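A small guard makes the 2% criterion explicit rather than leaving it to eyeballing — a minimal sketch reusing the variables defined above:
null_rate = 1 - len(df_scored) / len(df)
if null_rate > 0.02:
    # Flag for the methodology section rather than silently continuing
    print(f"WARNING: null rate {null_rate:.2%} exceeds 2%")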
Block 2 — Aggregate statistics (30-35 min)
Work through statistics at three levels: overall, per subscore, and per demographic group. Keep all outputs as variables rather than just printing them — you will want to reference them in your writeup.
Overall toxicity statistics
from config import TOXICITY_THRESHOLD
summary = df_scored["toxicity"].describe()
above_threshold = (df_scored["toxicity"] > TOXICITY_THRESHOLD).mean()
above_threshold_n = (df_scored["toxicity"] > TOXICITY_THRESHOLD).sum()
print(summary)
print(f"\nProportion above {TOXICITY_THRESHOLD}: {above_threshold:.3f} ({above_threshold_n}/{len(df_scored)})")
All subscores side by side
subscores = ["toxicity", "severe_toxicity", "obscene", "threat", "insult", "identity_attack"]
subscore_summary = df_scored[subscores].agg(["mean", "median", "std", "max"]).T
subscore_summary["above_threshold"] = (df_scored[subscores] > TOXICITY_THRESHOLD).mean()
print(subscore_summary.round(3))
This table is directly usable in your writeup. Look at the relationship between toxicity and identity_attack in particular — given ToxiGen’s focus, a high identity_attack rate relative to general toxicity would be a meaningful finding worth highlighting.
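One way to put a number on that relationship is to compare above-threshold rates and the rank correlation between the two dimensions — a rough sketch, assuming the same threshold is sensible for both:
ia_rate = (df_scored["identity_attack"] > TOXICITY_THRESHOLD).mean()
tox_rate = (df_scored["toxicity"] > TOXICITY_THRESHOLD).mean()
# Spearman rather than Pearson, since the scores are bounded and heavily skewed
corr = df_scored["toxicity"].corr(df_scored["identity_attack"], method="spearman")
print(f"Above-threshold rate — identity_attack: {ia_rate:.3f}, toxicity: {tox_rate:.3f}")
print(f"Spearman correlation (toxicity vs identity_attack): {corr:.3f}")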
Per demographic group
group_stats = df_scored.groupby("demographic_group")["toxicity"].agg(
n="count",
mean="mean",
median="median",
std="std",
above_threshold=lambda x: (x > TOXICITY_THRESHOLD).mean()
).round(3).sort_values("mean", ascending=False)
print(group_stats)
Sort by mean toxicity descending so the highest-risk groups are immediately visible. Note which groups sit at the extremes — this will directly inform your Day 13 extension analysis if you choose Option A.
Also check whether your sample is balanced across groups:
print(df_scored["demographic_group"].value_counts())
If any group has substantially fewer records than others due to the Day 2 sampling, note it — it affects the reliability of per-group comparisons and should be acknowledged in your writeup.
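A quick way to flag under-sampled groups (the 50%-of-median cutoff below is arbitrary — adjust it to whatever tolerance your Day 2 sampling design implies):
counts = df_scored["demographic_group"].value_counts()
cutoff = 0.5 * counts.median()
small_groups = counts[counts < cutoff]
if not small_groups.empty:
    print(f"Groups with fewer than {cutoff:.0f} scored records:")
    print(small_groups)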
Block 3 — Visualisations (35-40 min)
Produce four figures. Save each to figures/ with a clear filename. These are the publication-quality figures referenced in the Day 11 plan, so treat formatting as a first-class concern now rather than revisiting it later.
Figure 1: Overall toxicity score distribution
import matplotlib.pyplot as plt
import seaborn as sns
fig, ax = plt.subplots(figsize=(8, 4))
sns.histplot(df_scored["toxicity"], bins=40, ax=ax, color="steelblue", edgecolor="white")
ax.axvline(TOXICITY_THRESHOLD, color="red", linestyle="--", linewidth=1.2, label=f"Threshold ({TOXICITY_THRESHOLD})")
ax.set_xlabel("Toxicity Score")
ax.set_ylabel("Count")
ax.set_title("Distribution of Baseline Toxicity Scores")
ax.legend()
plt.tight_layout()
plt.savefig("figures/baseline_toxicity_distribution.png", dpi=150)
plt.show()
Note the shape of the distribution. A heavily right-skewed distribution (most scores near zero, long tail of high scores) is typical and means the mean will be influenced by a relatively small number of highly toxic outputs. This is relevant to your choice of statistical test on Day 11 — a non-parametric test like Wilcoxon may be more appropriate than a t-test if the distribution is far from normal.
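If you want a number to back up the visual impression, scipy (installed in Block 1) can quantify the skew — a sketch only; the actual test choice remains a Day 11 decision:
from scipy import stats
skew = stats.skew(df_scored["toxicity"])
# D'Agostino-Pearson normality test; with large n even mild departures give tiny p-values,
# so treat this as supporting evidence rather than a decision rule
stat, p = stats.normaltest(df_scored["toxicity"])
print(f"Skewness: {skew:.2f} | normality test p-value: {p:.2e}")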
Figure 2: Per-group mean toxicity with error bars
group_plot = df_scored.groupby("demographic_group")["toxicity"].agg(["mean", "sem"]).reset_index()
group_plot = group_plot.sort_values("mean", ascending=True)
fig, ax = plt.subplots(figsize=(8, 6))
ax.barh(group_plot["demographic_group"], group_plot["mean"],
xerr=group_plot["sem"], color="steelblue", ecolor="gray",
capsize=3, edgecolor="white")
ax.axvline(TOXICITY_THRESHOLD, color="red", linestyle="--", linewidth=1.2, label=f"Threshold ({TOXICITY_THRESHOLD})")
ax.set_xlabel("Mean Toxicity Score")
ax.set_title("Baseline Mean Toxicity by Demographic Group")
ax.legend()
plt.tight_layout()
plt.savefig("figures/baseline_toxicity_by_group.png", dpi=150)
plt.show()
Figure 3: Subscore comparison
subscore_means = df_scored[subscores].mean().sort_values(ascending=True)
fig, ax = plt.subplots(figsize=(7, 4))
subscore_means.plot(kind="barh", ax=ax, color="steelblue", edgecolor="white")
ax.set_xlabel("Mean Score")
ax.set_title("Baseline Mean Scores Across Toxicity Dimensions")
plt.tight_layout()
plt.savefig("figures/baseline_subscore_comparison.png", dpi=150)
plt.show()
Figure 4: Toxicity score distribution per group (box plot)
group_order = (df_scored.groupby("demographic_group")["toxicity"]
.mean().sort_values(ascending=False).index.tolist())
fig, ax = plt.subplots(figsize=(8, 6))
# seaborn's boxplot accepts an explicit group order; pandas' DataFrame.boxplot does not
sns.boxplot(data=df_scored, x="toxicity", y="demographic_group",
            order=group_order, color="steelblue", ax=ax)
ax.axvline(TOXICITY_THRESHOLD, color="red", linestyle="--", linewidth=1.2)
ax.set_xlabel("Toxicity Score")
ax.set_ylabel("Demographic Group")
ax.set_title("Baseline Toxicity Score Distribution by Group")
plt.tight_layout()
plt.savefig("figures/baseline_toxicity_boxplot_by_group.png", dpi=150)
plt.show()
The box plot is more informative than the bar chart for understanding spread and outliers within each group, and is worth including alongside it.
Block 4 — Qualitative review of high-toxicity outputs (20-25 min)
Identify and manually read through the highest-scoring completions:
top_toxic = df_scored.nlargest(20, "toxicity")[
["prompt_id", "demographic_group", "prompt", "completion", "toxicity", "identity_attack"]
]
for _, row in top_toxic.iterrows():
    print(f"ID: {row['prompt_id']} | Group: {row['demographic_group']}")
    print(f"Toxicity: {row['toxicity']:.3f} | Identity Attack: {row['identity_attack']:.3f}")
    print(f"Prompt: {row['prompt'][:100]}")
    print(f"Completion: {row['completion'][:200]}")
    print("---")
Also look at the bottom — completions that scored near zero despite seemingly toxic prompts:
low_toxic = df_scored.nsmallest(10, "toxicity")[
["prompt_id", "demographic_group", "prompt", "completion", "toxicity"]
]
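The same printing pattern used for the high scorers works here, for example:
for _, row in low_toxic.iterrows():
    print(f"ID: {row['prompt_id']} | Group: {row['demographic_group']} | Toxicity: {row['toxicity']:.3f}")
    print(f"Prompt: {row['prompt'][:100]}")
    print(f"Completion: {row['completion'][:200]}")
    print("---")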
For both sets, take notes on:
- What patterns make completions score high — is it explicit slurs, generalisations, or something more subtle?
- What causes low scores — is the model deflecting, giving neutral completions, or is the scorer potentially missing something?
- Whether any high-scoring completions seem miscalibrated (i.e. the text does not read as particularly toxic but scored high)
These observations feed directly into Days 12 and 14. Record them in notes/day7_observations.md with specific examples referenced by prompt_id so you can retrieve them later.
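If it helps, you can seed that file programmatically so the prompt_ids are captured verbatim and only the commentary is filled in by hand — a minimal sketch (the section headings are suggestions, not a required format):
import os
os.makedirs("notes", exist_ok=True)
with open("notes/day7_observations.md", "w") as f:
    f.write("# Day 7 qualitative observations\n\n## High-toxicity examples\n\n")
    for _, row in top_toxic.iterrows():
        f.write(f"- {row['prompt_id']} ({row['demographic_group']}, toxicity {row['toxicity']:.3f}): \n")
    f.write("\n## Near-zero examples\n\n")
    for _, row in low_toxic.iterrows():
        f.write(f"- {row['prompt_id']} ({row['demographic_group']}, toxicity {row['toxicity']:.3f}): \n")
    f.write("\n## Possible scorer miscalibrations\n\n- \n")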
Block 5 — Document and consolidate (15 min)
Compile the key numbers into a single summary structure that you can reference throughout the rest of the project:
baseline_summary = {
"model": df_scored["model"].iloc[0],
"n_total": len(df),
"n_scored": len(df_scored),
"null_rate": 1 - len(df_scored) / len(df),
"mean_toxicity": df_scored["toxicity"].mean(),
"median_toxicity": df_scored["toxicity"].median(),
"std_toxicity": df_scored["toxicity"].std(),
"proportion_above_threshold": above_threshold,
"threshold": TOXICITY_THRESHOLD,
"mean_identity_attack": df_scored["identity_attack"].mean(),
"n_demographic_groups": df_scored["demographic_group"].nunique(),
}
with open("results/baseline_summary.json", "w") as f:
json.dump(baseline_summary, f, indent=2)
print(json.dumps(baseline_summary, indent=2))
Saving this as a standalone JSON file means your Week 3 analysis scripts can load it directly rather than recomputing from the raw data.
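A later script can then pull the reference numbers in with a couple of lines — for instance (variable names here are illustrative):
import json
with open("results/baseline_summary.json") as f:
    baseline = json.load(f)
print(f"Baseline mean toxicity: {baseline['mean_toxicity']:.3f}")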
Deliverables Checklist
- figures/baseline_toxicity_distribution.png
- figures/baseline_toxicity_by_group.png
- figures/baseline_subscore_comparison.png
- figures/baseline_toxicity_boxplot_by_group.png
- results/baseline_summary.json with key aggregate statistics
- notes/day7_observations.md with qualitative findings and referenced examples
- Per-group statistics table saved or printed and recorded
Connection to the rest of the plan
Two things from today carry significant weight later. First, the per-group statistics will anchor your Day 13 extension analysis if you choose Option A — any groups that stand out today are the ones worth examining closely after mitigation. Second, the shape of the toxicity score distribution informs the statistical test you choose on Day 11. If the distribution is clearly non-normal — which is likely given the adversarial nature of ToxiGen prompts — note that now so the decision on Day 11 is already half-made. Wilcoxon signed-rank is the safer default over a paired t-test in that scenario.
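For reference, the paired comparison on Day 11 would look something like the sketch below, assuming both runs score the same prompts and the arrays are aligned by prompt_id (the variable names are hypothetical):
from scipy.stats import wilcoxon
# baseline_scores and mitigated_scores: per-prompt toxicity arrays in the same prompt order
stat, p = wilcoxon(baseline_scores, mitigated_scores)
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p:.3e}")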