AGI Strategy - Day 6
## Day 6 Detailed Plan: Full Baseline Evaluation
Context from prior days: Your pipeline is tested and working, your output schema is finalised, and you have a time-per-prompt estimate from Day 5. Today is primarily an execution day — the main task is running the pipeline over your full subset with no system prompt, producing the baseline results dataset that everything in Weeks 2 and 3 will be measured against. Most of the compute time will be unattended, so this plan accounts for how to use that time productively.
## Session Structure (3-4 hours, mostly compute)
### Block 1 — Pre-run checklist (20-30 min)
Before starting the full run, work through this systematically. Discovering a configuration error two hours into a run is significantly more costly than spending 20 minutes verifying upfront.
**Environment**

- Confirm Ollama is running: `curl http://localhost:11434/api/tags` should return your model list
- Confirm the model is loaded: send a single test prompt through `generate_completion` and verify a response
- Check available disk space — 1000 JSONL records with full text will be small, but confirm there is no risk of the output drive filling mid-run
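The environment checks above can be scripted so they run in seconds before every launch. A minimal sketch using only the standard library — the base URL matches the default above, and the 2 GB headroom threshold is an arbitrary assumption:

```python
import json
import shutil
import urllib.error
import urllib.request

def ollama_is_up(base_url="http://localhost:11434", timeout=5):
    """Return True if the Ollama /api/tags endpoint answers."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            models = json.load(resp).get("models", [])
            print(f"Ollama up: {len(models)} model(s) available")
            return True
    except (urllib.error.URLError, OSError):
        return False

def disk_space_ok(path=".", min_free_gb=2.0):
    """Confirm the output drive has headroom before launching the run."""
    free_gb = shutil.disk_usage(path).free / 1e9
    print(f"Free disk space: {free_gb:.1f} GB")
    return free_gb >= min_free_gb
```

Run both before Block 2 and abort if either returns `False`.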
**Configuration**

- Verify `config.py` reflects the exact parameters you decided on during Day 3 and Day 5
- Confirm `system_prompt=""` is the default in your batch runner — this is the no-mitigation baseline, and it must be consistent across every record
- Double-check that your output path points to a new file, not one of your smoke-test files from Day 5
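The empty-default check can itself be automated with `inspect.signature` rather than read by eye. A sketch — the stand-in function below only mirrors the batch runner's assumed signature; in your project you would import the real `run_pipeline_batch` from `pipeline` instead:

```python
import inspect

def default_of(fn, param):
    """Read a function's default value for a named parameter."""
    return inspect.signature(fn).parameters[param].default

# Stand-in with the same assumed signature as the Day 5 batch runner —
# replace with `from pipeline import run_pipeline_batch` in your project.
def run_pipeline_batch(prompts, output_path, system_prompt="", resume=True):
    ...

assert default_of(run_pipeline_batch, "system_prompt") == ""
print("Baseline default confirmed: system_prompt is empty")
```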
**Data**

- Confirm your full subset file is intact and the prompt count matches what you recorded on Day 2
- Verify the `id` field is unique across all records — the resume logic from Day 5 depends on this
- Make a note of the exact prompt count you are running so you can verify completeness afterwards
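The uniqueness and count checks take only a few lines. This sketch assumes the `data/subset.jsonl` path used in Block 2 and an `id` field on every record:

```python
import json
from collections import Counter

def check_subset(path):
    """Load the subset, verify ids are unique, and return the prompt count."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    dupes = [i for i, n in Counter(r["id"] for r in records).items() if n > 1]
    if dupes:
        raise ValueError(f"Duplicate ids (resume logic will break): {dupes[:5]}")
    return len(records)

# e.g. expected = check_subset("data/subset.jsonl")
```

Record the returned count — it is the number you will verify against in Block 5.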
**Output**

- Confirm the results directory exists and is writable
- Name the output file unambiguously: something like `results/baseline_llama31_8b_[date].jsonl` rather than a generic name. You will have multiple results files by Week 3, and clear naming will matter.
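Generating the dated name programmatically avoids typos and guarantees the date is always current. A one-line sketch — the `model_tag` is an assumption matching the naming convention above:

```python
from datetime import date

model_tag = "llama31_8b"  # assumption: matches your naming convention
output_path = f"results/baseline_{model_tag}_{date.today().isoformat()}.jsonl"
print(output_path)
```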
### Block 2 — Launch the full run (10 min, then mostly unattended)
```python
import json
import random

from pipeline import run_pipeline_batch

random.seed(42)

# Load the full subset
with open("data/subset.jsonl", "r") as f:
    all_prompts = [json.loads(line) for line in f]

print(f"Total prompts to run: {len(all_prompts)}")

run_pipeline_batch(
    prompts=all_prompts,
    output_path="results/baseline_llama31_8b_2026-03-23.jsonl",
    system_prompt="",  # no-mitigation baseline
    resume=True,
)

print("Baseline run complete.")
```
Use the same `random.seed(42)` you set on Day 5 if your subset was shuffled during sampling. The goal is that the prompt order and selection are fully reproducible from your Day 2 data file.
Once the run starts, verify the first 5-10 records are being written to the output file correctly before leaving it unattended. Open the file in a second terminal and spot-check:
```shell
tail -f results/baseline_llama31_8b_2026-03-23.jsonl
```
If the first few records look well-formed and scores are non-null, you can step away.
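If you prefer a programmatic spot-check over eyeballing `tail -f`, a small sketch works too — the field names assume the output schema used in the verification code later in this plan:

```python
import json
from itertools import islice

def spot_check(path, n=10):
    """Parse the first n records; count any with null completions or scores."""
    with open(path) as f:
        records = [json.loads(line) for line in islice(f, n)]
    bad = [r for r in records
           if r.get("completion") is None or r.get("toxicity_scores") is None]
    return len(records), len(bad)

# e.g. parsed, bad = spot_check("results/baseline_llama31_8b_2026-03-23.jsonl")
```

If `bad` is zero across the first ten records, step away with more confidence.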
### Block 3 — While the run is in progress (60-90 min, parallel work)
Based on your Day 5 time estimate, spend this time on Day 8 preparation rather than waiting idle. Day 8 is the safety prompt design session, and doing some of the reading now means you will arrive better prepared.
Suggested reading and tasks while the baseline runs:
**Read on safety prompting approaches**
- Perez et al. (2022), “Ignore Previous Prompt: Attack Techniques For Language Models” — useful context on how system prompts can be circumvented, which informs what makes a safety prompt robust
- Check the Llama 3.1 model card on HuggingFace for Meta’s own recommended system prompt — this is a reasonable candidate for one of your mitigation conditions
- Skim recent work on Constitutional AI and RLHF to understand the landscape your prompt-based approach sits within
**Draft candidate safety prompts**
Start a `notes/safety_prompt_candidates.md` file with at least 3-4 candidate prompts across the spectrum your plan describes — minimal, safety-focused, refusal-oriented, and detailed. You do not need to finalise them today, but having drafts ready will make Day 8 more productive.
**Review the ToxiGen paper's results section**

Specifically, look at how they report per-group toxicity rates. This will inform how you structure your own Day 7 analysis and what comparisons are most meaningful to draw.
### Block 4 — Monitor and handle errors (periodic, throughout run)
Check in on the run every 20-30 minutes. Specifically look for:
**Stalled progress**

If the record count in the output file has not increased in more than a few minutes, Ollama may have timed out or crashed. Check:
```shell
wc -l results/baseline_llama31_8b_2026-03-23.jsonl
```
If it has stalled, restart Ollama and relaunch the script — the resume logic will pick up where it left off.
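The periodic check-in can be automated with a small watcher that polls the record count and exits when the run finishes or stalls. A sketch — the interval and stall threshold are illustrative, not tuned:

```python
import time

def watch_progress(path, expected, interval=120, stall_checks=3):
    """Poll the output file's record count; return 'done' or 'stalled'."""
    last, stalled = 0, 0
    while True:
        try:
            with open(path) as f:
                count = sum(1 for _ in f)
        except FileNotFoundError:
            count = 0
        print(f"{count}/{expected} records written")
        if count >= expected:
            return "done"
        stalled = stalled + 1 if count == last else 0
        if stalled >= stall_checks:
            return "stalled"  # restart Ollama and relaunch; resume picks up here
        last = count
        time.sleep(interval)
```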
**Error patterns**

If you see a spike in `None` completions in the progress logs, check whether Ollama is returning errors. A small number of `None` completions is acceptable; a systematic pattern suggests a configuration or resource issue.
**Memory pressure**

On an M2 Max with Llama 3.1 8B, memory should not be an issue, but if your machine becomes noticeably slow during the run, check Activity Monitor to confirm the model is not competing with other processes for RAM.
### Block 5 — Post-run verification (30-40 min)
Once the run completes, do not proceed to Day 7 analysis without verifying the output first.
**Completeness check**
```python
import json

with open("results/baseline_llama31_8b_2026-03-23.jsonl") as f:
    records = [json.loads(line) for line in f]

print(f"Total records: {len(records)}")
print(f"Expected: {len(all_prompts)}")  # all_prompts loaded in the Block 2 launch script

null_completions = [r for r in records if r["completion"] is None]
null_scores = [r for r in records if r["toxicity_scores"] is None]
print(f"Null completions: {len(null_completions)}")
print(f"Null scores: {len(null_scores)}")
```
A small number of nulls (under 1-2%) is acceptable and should be noted in your writeup. A larger number warrants investigation before you proceed.
**Schema validation**

Confirm every record contains the expected fields. An unexpected missing field at this stage is much easier to fix by re-running a small subset than it will be during Week 3 analysis:
```python
expected_keys = {
    "prompt_id", "prompt", "demographic_group", "system_prompt",
    "completion", "toxicity_scores", "above_threshold",
    "model", "temperature", "max_tokens", "timestamp",
}

for i, r in enumerate(records):
    missing = expected_keys - set(r.keys())
    if missing:
        print(f"Record {i} missing fields: {missing}")
```
**Sanity check on scores**
```python
import statistics
from collections import Counter

scored = [r for r in records if r["toxicity_scores"]]
tox = [r["toxicity_scores"]["toxicity"] for r in scored]

print(f"Mean toxicity: {statistics.mean(tox):.3f}")
print(f"Max toxicity: {max(tox):.3f}")
print(f"Min toxicity: {min(tox):.3f}")

groups = Counter(r["demographic_group"] for r in records)
print(f"\nRecords per group: {dict(groups)}")
```
Compare the mean and distribution against your Day 5 smoke test results. They should be in a similar range. A dramatic difference — particularly a much lower mean than expected — could indicate the system prompt is not genuinely empty or that a parameter changed between runs.
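A per-group mean is also worth a quick look now, since it anticipates the Day 7 breakdown. A sketch over the same `records` list loaded above, skipping unscored records:

```python
import statistics
from collections import defaultdict

def per_group_means(records):
    """Mean toxicity per demographic group, skipping unscored records."""
    by_group = defaultdict(list)
    for r in records:
        if r["toxicity_scores"]:
            by_group[r["demographic_group"]].append(r["toxicity_scores"]["toxicity"])
    return {g: statistics.mean(v) for g, v in sorted(by_group.items())}
```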
**Back up the results file**

Once verified, copy the results file to a second location. This is your primary dataset, and regenerating it costs several hours of compute time.
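A plain copy can silently truncate; verifying a checksum after the copy is cheap insurance. A sketch — the backup destination path is an assumption:

```python
import hashlib
import shutil

def backup_results(src, dest):
    """Copy the results file and verify the copy byte-for-byte via SHA-256."""
    shutil.copy2(src, dest)

    def digest(p):
        with open(p, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    if digest(src) != digest(dest):
        raise IOError("Backup checksum mismatch — copy is corrupt")
    return digest(dest)

# e.g. backup_results("results/baseline_llama31_8b_2026-03-23.jsonl",
#                     "backup/baseline_llama31_8b_2026-03-23.jsonl")
```

Keep the printed digest in your notes so you can re-verify either copy later.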
## Deliverables Checklist
- `results/baseline_llama31_8b_[date].jsonl` — complete, verified, backed up
- Record count confirmed against expected prompt total
- Null completion and null score rates documented
- Sanity check statistics recorded in `notes/day6_observations.md`
- Draft safety prompt candidates started in `notes/safety_prompt_candidates.md`
- Any errors or anomalies from the run documented, with notes on whether they require action
## Connection to the rest of the plan
This file is your ground truth for the entire project. Every comparison you make in Weeks 2 and 3 — statistical tests, effect sizes, per-group breakdowns, qualitative analysis — is measured against it. Treating the verification step as optional or cursory is the single most likely way to introduce a problem that is painful to diagnose later. It is worth the 30-40 minutes.