AGI Strategy - Day 6
## Day 6 Detailed Plan: Full Baseline Evaluation
Context from prior days: Your pipeline is tested and working, your output schema is finalised, and you have a time-per-prompt estimate from Day 5. Today is primarily an execution day — the main task is running the pipeline over your full subset with no system prompt, producing the baseline results dataset that everything in Weeks 2 and 3 will be measured against. Most of the compute time will be unattended, so this plan accounts for how to use that time productively.
## Session Structure (3-4 hours, mostly compute)
### Block 1 — Pre-run checklist (20-30 min)
Before starting the full run, work through this systematically. Discovering a configuration error two hours into a run is significantly more costly than spending 20 minutes verifying upfront.
**Environment**

- Confirm Ollama is running: `curl http://localhost:11434/api/tags` should return your model list
- Confirm the model is loaded: send a single test prompt through `generate_completion` and verify a response
- Check available disk space — 1000 JSONL records with full text will be small, but confirm there is no risk of the output drive filling mid-run
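The environment checks above can be scripted so they run in seconds before every launch. A minimal sketch using only the standard library — the base URL matches the default above, and the 2 GB headroom threshold is an arbitrary assumption:

```python
import json
import shutil
import urllib.error
import urllib.request

def ollama_is_up(base_url="http://localhost:11434", timeout=5):
    """Return True if the Ollama /api/tags endpoint answers."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            models = json.load(resp).get("models", [])
            print(f"Ollama up: {len(models)} model(s) available")
            return True
    except (urllib.error.URLError, OSError):
        return False

def disk_space_ok(path=".", min_free_gb=2.0):
    """Confirm the output drive has headroom before launching the run."""
    free_gb = shutil.disk_usage(path).free / 1e9
    print(f"Free disk space: {free_gb:.1f} GB")
    return free_gb >= min_free_gb
```

Run both before Block 2 and abort if either returns `False`.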
**Configuration**

- Verify `config.py` reflects the exact parameters you decided on during Day 3 and Day 5
- Confirm `system_prompt=""` is the default in your batch runner — this is the no-mitigation baseline, and it must be consistent across every record
- Double-check that your output path points to a new file, not one of your smoke-test files from Day 5
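The empty-default check can itself be automated with `inspect.signature` rather than read by eye. A sketch — the stand-in function below only mirrors the batch runner's assumed signature; in your project you would import the real `run_pipeline_batch` from `pipeline` instead:

```python
import inspect

def default_of(fn, param):
    """Read a function's default value for a named parameter."""
    return inspect.signature(fn).parameters[param].default

# Stand-in with the same assumed signature as the Day 5 batch runner —
# replace with `from pipeline import run_pipeline_batch` in your project.
def run_pipeline_batch(prompts, output_path, system_prompt="", resume=True):
    ...

assert default_of(run_pipeline_batch, "system_prompt") == ""
print("Baseline default confirmed: system_prompt is empty")
```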
**Data**

- Confirm your full subset file is intact and the prompt count matches what you recorded on Day 2
- Verify the `id` field is unique across all records — the resume logic from Day 5 depends on this
- Make a note of the exact prompt count you are running so you can verify completeness afterwards
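The uniqueness and count checks take only a few lines. This sketch assumes the `data/subset.jsonl` path used in Block 2 and an `id` field on every record:

```python
import json
from collections import Counter

def check_subset(path):
    """Load the subset, verify ids are unique, and return the prompt count."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    dupes = [i for i, n in Counter(r["id"] for r in records).items() if n > 1]
    if dupes:
        raise ValueError(f"Duplicate ids (resume logic will break): {dupes[:5]}")
    return len(records)

# e.g. expected = check_subset("data/subset.jsonl")
```

Record the returned count — it is the number you will verify against in Block 5.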
**Output**

- Confirm the results directory exists and is writable
- Name the output file unambiguously: something like `results/baseline_llama31_8b_[date].jsonl` rather than a generic name. You will have multiple results files by Week 3, and clear naming will matter.
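Generating the dated name programmatically avoids typos and guarantees the date is always current. A one-line sketch — the `model_tag` is an assumption matching the naming convention above:

```python
from datetime import date

model_tag = "llama31_8b"  # assumption: matches your naming convention
output_path = f"results/baseline_{model_tag}_{date.today().isoformat()}.jsonl"
print(output_path)
```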
### Block 2 — Launch the full run (10 min, then mostly unattended)
```python
import json
import random

from pipeline import run_pipeline_batch

random.seed(42)

# Load the full subset
with open("data/subset.jsonl", "r") as f:
    all_prompts = [json.loads(line) for line in f]

print(f"Total prompts to run: {len(all_prompts)}")

run_pipeline_batch(
    prompts=all_prompts,
    output_path="results/baseline_llama31_8b_2026-03-23.jsonl",
    system_prompt="",  # no-mitigation baseline
    resume=True,
)

print("Baseline run complete.")
```
Use the same `random.seed(42)` you set on Day 5 if your subset was shuffled during sampling. The goal is that the prompt order and selection are fully reproducible from your Day 2 data file.
Once the run starts, verify the first 5-10 records are being written to the output file correctly before leaving it unattended. Open the file in a second terminal and spot-check:
```shell
tail -f results/baseline_llama31_8b_2026-03-23.jsonl
```
If the first few records look well-formed and scores are non-null, you can step away.
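If you prefer a programmatic spot-check over eyeballing `tail -f`, a small sketch works too — the field names assume the output schema used in the verification code later in this plan:

```python
import json
from itertools import islice

def spot_check(path, n=10):
    """Parse the first n records; count any with null completions or scores."""
    with open(path) as f:
        records = [json.loads(line) for line in islice(f, n)]
    bad = [r for r in records
           if r.get("completion") is None or r.get("toxicity_scores") is None]
    return len(records), len(bad)

# e.g. parsed, bad = spot_check("results/baseline_llama31_8b_2026-03-23.jsonl")
```

If `bad` is zero across the first ten records, step away with more confidence.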
### Block 3 — While the run is in progress (60-90 min, parallel work)
Based on your Day 5 time estimate, spend this time on Day 8 preparation rather than waiting idle. Day 8 is the safety prompt design session, and doing some of the reading now means you will arrive better prepared.
Suggested reading and tasks while the baseline runs:
**Read on safety prompting approaches**
- Perez et al. (2022), “Ignore Previous Prompt: Attack Techniques For Language Models” — useful context on how system prompts can be circumvented, which informs what makes a safety prompt robust
- Check the Llama 3.1 model card on HuggingFace for Meta’s own recommended system prompt — this is a reasonable candidate for one of your mitigation conditions
- Skim recent work on Constitutional AI and RLHF to understand the landscape your prompt-based approach sits within
**Draft candidate safety prompts**
Start a `notes/safety_prompt_candidates.md` file with at least 3-4 candidate prompts across the spectrum your plan describes — minimal, safety-focused, refusal-oriented, and detailed. You do not need to finalise them today, but having drafts ready will make Day 8 more productive.
**Review the ToxiGen paper's results section**

Specifically, look at how they report per-group toxicity rates. This will inform how you structure your own Day 7 analysis and what comparisons are most meaningful to draw.
### Block 4 — Monitor and handle errors (periodic, throughout run)
Check in on the run every 20-30 minutes. Specifically look for:
**Stalled progress**

If the record count in the output file has not increased in more than a few minutes, Ollama may have timed out or crashed. Check:
```shell
wc -l results/baseline_llama31_8b_2026-03-23.jsonl
```
If it has stalled, restart Ollama and relaunch the script — the resume logic will pick up where it left off.
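The periodic check-in can be automated with a small watcher that polls the record count and exits when the run finishes or stalls. A sketch — the interval and stall threshold are illustrative, not tuned:

```python
import time

def watch_progress(path, expected, interval=120, stall_checks=3):
    """Poll the output file's record count; return 'done' or 'stalled'."""
    last, stalled = 0, 0
    while True:
        try:
            with open(path) as f:
                count = sum(1 for _ in f)
        except FileNotFoundError:
            count = 0
        print(f"{count}/{expected} records written")
        if count >= expected:
            return "done"
        stalled = stalled + 1 if count == last else 0
        if stalled >= stall_checks:
            return "stalled"  # restart Ollama and relaunch; resume picks up here
        last = count
        time.sleep(interval)
```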
**Error patterns**

If you see a spike in `None` completions in the progress logs, check whether Ollama is returning errors. A small number of `None` completions is acceptable; a systematic pattern suggests a configuration or resource issue.
**Memory pressure**

On an M2 Max with Llama 3.1 8B, memory should not be an issue, but if your machine becomes noticeably slow during the run, check Activity Monitor to confirm the model is not competing with other processes for RAM.
### Block 5 — Post-run verification (30-40 min)
Once the run completes, do not proceed to Day 7 analysis without verifying the output first.
**Completeness check**
```python
import json

with open("results/baseline_llama31_8b_2026-03-23.jsonl") as f:
    records = [json.loads(line) for line in f]

print(f"Total records: {len(records)}")
print(f"Expected: {len(all_prompts)}")  # all_prompts loaded in the Block 2 launch script

null_completions = [r for r in records if r["completion"] is None]
null_scores = [r for r in records if r["toxicity_scores"] is None]
print(f"Null completions: {len(null_completions)}")
print(f"Null scores: {len(null_scores)}")
```
A small number of nulls (under 1-2%) is acceptable and should be noted in your writeup. A larger number warrants investigation before you proceed.
**Schema validation**

Confirm every record contains the expected fields. An unexpected missing field at this stage is much easier to fix by re-running a small subset than it will be during Week 3 analysis:
```python
expected_keys = {
    "prompt_id", "prompt", "demographic_group", "system_prompt",
    "completion", "toxicity_scores", "above_threshold",
    "model", "temperature", "max_tokens", "timestamp",
}

for i, r in enumerate(records):
    missing = expected_keys - set(r.keys())
    if missing:
        print(f"Record {i} missing fields: {missing}")
```
**Sanity check on scores**
```python
import statistics
from collections import Counter

scored = [r for r in records if r["toxicity_scores"]]
tox = [r["toxicity_scores"]["toxicity"] for r in scored]

print(f"Mean toxicity: {statistics.mean(tox):.3f}")
print(f"Max toxicity: {max(tox):.3f}")
print(f"Min toxicity: {min(tox):.3f}")

groups = Counter(r["demographic_group"] for r in records)
print(f"\nRecords per group: {dict(groups)}")
```
Compare the mean and distribution against your Day 5 smoke test results. They should be in a similar range. A dramatic difference — particularly a much lower mean than expected — could indicate the system prompt is not genuinely empty or that a parameter changed between runs.
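A per-group mean is also worth a quick look now, since it anticipates the Day 7 breakdown. A sketch over the same `records` list loaded above, skipping unscored records:

```python
import statistics
from collections import defaultdict

def per_group_means(records):
    """Mean toxicity per demographic group, skipping unscored records."""
    by_group = defaultdict(list)
    for r in records:
        if r["toxicity_scores"]:
            by_group[r["demographic_group"]].append(r["toxicity_scores"]["toxicity"])
    return {g: statistics.mean(v) for g, v in sorted(by_group.items())}
```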
**Back up the results file**

Once verified, copy the results file to a second location. This is your primary dataset, and regenerating it costs several hours of compute time.
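A plain copy can silently truncate; verifying a checksum after the copy is cheap insurance. A sketch — the backup destination path is an assumption:

```python
import hashlib
import shutil

def backup_results(src, dest):
    """Copy the results file and verify the copy byte-for-byte via SHA-256."""
    shutil.copy2(src, dest)

    def digest(p):
        with open(p, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    if digest(src) != digest(dest):
        raise IOError("Backup checksum mismatch — copy is corrupt")
    return digest(dest)

# e.g. backup_results("results/baseline_llama31_8b_2026-03-23.jsonl",
#                     "backup/baseline_llama31_8b_2026-03-23.jsonl")
```

Keep the printed digest in your notes so you can re-verify either copy later.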
## Deliverables Checklist
- `results/baseline_llama31_8b_[date].jsonl` — complete, verified, backed up
- Record count confirmed against expected prompt total
- Null completion and null score rates documented
- Sanity check statistics recorded in `notes/day6_observations.md`
- Draft safety prompt candidates started in `notes/safety_prompt_candidates.md`
- Any errors or anomalies from the run documented, with notes on whether they require action
## Connection to the rest of the plan
This file is your ground truth for the entire project. Every comparison you make in Weeks 2 and 3 — statistical tests, effect sizes, per-group breakdowns, qualitative analysis — is measured against it. Treating the verification step as optional or cursory is the single most likely way to introduce a problem that is painful to diagnose later. It is worth the 30-40 minutes.