AGI Strategy - Day 3

Day 3 Detailed Plan: Baseline Generation Pipeline

Context from prior days: By this point you should have a working Python environment with Ollama and Llama 3.1 8B installed (Day 1), and a saved, stratified subset of ToxiGen prompts (Day 2). Day 3 builds directly on both.


Session Structure (2-3 hours)


Block 1 — Write the core generation function (45-60 min)

The function signature is already sketched in your plan. Flesh it out with the following considerations:

For Ollama:

import requests
import json

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_NAME = "llama3.1:8b"

def generate_completion(
    prompt: str,
    system_prompt: str = "",
    temperature: float = 0.7,
    max_tokens: int = 100,
    top_p: float = 0.9,
) -> str:
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "system": system_prompt,
        "stream": False,
        "options": {
            "temperature": temperature,
            "num_predict": max_tokens,
            "top_p": top_p,
        },
    }
    response = requests.post(OLLAMA_URL, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["response"].strip()

Key decisions to make explicit in your code comments:

  • system_prompt="" is intentional — this is the no-mitigation baseline. Leaving it empty or using a neutral value establishes your control condition.
  • temperature=0.7 is a reasonable default for open-ended generation, but you will want to test this (see Block 2).
  • max_tokens=100 keeps outputs short and consistent with ToxiGen’s evaluation conventions. Check the original paper to confirm what completion length they used; you may want to match it.

Error handling — add this from the start, not as an afterthought:

def generate_completion(
    prompt: str,
    system_prompt: str = "",
    temperature: float = 0.7,
    max_tokens: int = 100,
    top_p: float = 0.9,
) -> str | None:
    payload = {...}  # build the same payload dict as above
    try:
        response = requests.post(OLLAMA_URL, json=payload, timeout=60)
        response.raise_for_status()
        return response.json()["response"].strip()
    except requests.exceptions.Timeout:
        print(f"[WARN] Timeout on prompt: {prompt[:60]}")
        return None
    except requests.exceptions.RequestException as e:
        print(f"[ERROR] Request failed: {e}")
        return None

Returning None on failure rather than raising an exception will matter when you run 500-1000 prompts on Day 6 — you do not want a single failure to abort the entire run.
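If a failure looks transient (a timeout while the machine is briefly under load, say), a thin retry layer on top of that convention is cheap. A sketch: generate_with_retry is a hypothetical helper, and it takes the generation function as an argument rather than assuming a particular backend, which also makes it testable without a live server.

```python
import time
from typing import Callable, Optional

def generate_with_retry(
    generate: Callable[[str], Optional[str]],
    prompt: str,
    retries: int = 2,
    backoff: float = 2.0,
) -> Optional[str]:
    """Call generate() up to retries + 1 times; it is expected to
    return None on failure. Returns None only if every attempt fails."""
    for attempt in range(retries + 1):
        result = generate(prompt)
        if result is not None:
            return result
        if attempt < retries:
            time.sleep(backoff * (attempt + 1))  # back off 2s, 4s, ...
    return None
```

On Day 6 the batch loop would call generate_with_retry(generate_completion, prompt) instead of the raw function.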


Block 2 — Parameter sensitivity check (30-45 min)

Pick 5-10 prompts from your Day 2 subset (sample across at least 2-3 demographic groups) and run each through the generation function at varying settings. You are not doing a full grid search — just enough to make an informed, documented decision.

Parameters and values to test:

  • temperature: 0.0, 0.5, 0.7, 1.0. At 0.0, outputs are deterministic and may be more stilted; at 1.0, more varied but potentially noisier for scoring.
  • max_tokens: 50, 100, 150. Shorter outputs reduce the surface area for toxicity but may truncate mid-sentence, complicating scoring.
  • top_p: 0.9, 1.0. Minor effect in practice; 0.9 is a safe default.
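The check is easier to document if every combination is enumerated up front rather than buried in ad-hoc loops. A sketch along those lines; build_sweep is a hypothetical helper, and the value lists mirror the candidates above:

```python
from itertools import product

# Candidate values from the sensitivity check.
TEMPERATURES = [0.0, 0.5, 0.7, 1.0]
MAX_TOKENS_OPTIONS = [50, 100, 150]
TOP_P_OPTIONS = [0.9, 1.0]

def build_sweep(prompts: list[str]) -> list[dict]:
    """Enumerate every (prompt, temperature, max_tokens, top_p)
    combination as an explicit record that can be logged next to
    the completion it produced."""
    return [
        {"prompt": p, "temperature": t, "max_tokens": m, "top_p": tp}
        for p, t, m, tp in product(
            prompts, TEMPERATURES, MAX_TOKENS_OPTIONS, TOP_P_OPTIONS
        )
    ]
```

Each record's fields map directly onto generate_completion's keyword arguments. Note the combinatorics: with 10 prompts this is 240 runs, so trim the value lists if that exceeds your time budget.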

Document your chosen settings in a config.py or a simple dict at the top of your script. Reproducibility depends on this being explicit and version-controlled.
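A minimal config.py along these lines is enough. The values below are the Day 3 defaults from this plan; the key names are illustrative and should match whatever parameters your generation function actually takes:

```python
# config.py: all generation settings in one version-controlled place.
MODEL_NAME = "llama3.1:8b"

GENERATION_CONFIG = {
    "system_prompt": "",   # empty on purpose: no-mitigation baseline
    "temperature": 0.7,    # revisit after the Block 2 sensitivity check
    "max_tokens": 100,
    "top_p": 0.9,
}
```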

Important note on determinism: If you want results that are fully reproducible, set temperature=0.0. However, this may not reflect realistic model behavior. A common compromise is to use a fixed random seed if your backend supports it, or to acknowledge stochasticity explicitly in your writeup. Decide now and document it.
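Ollama is one backend that does support this: its generate API accepts a seed field inside options (confirm against the API docs for your installed version). A sketch of the compromise, using the same payload shape as Block 1:

```python
# Same request payload as Block 1, with a fixed seed added so that
# sampling stays on (temperature > 0) but reruns are repeatable.
payload = {
    "model": "llama3.1:8b",
    "prompt": "Example prompt",
    "stream": False,
    "options": {
        "temperature": 0.7,
        "num_predict": 100,
        "top_p": 0.9,
        "seed": 42,  # fixed seed; document this choice in config.py
    },
}
```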


Block 3 — Run the smoke test (30 min)

Generate completions for 10-20 prompts from your saved subset and write results to a file immediately. Do not just print to console.

import json
from pathlib import Path

def run_smoke_test(prompts: list[str], output_path: str = "results/smoke_test.jsonl"):
    # Create the output directory from the path itself, so a custom
    # output_path does not silently point at a missing folder.
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w") as f:
        for i, prompt in enumerate(prompts):
            completion = generate_completion(prompt)
            record = {
                "id": i,
                "prompt": prompt,
                "completion": completion,
                "model": MODEL_NAME,
                "temperature": 0.7,
                "max_tokens": 100,
            }
            # Write each record as soon as it is generated, so an
            # interrupted run still leaves a valid partial JSONL file.
            f.write(json.dumps(record) + "\n")
            print(f"[{i+1}/{len(prompts)}] Done")

    print(f"Saved {len(prompts)} results to {output_path}")

JSONL (one JSON object per line) is preferable to a single JSON array here — it is easier to append to incrementally and survives partial writes during long runs.
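The flip side is that the reader needs to tolerate the artifacts JSONL makes survivable. A sketch of a loader (load_jsonl is a hypothetical helper) that skips blank lines and stops cleanly at a truncated final record:

```python
import json

def load_jsonl(path: str) -> list[dict]:
    """Read a JSONL file, skipping blank lines and stopping at a
    truncated final line left by an interrupted run."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                break  # partial trailing line; everything before it is intact
    return records
```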


Block 4 — Qualitative review and documentation (15-20 min)

Read through the smoke test outputs manually. You are not scoring toxicity yet (that is Day 4), but you want to confirm:

  • Completions are coherent and non-empty
  • The model is continuing the prompts, not refusing or hallucinating errors
  • Output length is appropriate given your max_tokens setting
  • No obvious encoding or formatting issues

Record any surprises in a notes/day3_observations.md file. This feeds directly into your Week 3 qualitative analysis section.


Deliverables Checklist

  • generation.py (or equivalent) with generate_completion function, error handling, and documented parameters
  • config.py or equivalent with all generation settings explicitly defined
  • results/smoke_test.jsonl with 10-20 generated completions
  • notes/day3_observations.md with parameter sensitivity findings and qualitative notes
  • Estimated time-per-completion recorded (needed for Day 5’s time-to-full-run estimate)
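For the last item, timing individual calls during the smoke test is enough. A sketch (timed_generate is a hypothetical helper; it takes the generation function as an argument so it can be exercised without a live server):

```python
import time

def timed_generate(generate, prompt: str):
    """Run one generation call and return (completion, elapsed_seconds)."""
    start = time.perf_counter()
    completion = generate(prompt)
    return completion, time.perf_counter() - start
```

Average the elapsed times over the smoke test, then multiply by your full prompt count and divide by 60 for the Day 5 minutes estimate.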

Connection to the rest of the plan

The generation function you write today is the component everything else depends on. Day 4 wraps it with a toxicity scorer; Day 5 combines both into the full pipeline; Days 6 and 9 run it at scale. Investing time now in clean error handling and explicit configuration will save significant debugging time later. Resist the urge to hard-code parameters.


One question worth thinking through before you start: the ToxiGen prompts are designed to elicit toxic completions — they are intentionally adversarial. Review a few from your Day 2 subset before writing the generation code so you have a realistic sense of what the model will be processing. This is relevant both for setting max_tokens appropriately and for your own situational awareness going into the smoke test.