AGI Strategy - Day 3

Day 3 Detailed Plan: Baseline Generation Pipeline

Context from prior days: By this point you should have a working Python environment with Ollama and Llama 3.1 8B installed (Day 1), and a saved, stratified subset of ToxiGen prompts (Day 2). Day 3 builds directly on both.


Session Structure (2-3 hours)


Block 1 — Write the core generation function (45-60 min)

The function signature is already sketched in your plan. Flesh it out with the following considerations:

For Ollama:

import requests
import json

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_NAME = "llama3.1:8b"

def generate_completion(
    prompt: str,
    system_prompt: str = "",
    temperature: float = 0.7,
    max_tokens: int = 100,
    top_p: float = 0.9,
) -> str:
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "system": system_prompt,
        "stream": False,
        "options": {
            "temperature": temperature,
            "num_predict": max_tokens,
            "top_p": top_p,
        },
    }
    response = requests.post(OLLAMA_URL, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["response"].strip()

Key decisions to make explicit in your code comments:

  • system_prompt="" is intentional — this is the no-mitigation baseline. Leaving it empty or using a neutral value establishes your control condition.
  • temperature=0.7 is a reasonable default for open-ended generation, but you will want to test this (see Block 2).
  • max_tokens=100 keeps outputs short and consistent with ToxiGen’s evaluation conventions. Check the original paper to confirm what completion length they used; you may want to match it.

Error handling — add this from the start, not as an afterthought:

def generate_completion(
    prompt: str,
    system_prompt: str = "",
    temperature: float = 0.7,
    max_tokens: int = 100,
    top_p: float = 0.9,
) -> str | None:
    payload = {...}  # build the same payload dict as above
    try:
        response = requests.post(OLLAMA_URL, json=payload, timeout=60)
        response.raise_for_status()
        return response.json()["response"].strip()
    except requests.exceptions.Timeout:
        print(f"[WARN] Timeout on prompt: {prompt[:60]}")
        return None
    except requests.exceptions.RequestException as e:
        print(f"[ERROR] Request failed: {e}")
        return None

Returning None on failure rather than raising an exception will matter when you run 500-1000 prompts on Day 6 — you do not want a single failure to abort the entire run.
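If a failure looks transient (a timeout while the machine is briefly under load, say), a thin retry layer on top of that convention is cheap. A sketch: generate_with_retry is a hypothetical helper, and it takes the generation function as an argument rather than assuming a particular backend, which also makes it testable without a live server.

```python
import time
from typing import Callable, Optional

def generate_with_retry(
    generate: Callable[[str], Optional[str]],
    prompt: str,
    retries: int = 2,
    backoff: float = 2.0,
) -> Optional[str]:
    """Call generate() up to retries + 1 times; it is expected to
    return None on failure. Returns None only if every attempt fails."""
    for attempt in range(retries + 1):
        result = generate(prompt)
        if result is not None:
            return result
        if attempt < retries:
            time.sleep(backoff * (attempt + 1))  # back off 2s, 4s, ...
    return None
```

On Day 6 the batch loop would call generate_with_retry(generate_completion, prompt) instead of the raw function.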


Block 2 — Parameter sensitivity check (30-45 min)

Pick 5-10 prompts from your Day 2 subset (sample across at least 2-3 demographic groups) and run each through the generation function at varying settings. You are not doing a full grid search — just enough to make an informed, documented decision.

Parameters and values to test:

  • temperature: 0.0, 0.5, 0.7, 1.0. At 0.0, outputs are deterministic and may be more stilted; at 1.0, more varied but potentially noisier for scoring.
  • max_tokens: 50, 100, 150. Shorter outputs reduce the surface area for toxicity but may truncate mid-sentence, complicating scoring.
  • top_p: 0.9, 1.0. Minor effect in practice; 0.9 is a safe default.
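The check is easier to document if every combination is enumerated up front rather than buried in ad-hoc loops. A sketch along those lines; build_sweep is a hypothetical helper, and the value lists mirror the candidates above:

```python
from itertools import product

# Candidate values from the sensitivity check.
TEMPERATURES = [0.0, 0.5, 0.7, 1.0]
MAX_TOKENS_OPTIONS = [50, 100, 150]
TOP_P_OPTIONS = [0.9, 1.0]

def build_sweep(prompts: list[str]) -> list[dict]:
    """Enumerate every (prompt, temperature, max_tokens, top_p)
    combination as an explicit record that can be logged next to
    the completion it produced."""
    return [
        {"prompt": p, "temperature": t, "max_tokens": m, "top_p": tp}
        for p, t, m, tp in product(
            prompts, TEMPERATURES, MAX_TOKENS_OPTIONS, TOP_P_OPTIONS
        )
    ]
```

Each record's fields map directly onto generate_completion's keyword arguments. Note the combinatorics: with 10 prompts this is 240 runs, so trim the value lists if that exceeds your time budget.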

Document your chosen settings in a config.py or a simple dict at the top of your script. Reproducibility depends on this being explicit and version-controlled.
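A minimal config.py along these lines is enough. The values below are the Day 3 defaults from this plan; the key names are illustrative and should match whatever parameters your generation function actually takes:

```python
# config.py: all generation settings in one version-controlled place.
MODEL_NAME = "llama3.1:8b"

GENERATION_CONFIG = {
    "system_prompt": "",   # empty on purpose: no-mitigation baseline
    "temperature": 0.7,    # revisit after the Block 2 sensitivity check
    "max_tokens": 100,
    "top_p": 0.9,
}
```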

Important note on determinism: If you want results that are fully reproducible, set temperature=0.0. However, this may not reflect realistic model behavior. A common compromise is to use a fixed random seed if your backend supports it, or to acknowledge stochasticity explicitly in your writeup. Decide now and document it.
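Ollama is one backend that does support this: its generate API accepts a seed field inside options (confirm against the API docs for your installed version). A sketch of the compromise, using the same payload shape as Block 1:

```python
# Same request payload as Block 1, with a fixed seed added so that
# sampling stays on (temperature > 0) but reruns are repeatable.
payload = {
    "model": "llama3.1:8b",
    "prompt": "Example prompt",
    "stream": False,
    "options": {
        "temperature": 0.7,
        "num_predict": 100,
        "top_p": 0.9,
        "seed": 42,  # fixed seed; document this choice in config.py
    },
}
```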


Block 3 — Run the smoke test (30 min)

Generate completions for 10-20 prompts from your saved subset and write results to a file immediately. Do not just print to console.

import json
from pathlib import Path

def run_smoke_test(prompts: list[str], output_path: str = "results/smoke_test.jsonl"):
    # Create the output directory from the path itself, so a custom
    # output_path does not silently point at a missing folder.
    Path(output_path).parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "w") as f:
        for i, prompt in enumerate(prompts):
            completion = generate_completion(prompt)
            record = {
                "id": i,
                "prompt": prompt,
                "completion": completion,
                "model": MODEL_NAME,
                "temperature": 0.7,
                "max_tokens": 100,
            }
            # Write each record as soon as it is generated, so an
            # interrupted run still leaves a valid partial JSONL file.
            f.write(json.dumps(record) + "\n")
            print(f"[{i+1}/{len(prompts)}] Done")

    print(f"Saved {len(prompts)} results to {output_path}")

JSONL (one JSON object per line) is preferable to a single JSON array here — it is easier to append to incrementally and survives partial writes during long runs.
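The flip side is that the reader needs to tolerate the artifacts JSONL makes survivable. A sketch of a loader (load_jsonl is a hypothetical helper) that skips blank lines and stops cleanly at a truncated final record:

```python
import json

def load_jsonl(path: str) -> list[dict]:
    """Read a JSONL file, skipping blank lines and stopping at a
    truncated final line left by an interrupted run."""
    records = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError:
                break  # partial trailing line; everything before it is intact
    return records
```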


Block 4 — Qualitative review and documentation (15-20 min)

Read through the smoke test outputs manually. You are not scoring toxicity yet (that is Day 4), but you want to confirm:

  • Completions are coherent and non-empty
  • The model is continuing the prompts, not refusing or hallucinating errors
  • Output length is appropriate given your max_tokens setting
  • No obvious encoding or formatting issues

Record any surprises in a notes/day3_observations.md file. This feeds directly into your Week 3 qualitative analysis section.


Deliverables Checklist

  • generation.py (or equivalent) with generate_completion function, error handling, and documented parameters
  • config.py or equivalent with all generation settings explicitly defined
  • results/smoke_test.jsonl with 10-20 generated completions
  • notes/day3_observations.md with parameter sensitivity findings and qualitative notes
  • Estimated time-per-completion recorded (needed for Day 5’s time-to-full-run estimate)
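For the last item, timing individual calls during the smoke test is enough. A sketch (timed_generate is a hypothetical helper; it takes the generation function as an argument so it can be exercised without a live server):

```python
import time

def timed_generate(generate, prompt: str):
    """Run one generation call and return (completion, elapsed_seconds)."""
    start = time.perf_counter()
    completion = generate(prompt)
    return completion, time.perf_counter() - start
```

Average the elapsed times over the smoke test, then multiply by your full prompt count and divide by 60 for the Day 5 minutes estimate.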

Connection to the rest of the plan

The generation function you write today is the component everything else depends on. Day 4 wraps it with a toxicity scorer; Day 5 combines both into the full pipeline; Days 6 and 9 run it at scale. Investing time now in clean error handling and explicit configuration will save significant debugging time later. Resist the urge to hard-code parameters.


One question worth thinking through before you start: the ToxiGen prompts are designed to elicit toxic completions — they are intentionally adversarial. Review a few from your Day 2 subset before writing the generation code so you have a realistic sense of what the model will be processing. This is relevant both for setting max_tokens appropriately and for your own situational awareness going into the smoke test.