# AGI Strategy - Day 3

## Day 3 Detailed Plan: Baseline Generation Pipeline
Context from prior days: By this point you should have a working Python environment with Ollama and Llama 3.1 8B installed (Day 1), and a saved, stratified subset of ToxiGen prompts (Day 2). Day 3 builds directly on both.
## Session Structure (2-3 hours)

### Block 1: Write the core generation function (45-60 min)
The function signature is already sketched in your plan. Flesh it out with the following considerations:
For Ollama:
```python
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"
MODEL_NAME = "llama3.1:8b"

def generate_completion(
    prompt: str,
    system_prompt: str = "",   # empty = no-mitigation baseline (see below)
    temperature: float = 0.7,
    max_tokens: int = 100,
    top_p: float = 0.9,
) -> str:
    payload = {
        "model": MODEL_NAME,
        "prompt": prompt,
        "system": system_prompt,
        "stream": False,  # return the full completion in one response
        "options": {
            "temperature": temperature,
            "num_predict": max_tokens,  # Ollama's name for the max-token cap
            "top_p": top_p,
        },
    }
    response = requests.post(OLLAMA_URL, json=payload, timeout=60)
    response.raise_for_status()
    return response.json()["response"].strip()
```
Key decisions to make explicit in your code comments:
- `system_prompt=""` is intentional: this is the no-mitigation baseline. Leaving it empty or using a neutral value establishes your control condition.
- `temperature=0.7` is a reasonable default for open-ended generation, but you will want to test this (see Block 2).
- `max_tokens=100` keeps outputs short and consistent with ToxiGen's evaluation conventions. Check the original paper to confirm what completion length they used; you may want to match it.
Error handling — add this from the start, not as an afterthought:
```python
def generate_completion(...) -> str | None:
    try:
        # ... request logic ...
    except requests.exceptions.Timeout:
        print(f"[WARN] Timeout on prompt: {prompt[:60]}")
        return None
    except requests.exceptions.RequestException as e:
        print(f"[ERROR] Request failed: {e}")
        return None
```
Returning None on failure rather than raising an exception will matter when you run 500-1000 prompts on Day 6 — you do not want a single failure to abort the entire run.
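Because failures surface as `None`, transient problems (a timeout under load, a briefly unresponsive server) can be retried cheaply at the call site. A minimal sketch of such a helper; `with_retries` and its parameters are my naming, not part of any library:

```python
import time
from typing import Callable, Optional

def with_retries(
    fn: Callable[[], Optional[str]],
    attempts: int = 3,
    backoff_s: float = 2.0,
) -> Optional[str]:
    """Call fn until it returns a non-None result, sleeping between attempts.

    fn is assumed to follow the return-None-on-failure convention above.
    """
    for attempt in range(attempts):
        result = fn()
        if result is not None:
            return result
        if attempt < attempts - 1:
            time.sleep(backoff_s * (attempt + 1))  # linear backoff: 2s, 4s, ...
    return None  # still failing after all attempts; record and move on
```

Usage would look like `completion = with_retries(lambda: generate_completion(prompt))`, keeping the generation function itself unchanged.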
### Block 2: Parameter sensitivity check (30-45 min)
Pick 5-10 prompts from your Day 2 subset (sample across at least 2-3 demographic groups) and run each through the generation function at varying settings. You are not doing a full grid search — just enough to make an informed, documented decision.
| Parameter | Values to test | What to watch for |
|---|---|---|
| `temperature` | 0.0, 0.5, 0.7, 1.0 | At 0.0, outputs are deterministic and may be more stilted; at 1.0, more varied but potentially noisier for scoring |
| `max_tokens` | 50, 100, 150 | Shorter outputs reduce the surface area for toxicity but may truncate mid-sentence, complicating scoring |
| `top_p` | 0.9, 1.0 | Minor effect in practice; 0.9 is a safe default |
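The sweep itself can be a short loop. A sketch, assuming the `generate_completion(prompt, temperature=..., max_tokens=...)` signature from Block 1 (passed in here as `generate` so the sketch is testable in isolation); the function name and output path are my choices:

```python
import itertools
import json
from pathlib import Path

# Candidate settings from the table above; keep the prompt sample small --
# this is a spot check, not a grid search.
TEMPERATURES = [0.0, 0.5, 0.7, 1.0]
MAX_TOKENS_VALUES = [50, 100, 150]

def run_sensitivity_check(prompts, generate, out_path="results/sensitivity.jsonl"):
    """Run each prompt at each (temperature, max_tokens) pair; save labeled outputs."""
    Path(out_path).parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w") as f:
        for prompt in prompts:
            for temp, max_tok in itertools.product(TEMPERATURES, MAX_TOKENS_VALUES):
                completion = generate(prompt, temperature=temp, max_tokens=max_tok)
                f.write(json.dumps({
                    "prompt": prompt,
                    "temperature": temp,
                    "max_tokens": max_tok,
                    "completion": completion,
                }) + "\n")
```

Reading the labeled outputs side by side is usually enough to pick settings; resist tuning further than that.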
Document your chosen settings in a config.py or a simple dict at the top of your script. Reproducibility depends on this being explicit and version-controlled.
Important note on determinism: If you want results that are fully reproducible, set temperature=0.0. However, this may not reflect realistic model behavior. A common compromise is to use a fixed random seed if your backend supports it, or to acknowledge stochasticity explicitly in your writeup. Decide now and document it.
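A minimal `config.py` sketch covering both points above; the dict name and values are placeholders to be replaced with whatever the sensitivity check settles on:

```python
# config.py -- single source of truth for generation settings.
# Values below are the defaults discussed above, not final choices.

GENERATION_CONFIG = {
    "model": "llama3.1:8b",
    "temperature": 0.7,   # revisit after the Block 2 sensitivity check
    "max_tokens": 100,    # match ToxiGen's completion length if confirmed
    "top_p": 0.9,
    "system_prompt": "",  # empty on purpose: no-mitigation baseline
    "seed": None,         # set an int here if your backend supports seeding
}
```

Importing this dict everywhere (generation, smoke test, full run) is what makes "documented and version-controlled" more than a good intention.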
### Block 3: Run the smoke test (30 min)
Generate completions for 10-20 prompts from your saved subset and write results to a file immediately. Do not just print to console.
```python
import json
from pathlib import Path

def run_smoke_test(prompts: list[str], output_path: str = "results/smoke_test.jsonl"):
    Path("results").mkdir(exist_ok=True)
    results = []
    for i, prompt in enumerate(prompts):
        completion = generate_completion(prompt)
        record = {
            "id": i,
            "prompt": prompt,
            "completion": completion,
            "model": MODEL_NAME,
            "temperature": 0.7,
            "max_tokens": 100,
        }
        results.append(record)
        print(f"[{i+1}/{len(prompts)}] Done")
    with open(output_path, "w") as f:
        for r in results:
            f.write(json.dumps(r) + "\n")
    print(f"Saved {len(results)} results to {output_path}")
```
JSONL (one JSON object per line) is preferable to a single JSON array here — it is easier to append to incrementally and survives partial writes during long runs.
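That append-friendliness can be exploited directly: instead of collecting everything in a list and writing at the end, write each record the moment it is generated, so a crash mid-run loses nothing already completed. A sketch; `append_record` is my naming:

```python
import json

def append_record(record: dict, output_path: str) -> None:
    """Append one JSON object per line; safe to call once per completion."""
    with open(output_path, "a") as f:
        f.write(json.dumps(record) + "\n")
```

Inside the smoke-test loop, calling `append_record(record, output_path)` per iteration (and dropping the `results` list) gives this behavior; reopening the file each time is cheap at 10-1000 records.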
### Block 4: Qualitative review and documentation (15-20 min)
Read through the smoke test outputs manually. You are not scoring toxicity yet (that is Day 4), but you want to confirm:
- Completions are coherent and non-empty
- The model is continuing the prompts, not refusing or hallucinating errors
- Output length is appropriate given your `max_tokens` setting
- No obvious encoding or formatting issues
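The first three checks can be pre-screened mechanically before you read anything, so manual time goes to the interesting records. A sketch, assuming the smoke-test JSONL format above; the refusal markers are a crude, hypothetical heuristic to extend as you see real outputs:

```python
import json

REFUSAL_MARKERS = ("i can't", "i cannot", "as an ai")  # crude heuristic, extend as needed

def flag_records(jsonl_path: str) -> list[dict]:
    """Return records worth a closer manual look: empty, failed, or likely refusals."""
    flagged = []
    with open(jsonl_path) as f:
        for line in f:
            rec = json.loads(line)
            comp = rec.get("completion")
            if not comp:
                rec["flag"] = "empty_or_failed"
            elif any(m in comp.lower() for m in REFUSAL_MARKERS):
                rec["flag"] = "possible_refusal"
            else:
                continue  # looks fine; skim it along with everything else
            flagged.append(rec)
    return flagged
```

This does not replace the read-through, but a high flag rate is an early warning that your settings or prompts need another look.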
Record any surprises in a `notes/day3_observations.md` file. This feeds directly into your Week 3 qualitative analysis section.
## Deliverables Checklist

- `generation.py` (or equivalent) with the `generate_completion` function, error handling, and documented parameters
- `config.py` or equivalent with all generation settings explicitly defined
- `results/smoke_test.jsonl` with 10-20 generated completions
- `notes/day3_observations.md` with parameter sensitivity findings and qualitative notes
- Estimated time-per-completion recorded (needed for Day 5's time-to-full-run estimate)
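The last item is cheap to capture during the smoke test. A sketch, with `generate` passed in so it can be timed against any backend; the function name is my choice:

```python
import time

def estimate_seconds_per_completion(prompts, generate) -> float:
    """Average wall-clock seconds per completion over a small prompt sample."""
    start = time.perf_counter()
    for prompt in prompts:
        generate(prompt)
    return (time.perf_counter() - start) / len(prompts)
```

Multiply the result by your planned prompt count (e.g. 1000 for the Day 6 run) and note the product in your Day 3 observations; that number drives Day 5's scheduling.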
## Connection to the rest of the plan
The generation function you write today is the component everything else depends on. Day 4 wraps it with a toxicity scorer; Day 5 combines both into the full pipeline; Days 6 and 9 run it at scale. Investing time now in clean error handling and explicit configuration will save significant debugging time later. Resist the urge to hard-code parameters.
One question worth thinking through before you start: the ToxiGen prompts are designed to elicit toxic completions — they are intentionally adversarial. Review a few from your Day 2 subset before writing the generation code so you have a realistic sense of what the model will be processing. This is relevant both for setting max_tokens appropriately and for your own situational awareness going into the smoke test.