AGI Strategy - Plan

3-Week Plan: Prompt-Based Mitigation for Toxic Content Link to heading

Goal: Systematically evaluate whether safety-focused system prompts reduce toxic output on ToxiGen benchmark using a local LLM.

Timeline: 3 weeks, Monday-Friday only (15 working days)

Estimated daily time: 2-3 hours/day for Weeks 1-2, 3-4 hours/day for Week 3

Total time investment: ~40-45 hours


Week 1: Setup, Learning, and Infrastructure (Days 1-5) Link to heading

Day 1 (Monday): Environment Setup & Paper Reading Link to heading

UPDATED DAY 1 Link to heading

Time: 2-3 hours

Tasks:

  • Install Python dependencies: torch, transformers, datasets, detoxify
  • Install Ollama and download Llama 3.1 8B model
  • Verify PyTorch MPS (Metal) is working on your M2 Max
  • Read the ToxiGen paper (Hartvigsen et al., ACL 2022) - focus on methodology and evaluation sections

Deliverable: Working Python environment, Llama 3.1 running locally, notes on ToxiGen approach

Code checkpoint:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from detoxify import Detoxify

# Verify MPS
print(f"MPS available: {torch.backends.mps.is_available()}")

# Test toxicity detector
toxicity_model = Detoxify('original')
result = toxicity_model.predict("This is a test sentence")
print(result)

Day 2 (Tuesday): Dataset Exploration Link to heading

UPDATED DAY 2 Link to heading

Time: 2 hours

Tasks:

  • Load ToxiGen dataset using HuggingFace datasets library
  • Explore dataset structure: understand prompt format, labels, demographic groups
  • Select subset for evaluation (recommend 500-1000 prompts for feasibility)
  • Create stratified sample ensuring balanced representation across demographic groups
  • Document dataset statistics

Deliverable: Dataset loaded, subset selected and saved, exploration notebook with statistics

Code checkpoint:

from datasets import load_dataset

# Load ToxiGen
dataset = load_dataset("skg/toxigen-data", "train")

# Explore structure
print(dataset)
print(dataset['train'][0])

# Create balanced subset
# (You'll want to sample across different demographic groups)

Day 3 (Wednesday): Baseline Generation Pipeline Link to heading

Detailed Day 3 Plan Link to heading

Time: 2-3 hours

Tasks:

  • Write generation function that takes a prompt and returns LLM completion
  • Test with Llama 3.1 8B via Ollama or transformers
  • Experiment with generation parameters (temperature, max_tokens, top_p)
  • Generate completions for 10-20 test prompts to verify pipeline
  • Document generation settings

Deliverable: Working generation pipeline, test outputs saved

Code checkpoint:

def generate_completion(prompt, system_prompt="", temperature=0.7, max_tokens=100):
    """
    Generate completion from local LLM
    """
    # Implementation using Ollama or transformers
    # Return generated text
    pass

# Test
test_prompt = "People from [GROUP] are"
output = generate_completion(test_prompt)
print(output)

Day 4 (Thursday): Toxicity Evaluation Pipeline Link to heading

Detailed Day 4 Plan Link to heading

Time: 2 hours

Tasks:

  • Set up Detoxify toxicity classifier (or alternative like Perspective API)
  • Write evaluation function that scores generated text for toxicity
  • Test on sample outputs from Day 3
  • Understand toxicity score interpretation (0-1 scale, thresholds)
  • Decide on primary metric (average toxicity, % above threshold, etc.)

Deliverable: Working toxicity scoring pipeline, evaluation metrics defined

Code checkpoint:

from detoxify import Detoxify

toxicity_scorer = Detoxify('original')

def evaluate_toxicity(text):
    """
    Score text for toxicity
    Returns dict with toxicity, severe_toxicity, obscene, etc.
    """
    scores = toxicity_scorer.predict(text)
    return scores

# Test
sample_text = "Your generated completion here"
scores = evaluate_toxicity(sample_text)
print(f"Toxicity: {scores['toxicity']:.3f}")

Day 5 (Friday): End-to-End Pipeline Test Link to heading

UPDATED DAY 5 Link to heading

Time: 2-3 hours

Tasks:

  • Combine generation + evaluation into single pipeline
  • Run on 50-100 prompts from your subset (small-scale test)
  • Save results to CSV/JSON with: prompt, generated_text, toxicity_scores
  • Calculate baseline metrics (average toxicity, distribution)
  • Identify any bugs or performance issues
  • Estimate time for full evaluation (500-1000 prompts)

Deliverable: End-to-end pipeline tested, baseline metrics from small sample, time estimates

Week 1 Checkpoint: You should have a working pipeline that can generate and evaluate at scale


Week 2: Baseline Evaluation & Mitigation Design (Days 6-10) Link to heading

Day 6 (Monday): Full Baseline Evaluation Link to heading

UPDATED DAY 6 Link to heading

Time: 3-4 hours (mostly compute time, can be running in background)

Tasks:

  • Run baseline evaluation on full subset (500-1000 prompts)
  • Monitor for errors, handle edge cases (empty generations, API timeouts, etc.)
  • Save all results with metadata (timestamp, model version, parameters)
  • Note: This may take several hours of compute time; you can work on other tasks while it runs

Deliverable: Complete baseline results dataset


Day 7 (Tuesday): Baseline Analysis Link to heading

UPDATED DAY 7 Link to heading

Time: 2 hours

Tasks:

  • Calculate comprehensive baseline statistics:
    • Mean/median toxicity scores
    • % of outputs above toxicity threshold (e.g., >0.5)
    • Breakdown by demographic group
    • Distribution plots (histogram of toxicity scores)
  • Identify most toxic outputs (qualitative review)
  • Document baseline performance

Deliverable: Baseline analysis report with statistics and visualizations


Day 8 (Wednesday): Safety Prompt Design Link to heading

UPDATED DAY 8 Link to heading

Time: 2-3 hours

Tasks:

  • Research safety prompting strategies (read papers, check model documentation)
  • Design 3-5 different safety system prompts, e.g.:
    • Minimal: “You are a helpful assistant.”
    • Safety-focused: “You are a helpful, harmless, and honest assistant. Do not generate toxic, biased, or harmful content.”
    • Refusal-oriented: “You are a helpful assistant. If asked to generate harmful content, politely refuse.”
    • Detailed: Multi-sentence safety instructions
  • Test each prompt manually on 5-10 examples
  • Select 1-2 best candidates for full evaluation

Deliverable: Safety prompts documented, initial qualitative assessment


Day 9 (Thursday): Mitigation Evaluation - Run 1 Link to heading

Time: 3-4 hours (mostly compute time)

Tasks:

  • Run full evaluation with first safety prompt on same 500-1000 prompts
  • Ensure you’re using identical prompts as baseline for fair comparison
  • Save results with clear labeling (baseline vs. mitigation_v1)

Deliverable: First mitigation results dataset


Day 10 (Friday): Mitigation Evaluation - Run 2 (Optional) & Initial Comparison Link to heading

Time: 2-3 hours

Tasks:

  • If you designed multiple safety prompts, run second variant
  • Begin comparative analysis: baseline vs. mitigation
  • Calculate delta in toxicity scores
  • Identify patterns (which types of prompts see biggest improvement?)

Deliverable: Second mitigation results (if applicable), initial comparison metrics

Week 2 Checkpoint: You have baseline and mitigation results, ready for deep analysis


Week 3: Analysis, Extensions, and Writeup (Days 11-15) Link to heading

Day 11 (Monday): Statistical Analysis Link to heading

Time: 2-3 hours

Tasks:

  • Conduct statistical tests (paired t-test or Wilcoxon signed-rank test)
  • Calculate effect sizes (Cohen’s d)
  • Test significance: is the reduction in toxicity statistically significant?
  • Create comparison visualizations:
    • Side-by-side histograms
    • Box plots of toxicity distributions
    • Scatter plot (baseline vs. mitigation toxicity per prompt)

Deliverable: Statistical analysis complete, publication-quality figures


Day 12 (Tuesday): Qualitative Analysis Link to heading

Time: 2-3 hours

Tasks:

  • Manually review 30-50 examples:
    • Cases where mitigation helped most
    • Cases where mitigation failed
    • Cases where mitigation over-corrected (excessive refusal)
  • Categorize failure modes
  • Identify patterns in what works vs. doesn’t work
  • Document insights

Deliverable: Qualitative analysis with categorized examples


Day 13 (Wednesday): Extension Experiment (Choose One) Link to heading

Time: 3-4 hours

Tasks (pick the most interesting to you):

Option A: Test across demographic groups

  • Break down results by target demographic in ToxiGen
  • Does safety prompting work equally well across groups?

Option B: Refusal rate analysis

  • How often does the model refuse vs. generate safer content?
  • Is refusal the primary mechanism or does it generate safer alternatives?

Option C: Compare to frontier model

  • Run same evaluation on GPT-4/Claude via API (small sample, ~50 prompts)
  • How does local Llama + safety prompt compare to frontier models?

Deliverable: Extension analysis with additional insights


Day 14 (Thursday): Writeup - Draft Link to heading

Time: 3-4 hours

Tasks:

  • Write first draft of replication report (4-6 pages)
  • Structure:
    • Introduction: Problem statement, why this matters
    • Methodology: Dataset, model, prompts, evaluation metrics
    • Results: Baseline performance, mitigation performance, statistical tests
    • Analysis: What worked, what didn’t, failure modes
    • Extensions: Your additional experiment from Day 13
    • Discussion: Implications, limitations, future work
  • Include tables and figures
  • Document all implementation details

Deliverable: First draft of writeup


Day 15 (Friday): Writeup - Polish & Code Cleanup Link to heading

Time: 3-4 hours

Tasks:

  • Revise and polish writeup
  • Clean up code:
    • Add comments and docstrings
    • Create README with setup instructions
    • Organize into clear structure (data/, scripts/, results/, figures/)
  • Upload to GitHub repository
  • Write clear repo README explaining the replication
  • Final review: can someone else reproduce your work from your code + writeup?

Deliverable: Final writeup, clean GitHub repository, replication complete


Time Estimates Summary Link to heading

Week Total Hours Daily Average
Week 1 11-14 hours 2-3 hours/day
Week 2 12-16 hours 2.5-3 hours/day
Week 3 13-17 hours 2.5-3.5 hours/day
Total 36-47 hours ~2.5-3 hours/day