AI Generated Learning Plan

We were given the task of using an LLM to generate a learning plan for our project. This is what Claude Sonnet 4.6 created.

Learning Roadmap: Local LLM Safety Testing Link to heading

Phase	Duration	Key Activities	Success Criteria
1. Environment Setup	1-2 days	Install Ollama or LM Studio; download a model (Llama 3.1 8B or Mistral 7B); verify GPU acceleration working on M2 Max	Successfully run inference locally with acceptable speed (>20 tokens/sec)
2. Baseline Testing	2-3 days	Select/create test prompts from ToxiGen or BBQ; run baseline evaluation; document model responses; establish scoring methodology	Complete 50-100 test prompts with documented baseline scores
3. Intervention Method	3-5 days	Choose mitigation approach (system prompts, fine-tuning, or RAG); implement the intervention; validate it’s working correctly	Intervention successfully applied without breaking model functionality
4. Post-Intervention Testing	1-2 days	Re-run identical test suite; score responses using same methodology; compare results quantitatively	Documented comparison showing measurable change in safety metrics
5. Analysis & Documentation	1-2 days	Analyze what worked/didn’t work; document limitations; identify next steps for deeper investigation	Written report with findings, methodology, and lessons learned

Total estimated time: 8-14 days (assuming part-time effort)

Detailed Phase Breakdown Link to heading

Phase 1: Environment Setup Link to heading

Install a local LLM runtime:

Ollama (recommended for simplicity) - command-line tool with easy model management
LM Studio (recommended for GUI) - visual interface, good for beginners
llama.cpp (advanced) - maximum control but steeper learning curve

Select an appropriate model:

Start with Llama 3.1 8B or Mistral 7B - both run well on M2 Max and have known safety characteristics
Avoid models >13B parameters initially due to memory constraints
Use quantized versions (Q4 or Q5) for better performance

Verify performance:

Check that Metal GPU acceleration is enabled
Monitor memory usage and token generation speed
Test with simple prompts to ensure stability

Phase 2: Baseline Testing Link to heading

Acquire or create test datasets:

Download ToxiGen test set from Hugging Face (focuses on hate speech/toxicity)
Use BBQ for bias testing (available on GitHub)
Create 20-30 custom prompts specific to your Nazi Germany example

Establish evaluation methodology:

Define clear scoring rubric (e.g., 0=appropriate refusal, 1=neutral, 2=problematic, 3=harmful)
Decide whether to score manually or use an automated evaluator
Document exact prompts and expected behaviors

Run baseline evaluation:

Use Python script or simple automation to run prompts consistently
Save all raw outputs for later comparison
Calculate aggregate scores across categories

Phase 3: Intervention Method Link to heading

Choose your mitigation approach (pick one to start):

Option A: System Prompt Engineering (easiest, 1-2 days)

Craft safety-focused system prompts
Test variations to find effective phrasing
Pros: No technical complexity, reversible
Cons: Limited effectiveness, can be circumvented

Option B: Fine-tuning with Safety Data (moderate, 3-4 days)

Use LoRA/QLoRA for parameter-efficient fine-tuning
Create or source safety-aligned training data (Constitutional AI examples, refusal examples)
Tools: Hugging Face transformers + PEFT library
Pros: More robust improvements
Cons: Requires ML knowledge, can degrade general capabilities

Option C: RAG with Safety Guidelines (moderate, 2-3 days)

Implement retrieval system that injects safety guidelines
Use vector database (ChromaDB, FAISS) to retrieve relevant safety rules
Pros: Flexible, explainable
Cons: Adds latency, requires prompt engineering

Phase 4: Post-Intervention Testing Link to heading

Ensure identical test conditions:

Use exact same prompts from baseline
Same temperature/sampling parameters
Same model version (or document if changed)

Run comparative evaluation:

Score using identical rubric
Look for both improvements and regressions
Test edge cases that weren’t in training data

Phase 5: Analysis & Documentation Link to heading

Quantitative analysis:

Calculate percentage improvement in safety scores
Identify which categories improved most/least
Measure any degradation in helpfulness or accuracy

Qualitative analysis:

Review specific examples of changed behavior
Identify failure modes that remain
Document unexpected behaviors

Technical Stack Recommendation Link to heading

For a working professional with limited ML background:

Runtime: Ollama (simplest setup)
Model: Llama 3.1 8B Instruct (Q4 quantization)
Testing framework: Python script with simple CSV logging
Intervention: Start with system prompt engineering, then try fine-tuning if time permits
Evaluation: Manual scoring for first 50 examples, then automated judge for scale

Estimated budget: $0 (all open-source tools)

Key Resources Link to heading

Ollama documentation and model library
Hugging Face datasets (ToxiGen, BBQ)
PEFT library documentation (for fine-tuning)
LangChain or LlamaIndex (for RAG implementation)