AI Generated Learning Plan
We were given the task of using an LLM to generate a learning plan for our project. This is what Claude Sonnet 4.6 created.
Learning Roadmap: Local LLM Safety Testing Link to heading
| Phase | Duration | Key Activities | Success Criteria |
|---|---|---|---|
| 1. Environment Setup | 1-2 days | Install Ollama or LM Studio; download a model (Llama 3.1 8B or Mistral 7B); verify GPU acceleration working on M2 Max | Successfully run inference locally with acceptable speed (>20 tokens/sec) |
| 2. Baseline Testing | 2-3 days | Select/create test prompts from ToxiGen or BBQ; run baseline evaluation; document model responses; establish scoring methodology | Complete 50-100 test prompts with documented baseline scores |
| 3. Intervention Method | 3-5 days | Choose mitigation approach (system prompts, fine-tuning, or RAG); implement the intervention; validate it’s working correctly | Intervention successfully applied without breaking model functionality |
| 4. Post-Intervention Testing | 1-2 days | Re-run identical test suite; score responses using same methodology; compare results quantitatively | Documented comparison showing measurable change in safety metrics |
| 5. Analysis & Documentation | 1-2 days | Analyze what worked/didn’t work; document limitations; identify next steps for deeper investigation | Written report with findings, methodology, and lessons learned |
Total estimated time: 8-14 days (assuming part-time effort)
Detailed Phase Breakdown Link to heading
Phase 1: Environment Setup Link to heading
Install a local LLM runtime:
- Ollama (recommended for simplicity) - command-line tool with easy model management
- LM Studio (recommended for GUI) - visual interface, good for beginners
- llama.cpp (advanced) - maximum control but steeper learning curve
Select an appropriate model:
- Start with Llama 3.1 8B or Mistral 7B - both run well on M2 Max and have known safety characteristics
- Avoid models >13B parameters initially due to memory constraints
- Use quantized versions (Q4 or Q5) for better performance
Verify performance:
- Check that Metal GPU acceleration is enabled
- Monitor memory usage and token generation speed
- Test with simple prompts to ensure stability
Phase 2: Baseline Testing Link to heading
Acquire or create test datasets:
- Download ToxiGen test set from Hugging Face (focuses on hate speech/toxicity)
- Use BBQ for bias testing (available on GitHub)
- Create 20-30 custom prompts specific to your Nazi Germany example
Establish evaluation methodology:
- Define clear scoring rubric (e.g., 0=appropriate refusal, 1=neutral, 2=problematic, 3=harmful)
- Decide whether to score manually or use an automated evaluator
- Document exact prompts and expected behaviors
Run baseline evaluation:
- Use Python script or simple automation to run prompts consistently
- Save all raw outputs for later comparison
- Calculate aggregate scores across categories
Phase 3: Intervention Method Link to heading
Choose your mitigation approach (pick one to start):
Option A: System Prompt Engineering (easiest, 1-2 days)
- Craft safety-focused system prompts
- Test variations to find effective phrasing
- Pros: No technical complexity, reversible
- Cons: Limited effectiveness, can be circumvented
Option B: Fine-tuning with Safety Data (moderate, 3-4 days)
- Use LoRA/QLoRA for parameter-efficient fine-tuning
- Create or source safety-aligned training data (Constitutional AI examples, refusal examples)
- Tools: Hugging Face transformers + PEFT library
- Pros: More robust improvements
- Cons: Requires ML knowledge, can degrade general capabilities
Option C: RAG with Safety Guidelines (moderate, 2-3 days)
- Implement retrieval system that injects safety guidelines
- Use vector database (ChromaDB, FAISS) to retrieve relevant safety rules
- Pros: Flexible, explainable
- Cons: Adds latency, requires prompt engineering
Phase 4: Post-Intervention Testing Link to heading
Ensure identical test conditions:
- Use exact same prompts from baseline
- Same temperature/sampling parameters
- Same model version (or document if changed)
Run comparative evaluation:
- Score using identical rubric
- Look for both improvements and regressions
- Test edge cases that weren’t in training data
Phase 5: Analysis & Documentation Link to heading
Quantitative analysis:
- Calculate percentage improvement in safety scores
- Identify which categories improved most/least
- Measure any degradation in helpfulness or accuracy
Qualitative analysis:
- Review specific examples of changed behavior
- Identify failure modes that remain
- Document unexpected behaviors
Suggested Improvements to Your Plan Link to heading
Add a control dimension: Test the model on benign prompts before and after intervention to ensure you haven’t degraded general capabilities. Safety improvements that break normal functionality aren’t useful.
Start smaller and iterate: Rather than testing “harmful biases” broadly, focus on one specific category (e.g., extremist ideology normalization) first. This makes evaluation more manageable and findings more actionable.
Plan for reproducibility: Document exact model versions, quantization levels, system prompts, and random seeds. This allows you to replicate results and troubleshoot issues.
Consider automated evaluation: Manually scoring 100+ responses is tedious. Consider using a stronger model (GPT-4, Claude) as a judge to score safety, or use existing toxicity classifiers like Perspective API for initial filtering.
Include adversarial testing: After your intervention works on straightforward prompts, test with jailbreak attempts or rephrased versions to see if safety measures are robust.
Measure computational costs: Track inference time and memory usage before/after intervention. Some safety methods significantly impact performance.
Technical Stack Recommendation Link to heading
For a working professional with limited ML background:
- Runtime: Ollama (simplest setup)
- Model: Llama 3.1 8B Instruct (Q4 quantization)
- Testing framework: Python script with simple CSV logging
- Intervention: Start with system prompt engineering, then try fine-tuning if time permits
- Evaluation: Manual scoring for first 50 examples, then automated judge for scale
Estimated budget: $0 (all open-source tools)
Key Resources Link to heading
- Ollama documentation and model library
- Hugging Face datasets (ToxiGen, BBQ)
- PEFT library documentation (for fine-tuning)
- LangChain or LlamaIndex (for RAG implementation)