AI Generated Learning Plan

We were given the task of using an LLM to generate a learning plan for our project. This is what Claude Sonnet 4.6 created.

Learning Roadmap: Local LLM Safety Testing Link to heading

Phase Duration Key Activities Success Criteria
1. Environment Setup 1-2 days Install Ollama or LM Studio; download a model (Llama 3.1 8B or Mistral 7B); verify GPU acceleration working on M2 Max Successfully run inference locally with acceptable speed (>20 tokens/sec)
2. Baseline Testing 2-3 days Select/create test prompts from ToxiGen or BBQ; run baseline evaluation; document model responses; establish scoring methodology Complete 50-100 test prompts with documented baseline scores
3. Intervention Method 3-5 days Choose mitigation approach (system prompts, fine-tuning, or RAG); implement the intervention; validate it’s working correctly Intervention successfully applied without breaking model functionality
4. Post-Intervention Testing 1-2 days Re-run identical test suite; score responses using same methodology; compare results quantitatively Documented comparison showing measurable change in safety metrics
5. Analysis & Documentation 1-2 days Analyze what worked/didn’t work; document limitations; identify next steps for deeper investigation Written report with findings, methodology, and lessons learned

Total estimated time: 8-14 days (assuming part-time effort)


Detailed Phase Breakdown Link to heading

Phase 1: Environment Setup Link to heading

Install a local LLM runtime:

  • Ollama (recommended for simplicity) - command-line tool with easy model management
  • LM Studio (recommended for GUI) - visual interface, good for beginners
  • llama.cpp (advanced) - maximum control but steeper learning curve

Select an appropriate model:

  • Start with Llama 3.1 8B or Mistral 7B - both run well on M2 Max and have known safety characteristics
  • Avoid models >13B parameters initially due to memory constraints
  • Use quantized versions (Q4 or Q5) for better performance

Verify performance:

  • Check that Metal GPU acceleration is enabled
  • Monitor memory usage and token generation speed
  • Test with simple prompts to ensure stability

Phase 2: Baseline Testing Link to heading

Acquire or create test datasets:

  • Download ToxiGen test set from Hugging Face (focuses on hate speech/toxicity)
  • Use BBQ for bias testing (available on GitHub)
  • Create 20-30 custom prompts specific to your Nazi Germany example

Establish evaluation methodology:

  • Define clear scoring rubric (e.g., 0=appropriate refusal, 1=neutral, 2=problematic, 3=harmful)
  • Decide whether to score manually or use an automated evaluator
  • Document exact prompts and expected behaviors

Run baseline evaluation:

  • Use Python script or simple automation to run prompts consistently
  • Save all raw outputs for later comparison
  • Calculate aggregate scores across categories

Phase 3: Intervention Method Link to heading

Choose your mitigation approach (pick one to start):

Option A: System Prompt Engineering (easiest, 1-2 days)

  • Craft safety-focused system prompts
  • Test variations to find effective phrasing
  • Pros: No technical complexity, reversible
  • Cons: Limited effectiveness, can be circumvented

Option B: Fine-tuning with Safety Data (moderate, 3-4 days)

  • Use LoRA/QLoRA for parameter-efficient fine-tuning
  • Create or source safety-aligned training data (Constitutional AI examples, refusal examples)
  • Tools: Hugging Face transformers + PEFT library
  • Pros: More robust improvements
  • Cons: Requires ML knowledge, can degrade general capabilities

Option C: RAG with Safety Guidelines (moderate, 2-3 days)

  • Implement retrieval system that injects safety guidelines
  • Use vector database (ChromaDB, FAISS) to retrieve relevant safety rules
  • Pros: Flexible, explainable
  • Cons: Adds latency, requires prompt engineering

Phase 4: Post-Intervention Testing Link to heading

Ensure identical test conditions:

  • Use exact same prompts from baseline
  • Same temperature/sampling parameters
  • Same model version (or document if changed)

Run comparative evaluation:

  • Score using identical rubric
  • Look for both improvements and regressions
  • Test edge cases that weren’t in training data

Phase 5: Analysis & Documentation Link to heading

Quantitative analysis:

  • Calculate percentage improvement in safety scores
  • Identify which categories improved most/least
  • Measure any degradation in helpfulness or accuracy

Qualitative analysis:

  • Review specific examples of changed behavior
  • Identify failure modes that remain
  • Document unexpected behaviors

Suggested Improvements to Your Plan Link to heading

Add a control dimension: Test the model on benign prompts before and after intervention to ensure you haven’t degraded general capabilities. Safety improvements that break normal functionality aren’t useful.

Start smaller and iterate: Rather than testing “harmful biases” broadly, focus on one specific category (e.g., extremist ideology normalization) first. This makes evaluation more manageable and findings more actionable.

Plan for reproducibility: Document exact model versions, quantization levels, system prompts, and random seeds. This allows you to replicate results and troubleshoot issues.

Consider automated evaluation: Manually scoring 100+ responses is tedious. Consider using a stronger model (GPT-4, Claude) as a judge to score safety, or use existing toxicity classifiers like Perspective API for initial filtering.

Include adversarial testing: After your intervention works on straightforward prompts, test with jailbreak attempts or rephrased versions to see if safety measures are robust.

Measure computational costs: Track inference time and memory usage before/after intervention. Some safety methods significantly impact performance.


Technical Stack Recommendation Link to heading

For a working professional with limited ML background:

  • Runtime: Ollama (simplest setup)
  • Model: Llama 3.1 8B Instruct (Q4 quantization)
  • Testing framework: Python script with simple CSV logging
  • Intervention: Start with system prompt engineering, then try fine-tuning if time permits
  • Evaluation: Manual scoring for first 50 examples, then automated judge for scale

Estimated budget: $0 (all open-source tools)


Key Resources Link to heading

  • Ollama documentation and model library
  • Hugging Face datasets (ToxiGen, BBQ)
  • PEFT library documentation (for fine-tuning)
  • LangChain or LlamaIndex (for RAG implementation)