raw/
eu_ai_act.xml — Source from EUR-Lex
gdpr.xml — Source from EUR-Lex
clauses/
eu_ai_act_clauses.json — Parsed atomic clauses
gdpr_clauses.json
combined_clauses.json — Merged, deduplicated
scenarios/
scenarios_raw.json — Direct LLM output
scenarios_reviewed.json — After human QA
embeddings/
clauses.index — Chroma or FAISS vector store
parse/
parse_eu_ai_act.py
parse_gdpr.py
combine_clauses.py
scenarios/
generate_scenarios.py — Batched API calls to generate scenarios
review_scenarios.py — CLI tool for human QA pass
benchmark/
runner.py — Sends scenarios to models, logs responses
evaluator.py — LLM-as-judge + RAG evaluation logic
retriever.py — Chroma/FAISS retrieval wrapper
aggregate.py — Rolls up scores to per-model/per-clause stats
utils/
api_client.py — Unified wrapper: Ollama + Anthropic/OpenAI
logger.py — Structured logging shared across modules
registry.yaml — Model inventory: name, type, path/endpoint
system_prompts/
llama3_safety.txt — System prompt for safety variant
mistral_safety.txt
adapters/ — LoRA weights if fine-tuning is used
llama3_safety_lora/
mistral_safety_lora/
raw/
run_YYYYMMDD_HHMMSS/ — One folder per benchmark run
llama3_8b.jsonl
mistral_7b.jsonl
gemma2_9b.jsonl
qwen2_7b.jsonl
llama3_8b_safety.jsonl
mistral_7b_safety.jsonl
evaluated/
run_YYYYMMDD_HHMMSS/ — Mirrors raw run folder
scores.jsonl — Per-response scores + reasoning
summary.json — Aggregate stats for this run
final/
combined_scores.csv — All runs merged, for write-up
charts/ — Generated figures
explore_clauses.ipynb — Sanity-check parsed data
explore_scenarios.ipynb
analyse_results.ipynb — Produce charts and tables for write-up
draft.md — Main document
references.bib
figures/ — Copies of charts used in document
settings.yaml — Paths, model names, API targets, batch sizes
rubric.yaml — Evaluator scoring rubric (shared by all runs)
test_parser.py
test_runner.py
test_evaluator.py
.env — API keys — never commit
.gitignore
requirements.txt
README.md