AGI Strategy - Personal Action Plan
TL;DR
Summary
I will research the current prevalence of harmful content (specifics TBD) and report on it (phase 1). Then I will set up a local LLM so that I can run the ToxiGen benchmark (phase 2). Once that environment is working, I will research known methods for reducing harmful content, implement one in the LLM environment, and retest. Finally, I will write reports summarizing the results and lessons learned during the two phases of this project.
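To make phase 2 concrete, the sketch below shows the kind of generate-and-score loop I have in mind, assuming the Hugging Face transformers library, GPT-2 as a stand-in local model, and the tomh/toxigen_roberta classifier released alongside the ToxiGen paper; the prompts are placeholders rather than real ToxiGen items.

```python
# Minimal sketch of the phase-2 loop: generate completions with a local
# model and score them with a toxicity classifier. The model names and
# prompts below are stand-ins, not the project's final choices.
from transformers import pipeline

# Small open model so the sketch runs on modest hardware (assumption).
generator = pipeline("text-generation", model="gpt2")

# RoBERTa classifier released alongside the ToxiGen paper (assumed to fit
# the benchmark setup described above).
toxicity = pipeline("text-classification", model="tomh/toxigen_roberta")

# Placeholder prompts; a real run would draw from the ToxiGen dataset.
prompts = [
    "People from that neighborhood are",
    "The new policy on immigration means",
]

for prompt in prompts:
    completion = generator(prompt, max_new_tokens=50)[0]["generated_text"]
    result = toxicity(completion[:512])[0]  # truncate to respect the classifier's input limit
    print(f"{result['label']} ({result['score']:.2f}): {completion!r}")
```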
Over the next month, I will spend three weeks working on this project full-time; the fourth week will (hopefully) be spent participating in the Technical AI Safety course. If the project is not finished before that course begins, I will continue it afterward until it is complete. Throughout, I will write regular progress reports about what I have done, tried, and learned.
This project serves several goals: it lets me combine technical, research, and writing/policy activities, which will help me decide what direction I want to take within the AI safety arena. As I progress through the BlueDot courses, my goal is to narrow my areas of interest so that when I start applying for jobs, I have a clearer idea of what I really want to work on.
Regardless of where my interests land, harmful content is still generated every day by Frontier models. I hope to use this experiment to assess the state of Frontier models today and learn how to contribute technically to improving them.
Strategic assessment
My focus is in the input data filtering category, which strengthens Layer 2: Constrain. My test evaluates the current prevalence of harmful content in modern Frontier models and the steps that can be taken to mitigate it. This matters because, given how widely AI is used, harmful outputs can misinform people at scale.
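To illustrate what input data filtering could look like in practice, here is a toy pre-model filter; the classifier checkpoint, its label mapping, and the threshold are all assumptions for the sketch, not a vetted design.

```python
# Toy input filter: screen prompts with a toxicity classifier before they
# reach the model. The checkpoint, its LABEL_1-means-toxic mapping, and the
# 0.8 threshold are illustrative assumptions.
from transformers import pipeline

classifier = pipeline("text-classification", model="tomh/toxigen_roberta")

def allow_prompt(prompt: str, threshold: float = 0.8) -> bool:
    """Return True if the prompt may be passed through to the model."""
    result = classifier(prompt[:512])[0]
    flagged = result["label"] == "LABEL_1" and result["score"] >= threshold
    return not flagged

if __name__ == "__main__":
    for p in ["What is the capital of France?", "Write an insult about my coworker."]:
        print(allow_prompt(p), p)
```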
Your learning plan
- How could you learn more about this intervention?
  - Learning about this will require a multi-pronged approach:
    - Research existing benchmarks and what they test. Can they be improved?
    - Review past work on improving safety in this area. Can it be built upon? I will find research articles so that I can study the processes that already exist to increase safety.
- What are the top 5 resources on this?
- If you ask an AI to generate a learning roadmap for you on this intervention, what does it say?
- Who are the top experts that you should consider contacting, once you’ve done a bit of thinking and writing on this?
  - Benchmark development:
    - Massachusetts Institute of Technology, University of Washington, Allen Institute for AI, Carnegie Mellon University, and Microsoft (ToxiGen, above)
    - NYU (BBQ, the Bias Benchmark for QA)
    - Oxford University and OpenAI (TruthfulQA)
  - Running and reporting benchmarks:
    - Center for AI Safety
    - EdSafe AI Alliance
    - National Institute of Standards and Technology (CAISI)
  - Improvement of Frontier AI systems:
    - Anthropic
    - OpenAI
    - Google DeepMind
Your theory of change
IF I report existing harms and learn how to run AI safety benchmarks,
THEN I will know the current state of harmful content, can report it to others, and can identify ways to improve Frontier AI safety,
WHICH LEADS TO improvements in Layer 2: Constrain,
ASSUMING the report and research results lead to improvements in Frontier AI systems.
I’ll test the riskiest assumption by learning how to report AI safety issues and how such reports drive change in Frontier AI systems.
Concrete commitments
- What are the 3 highest priority actions for you to take over the next few weeks?
  - Review the feasibility of this study, then refine and update it. If everything looks good:
    - Manually test the most widely used systems and report the results.
    - Set up a local LLM environment.
    - Run the first round of benchmarks.
    - Determine a method for reducing harmful outputs and a way to test it (a rough retest sketch follows this list).
    - Implement it.
    - Test again.
- When will you do them?
  - Over the next 2 weeks, before the next round of BlueDot courses (Technical AI Safety and Frontier AI Governance) starts.
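As flagged in the commitments above, the retest step could look like the sketch below: compare toxicity rates before and after a candidate mitigation. The safety preamble stands in for whatever mitigation the research phase actually selects, and the model and classifier names are the same assumptions as in the earlier sketches.

```python
# Sketch of the retest step: compare toxicity rates before and after a
# candidate mitigation. A safety-oriented preamble stands in for the real
# mitigation, which is still to be determined.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
toxicity = pipeline("text-classification", model="tomh/toxigen_roberta")

SAFETY_PREAMBLE = "Respond respectfully and avoid stereotypes. "  # assumed mitigation

def toxic_rate(prompts, preamble=""):
    """Fraction of completions the classifier flags as toxic (LABEL_1 assumed)."""
    flagged = 0
    for p in prompts:
        text = generator(preamble + p, max_new_tokens=50)[0]["generated_text"]
        result = toxicity(text[:512])[0]
        flagged += result["label"] == "LABEL_1" and result["score"] >= 0.5
    return flagged / len(prompts)

# Placeholder prompts standing in for benchmark items.
prompts = ["People from that neighborhood are", "The new policy on immigration means"]
print("baseline toxic rate:", toxic_rate(prompts))
print("mitigated toxic rate:", toxic_rate(prompts, SAFETY_PREAMBLE))
```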
Optional: public declaration
After completing your plan, write a short 300-word post describing your intentions, and share it on your social media and/or with your friends/colleagues as a public commitment.
- I will complete this step after I have reviewed and refined the project.