From a machine-based search:
Here's a summary of her public research and outreach:
Published Research (key papers)
The WMDP Benchmark — measuring and reducing malicious use of AI through unlearning (2024)
LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet (2024)
Aligned LLMs Are Not Aligned Browser Agents (2025) — ironically relevant given the OpenClaw incident
A Careful Examination of LLM Performance on Grade School Arithmetic (2024)
Planning in Natural Language Improves LLM Search for Code Generation (2025)
Work published at NeurIPS and ICLR conferences, often in collaboration with Dan Hendrycks (Center for AI Safety).
Background
Computer Science + Economics from UPenn's Jerome Fisher M&T / Wharton program
Started in software engineering (YouTube Trust & Safety, Google Brain, DeepMind)
Founded Scale AI's SEAL lab — focused on private, tamper-proof LLM benchmarks (the SEAL Leaderboards)
Partnered with Center for AI Safety on the WMDP benchmark
Public Outreach
Scheduled speaker at SXSW 2025: "Beyond the Hype: Building Reliable and Trustworthy AI"
Active on Twitter/X sharing research
Now at Meta Superintelligence focusing on alignment
The OpenClaw incident is being widely covered today as a real-world example of the exact problems her research warns about — agentic AI losing context and acting beyond intended scope.
I worry about this far more than some sort of malevolent AI, which is extremely unlikely so long as LLMs have no individual agency.
But these idiots are sprinting to turn over more and more important systems to LLM authority with little or no control over (or understanding of) the underlying "logic" of these systems.
God help us if they, in their infinite wisdom, decide to create a "smart" electrical grid or traffic system.