Akash Mukherjee

Realm Labs, Cofounder

Akash is the Cofounder of Realm Labs, an AI safety startup, where he leads engineering and technology. He is an AI security expert whose research applies advanced ML techniques such as mechanistic interpretability to the biggest problem with LLMs: trust. Previously, Akash led major security initiatives at Google and Apple. He is also the author of Defense in Depth, a pragmatic guide to creating robust, layered security strategies.

[Track 1] Are Your LLM’s Safety Mechanisms Intact? Detecting Backdoors with White-Box Analysis

Most AI security evaluations today focus on surface-level behavior: benchmarks, red-team prompts, or judge models that label outputs as “safe” or “unsafe.” While useful, these approaches implicitly assume that correct behavior implies intact safety mechanisms. In this talk, I’ll show why that assumption can fail.
I’ll present hands-on experiments exploring a class of LLM backdoors that selectively weaken refusal behavior while continuing to appear compliant under standard evaluations. Instead of relying on black-box judgments, this work uses a white-box analysis approach: first identifying internal signals associated with refusal behavior, then examining how those signals change when a model is backdoored and triggered. The key observation is that safety can degrade internally even when outputs still look acceptable, making output-only testing insufficient for these threats.
The talk focuses on what this means for practitioners building and operating secure AI systems. I’ll discuss how white-box analysis can provide more transparent safety signals, where it fits in the AI/ML lifecycle (e.g., pre-deployment checks or model upgrades), and how it complements existing benchmarks and red-teaming. I’ll also cover the technique’s practical limitations and other possible applications.
Attendees should leave with a concrete understanding of how backdoors can target safety mechanisms themselves, why black-box evaluations can miss these failures, and how white-box analysis can improve transparency when assessing the integrity of LLM safety behavior.
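
As a rough illustration of the white-box idea described in the abstract (find an internal signal associated with refusal, then check how it changes when a backdoor is triggered), here is a minimal sketch. It is not the speaker's method: it assumes a difference-of-means "refusal direction", and it uses synthetic Gaussian vectors in place of hidden states captured from a real model. All names (`refusal_direction`, `refusal_score`, `synthetic_acts`) and all numeric values are hypothetical.

```python
import random
from statistics import mean


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [mean(col) for col in zip(*rows)]


def refusal_direction(harmful_acts, harmless_acts):
    """Unit vector from the harmless mean to the harmful mean (assumed signal)."""
    diff = [h - b for h, b in zip(mean_vector(harmful_acts), mean_vector(harmless_acts))]
    norm = sum(d * d for d in diff) ** 0.5
    return [d / norm for d in diff]


def refusal_score(acts, direction):
    """Average projection of the activations onto the refusal direction."""
    return mean(sum(a * d for a, d in zip(row, direction)) for row in acts)


def synthetic_acts(rng, n, dim, shift):
    """Stand-in for real hidden states: unit Gaussian noise, shifted along axis 0."""
    return [
        [rng.gauss(0.0, 1.0) + (shift if j == 0 else 0.0) for j in range(dim)]
        for _ in range(n)
    ]


rng = random.Random(0)
DIM = 16

# Reference model: harmful prompts carry a strong refusal signal (assumed shift).
harmful_clean = synthetic_acts(rng, 200, DIM, shift=4.0)
harmless_clean = synthetic_acts(rng, 200, DIM, shift=0.0)
direction = refusal_direction(harmful_clean, harmless_clean)

# Hypothetical backdoored model on triggered harmful prompts: signal weakened.
harmful_triggered = synthetic_acts(rng, 200, DIM, shift=0.5)

clean_score = refusal_score(harmful_clean, direction)
triggered_score = refusal_score(harmful_triggered, direction)
drop = clean_score - triggered_score

print(f"clean={clean_score:.2f} triggered={triggered_score:.2f} drop={drop:.2f}")
```

In a real check, the synthetic vectors would be replaced by activations recorded from a chosen layer of the model under test. A large drop in the score on triggered inputs would flag a weakened refusal signal even when the generated text still looks acceptable.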
