I am Asma Ghandeharioun, a senior research scientist at Google DeepMind. I work on aligning AI with human values through better understanding [1] and controlling (language) models [2], and uniquely by demystifying their inner workings [3] and correcting collective misconceptions along the way [4, 5].
I received my Ph.D. from the Affective Computing Group, MIT Media Lab. I am fortunate to have had Roz as my advisor. In addition, I have had research experiences at Google Research, Microsoft Research, and EPFL, many of which have evolved into exciting long-term collaborations.
In a previous life, I conducted research in the digital mental health and wellbeing space, collaborated with medical professionals from Harvard and renowned hospitals in the Boston area such as Massachusetts General Hospital (MGH) and Brigham and Women’s Hospital (BWH), and published in venues such as Frontiers in Psychiatry and Psychology of Well-being. This area remains close to my heart, and I occasionally dabble in it during my free time.
Understanding the internal representations of large language models (LLMs) can help explain models’ behavior and verify their alignment with human values. Given the capabilities of LLMs in generating human-understandable text, we propose leveraging the model itself to explain its internal representations in natural language. We introduce a framework called Patchscopes and show how it can be used to answer a wide range of questions about an LLM’s computation. We show that many prior interpretability methods based on projecting representations into the vocabulary space and intervening on the LLM computation can be viewed as instances of this framework. Moreover, several of their shortcomings, such as failure to inspect early layers or lack of expressivity, can be mitigated by Patchscopes. Beyond unifying prior inspection techniques, Patchscopes also opens up new possibilities such as using a more capable model to explain the representations of a smaller model, and self-correction in multi-hop reasoning, even outperforming chain-of-thought prompting.
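For intuition, here is a minimal sketch of the kind of activation patching this framework builds on: a hidden representation from a source prompt is injected into an identity-style target prompt and verbalized by the model itself. It assumes GPT-2 via Hugging Face transformers; the prompts, layer choices, and decoding are illustrative assumptions, not the paper's exact setup.

```python
# Minimal activation-patching sketch (illustrative, not the paper's implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

SOURCE_PROMPT = "The Eiffel Tower is located in the city of"
# Identity-style target prompt; the final token is the slot we overwrite.
TARGET_PROMPT = "cat -> cat; 1350 -> 1350; hello -> hello; x"
SOURCE_LAYER, TARGET_LAYER = 6, 6  # arbitrary choices for illustration

# 1) Grab the hidden state of the source prompt's last token after SOURCE_LAYER.
src = tok(SOURCE_PROMPT, return_tensors="pt")
with torch.no_grad():
    hidden = model(**src, output_hidden_states=True).hidden_states
patch_vec = hidden[SOURCE_LAYER + 1][0, -1]  # output of block SOURCE_LAYER

# 2) Re-run the model on the target prompt, overwriting the last position's
#    residual stream at TARGET_LAYER with the source representation.
tgt = tok(TARGET_PROMPT, return_tensors="pt")
slot = tgt["input_ids"].shape[1] - 1

def patch_hook(module, inputs, output):
    hidden_states = output[0]
    if hidden_states.shape[1] == slot + 1:  # only on the prompt's forward pass
        hidden_states = hidden_states.clone()
        hidden_states[0, slot] = patch_vec
        return (hidden_states,) + output[1:]
    return output

handle = model.transformer.h[TARGET_LAYER].register_forward_hook(patch_hook)
with torch.no_grad():
    out = model.generate(**tgt, max_new_tokens=5, do_sample=False)
handle.remove()

# 3) Decode the continuation; ideally it verbalizes information carried by the
#    source representation (e.g. something about Paris).
print(tok.decode(out[0, tgt["input_ids"].shape[1]:]))
```

The few-shot "x -> x" pattern is one example of a target prompt that encourages the model to describe whatever is patched in; the framework leaves the choice of target prompt, layer, and even target model open.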
Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models
Peter Hase, Mohit Bansal, Been Kim, and Asma Ghandeharioun
Advances in Neural Information Processing Systems (NeurIPS), 2023
Language models learn a great quantity of factual information during pretraining, and recent work localizes this information to specific model weights like mid-layer MLP weights. In this paper, we find that we can change how a fact is stored in a model by editing weights that are in a different location than where existing methods suggest that the fact is stored. This is surprising because we would expect that localizing facts to specific model parameters would tell us where to manipulate knowledge in models, and this assumption has motivated past work on model editing methods. Specifically, we show that localization conclusions from representation denoising (also known as Causal Tracing) do not provide any insight into which model MLP layer would be best to edit in order to override an existing stored fact with a new one. This finding raises questions about how past work relies on Causal Tracing to select which model layers to edit. Next, we consider several variants of the editing problem, including erasing and amplifying facts. For one of our editing problems, editing performance does relate to localization results from representation denoising, but we find that which layer we edit is a far better predictor of performance. Our results suggest, counterintuitively, that better mechanistic understanding of how pretrained language models work may not always translate to insights about how to best change their behavior.
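To make the "representation denoising" idea concrete, here is a simplified sketch of Causal Tracing with GPT-2: corrupt the subject token embeddings, then restore one clean hidden state at a time and measure how much of the answer probability returns. The model, prompt, noise scale, and scoring below are illustrative assumptions and much simpler than the procedure used in the literature this paper examines.

```python
# Toy causal-tracing sketch (illustrative; simplified relative to ROME-style tracing).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "The Eiffel Tower is located in the city of"
answer_id = tok(" Paris")["input_ids"][0]
ids = tok(prompt, return_tensors="pt")
n_tokens = ids["input_ids"].shape[1]
subject_positions = list(range(len(tok("The Eiffel Tower")["input_ids"])))

# Clean run: cache hidden states and the answer probability.
with torch.no_grad():
    clean = model(**ids, output_hidden_states=True)
clean_hidden = clean.hidden_states  # tuple: embeddings + one entry per block
p_clean = clean.logits[0, -1].softmax(-1)[answer_id]

def corrupt_hook(module, inputs, output):
    # Add noise to the subject token embeddings to destroy the fact.
    output = output.clone()
    output[0, subject_positions] += 0.5 * torch.randn_like(output[0, subject_positions])
    return output

def make_restore_hook(layer, position):
    def hook(module, inputs, output):
        hidden = output[0].clone()
        hidden[0, position] = clean_hidden[layer + 1][0, position]
        return (hidden,) + output[1:]
    return hook

# Restore one (layer, token) state at a time in the corrupted run and see how
# much of the clean answer probability comes back.
for layer in range(model.config.n_layer):
    for pos in range(n_tokens):
        h1 = model.transformer.wte.register_forward_hook(corrupt_hook)
        h2 = model.transformer.h[layer].register_forward_hook(make_restore_hook(layer, pos))
        with torch.no_grad():
            p = model(**ids).logits[0, -1].softmax(-1)[answer_id]
        h1.remove(); h2.remove()
        print(f"layer {layer:2d} pos {pos:2d}: p(answer)/p_clean = {(p / p_clean).item():.2f}")
```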
Interpretability illusions in the generalization of simplified models
Dan Friedman, Andrew Kyle Lampinen, Lucas Dixon, Danqi Chen, and Asma Ghandeharioun
International Conference on Machine Learning (ICML), to appear, 2024
A common method to study deep learning systems is to create simplified representations, for example using singular value decomposition to visualize the model’s hidden states in a lower-dimensional space. This approach assumes that the simplified model is faithful to the original model. Here, we illustrate an important caveat to this assumption: even if a simplified representation of the model can accurately approximate the original model on the training set, it may fail to match its behavior out of distribution; the understanding developed from simplified representations may be an illusion. We illustrate this by training Transformer models on controlled datasets with systematic generalization splits, focusing on the Dyck balanced-parenthesis languages. We simplify these models using tools like dimensionality reduction and clustering, and find clear patterns in the resulting representations. We then explicitly test how well these simplified proxy models match the original models’ behavior on various out-of-distribution test sets. Generally, the simplified proxies are less faithful out of distribution. For example, in cases where the original model generalizes to novel structures or deeper depths, the simplified model may fail to generalize, or may generalize too well. We then show the generality of these results: even model simplifications that do not directly use data can be less faithful out of distribution, and other tasks can also yield generalization gaps. Our experiments raise questions about the extent to which mechanistic interpretations derived using tools like SVD can reliably predict what a model will do in novel situations.
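The following toy sketch shows the faithfulness gap in miniature, using a small MLP and a rank-r SVD proxy of its hidden layer rather than the paper's Transformer-on-Dyck setup; the data, architecture, rank, and distribution shift are all invented for illustration, and the size of the gap will vary from run to run.

```python
# Toy illustration: a low-rank (SVD) proxy of a model's hidden layer can agree
# with the model in-distribution yet diverge on out-of-distribution inputs.
import torch, torch.nn as nn

torch.manual_seed(0)
d_in, d_hidden, rank = 16, 64, 4

# Train a small MLP on a simple synthetic task.
model = nn.Sequential(nn.Linear(d_in, d_hidden), nn.ReLU(), nn.Linear(d_hidden, 2))
x_train = torch.randn(2000, d_in)
y_train = (x_train[:, :2].sum(-1) > 0).long()
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
for _ in range(300):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(x_train), y_train)
    loss.backward(); opt.step()

def hidden(x):  # activations after the ReLU
    return torch.relu(model[0](x))

# Fit a rank-r approximation of the hidden layer on the training data.
with torch.no_grad():
    H = hidden(x_train)
    U, S, Vt = torch.linalg.svd(H - H.mean(0), full_matrices=False)
    P = Vt[:rank].T @ Vt[:rank]  # projection onto the top-r subspace

def predict(x, simplified=False):
    with torch.no_grad():
        h = hidden(x)
        if simplified:
            h = (h - H.mean(0)) @ P + H.mean(0)  # simplified proxy model
        return model[2](h).argmax(-1)

def agreement(x):
    return (predict(x) == predict(x, simplified=True)).float().mean().item()

x_id = torch.randn(2000, d_in)         # in-distribution inputs
x_ood = 4.0 * torch.randn(2000, d_in)  # shifted / scaled inputs
print(f"proxy agrees with model in-distribution:     {agreement(x_id):.2f}")
print(f"proxy agrees with model out-of-distribution: {agreement(x_ood):.2f}")
```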
Do machine learning models memorize or generalize?
Adam Pearce, Asma Ghandeharioun, Nada Hussein, Nithum Thain, Martin Wattenberg, and Lucas Dixon
Large Language Models (LLMs) have demonstrated remarkable capabilities in performing complex tasks. Moreover, recent research has shown that incorporating human-annotated rationales (e.g., Chain-of-Thought prompting) during in-context learning can significantly enhance the performance of these models, particularly on tasks that require reasoning capabilities. However, incorporating such rationales poses challenges in terms of scalability as this requires a high degree of human involvement. In this work, we present a novel framework, Amplifying Model Performance by Leveraging In-Context Learning with Post Hoc Explanations (AMPLIFY), which addresses the aforementioned challenges by automating the process of rationale generation. To this end, we leverage post hoc explanation methods which output attribution scores (explanations) capturing the influence of each of the input features on model predictions. More specifically, we construct automated natural language rationales that embed insights from post hoc explanations to provide corrective signals to LLMs. Extensive experimentation with real-world datasets demonstrates that our framework, AMPLIFY, leads to prediction accuracy improvements of about 10-25% over a wide range of tasks, including those where prior approaches which rely on human-annotated rationales such as Chain-of-Thought prompting fall short. Our work makes one of the first attempts at highlighting the potential of post hoc explanations as valuable tools for enhancing the effectiveness of LLMs. Furthermore, we conduct additional empirical analyses and ablation studies to demonstrate the impact of each of the components of AMPLIFY, which, in turn, lead to critical insights for refining in-context learning.
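The rationale-construction step can be pictured with a small sketch: score input words with a simple post hoc attribution (leave-one-out here, standing in for the explanation methods studied in the paper), keep the top-scoring words, and fold them into a natural-language rationale inside the few-shot prompt. The scorer, template, and examples below are invented for illustration; only the resulting prompt would be sent to an LLM.

```python
# Sketch of building attribution-based rationales for an in-context prompt.

def score_fn(text):
    """Stand-in 'model confidence' for the positive class."""
    positive = {"great", "love", "loved", "wonderful", "heartfelt"}
    words = text.lower().split()
    return sum(w.strip(",.!?") in positive for w in words) / max(len(words), 1)

def top_k_words(text, k=2):
    """Leave-one-out attribution: drop each word and measure the score change."""
    words = text.split()
    base = score_fn(text)
    attributions = []
    for i, w in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        attributions.append((base - score_fn(reduced), w))
    return [w for _, w in sorted(attributions, reverse=True)[:k]]

def rationale_example(text, label):
    keys = ", ".join(top_k_words(text))
    return (f"Review: {text}\n"
            f"The key words: {keys} are clues for predicting the label.\n"
            f"Label: {label}\n")

demos = [("I loved this movie, the acting was great", "positive"),
         ("A dull plot and wooden performances", "negative")]
prompt = "\n".join(rationale_example(t, y) for t, y in demos)
prompt += "\nReview: A wonderful, heartfelt film\nLabel:"
print(prompt)  # this augmented prompt would then be sent to the LLM
```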
DISSECT: Disentangled simultaneous explanations via concept traversals
Asma Ghandeharioun, Been Kim, Chun-Liang Li, Brendan Jou, Brian Eoff, and Rosalind W. Picard
In International Conference on Learning Representations (ICLR), 2021
Explaining deep learning model inferences is a promising avenue for scientific understanding, improving safety, uncovering hidden biases, evaluating fairness, and beyond, as argued by many scholars. One of the principal benefits of counterfactual explanations is allowing users to explore "what-if" scenarios through what does not and cannot exist in the data, a capability that many other forms of explanation, such as heatmaps and influence functions, inherently lack. However, most previous work on generative explainability cannot disentangle important concepts effectively, produces unrealistic examples, or fails to retain relevant information. We propose a novel approach, DISSECT, that jointly trains a generator, a discriminator, and a concept disentangler to overcome such challenges using little supervision. DISSECT generates Concept Traversals (CTs), defined as a sequence of generated examples with increasing degrees of concepts that influence a classifier’s decision. By training a generative model from a classifier’s signal, DISSECT offers a way to discover a classifier’s inherent "notion" of distinct concepts automatically rather than rely on user-predefined concepts. We show that DISSECT produces CTs that (1) disentangle several concepts, (2) are influential to a classifier’s decision and are coupled to its reasoning due to joint training, (3) are realistic, (4) preserve relevant information, and (5) are stable across similar inputs. We validate DISSECT on several challenging synthetic and realistic datasets where previous methods fall short of satisfying desirable criteria for interpretability and show that it performs consistently well. Finally, we present experiments showing applications of DISSECT for detecting potential biases of a classifier and identifying spurious artifacts that impact predictions.
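As a rough picture of what a concept traversal is (not of DISSECT's joint training of the generator, discriminator, and disentangler), the sketch below sweeps a single concept code of a generator and records how a classifier's score changes; both networks here are untrained stand-ins, so the numbers are only mechanically illustrative.

```python
# Toy sketch of a concept traversal: vary one concept code, watch the classifier.
import torch, torch.nn as nn

torch.manual_seed(0)
z_dim, c_dim, x_dim = 8, 1, 32

generator = nn.Sequential(nn.Linear(z_dim + c_dim, 64), nn.ReLU(), nn.Linear(64, x_dim))
classifier = nn.Sequential(nn.Linear(x_dim, 1), nn.Sigmoid())

z = torch.randn(1, z_dim)               # fixed "content" code
for alpha in torch.linspace(-2, 2, 5):  # increasing degree of the concept
    x = generator(torch.cat([z, alpha.view(1, 1)], dim=-1))
    p = classifier(x).item()
    print(f"concept degree {alpha.item():+.1f} -> classifier score {p:.3f}")
```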
Approximating interactive human evaluation with self-play for open-domain dialog systems
Asma Ghandeharioun*, Judy Hanwen Shen*, Natasha Jaques*, Craig Ferguson, Noah Jones, Agata Lapedriza, and Rosalind W. Picard
In Advances in Neural Information Processing Systems (NeurIPS), 2019
Building an open-domain conversational agent is a challenging problem. Current evaluation methods, mostly post-hoc judgments of static conversations, do not capture conversation quality in a realistic interactive context. In this paper, we investigate interactive human evaluation and provide evidence for its necessity; we then introduce a novel, model-agnostic, and dataset-agnostic method to approximate it. In particular, we propose a self-play scenario where the dialog system talks to itself and we calculate a combination of proxies such as sentiment and semantic coherence on the conversation trajectory. We show that this metric is capable of capturing the human-rated quality of a dialog model better than any automated metric known to date, achieving a significant Pearson correlation (r>.7, p<.05). To investigate the strengths of this novel metric and interactive evaluation in comparison to state-of-the-art metrics and human evaluation of static conversations, we perform extended experiments with a set of models, including several that make novel improvements to recent hierarchical dialog generation architectures through sentiment and semantic knowledge distillation on the utterance level. Finally, we open-source the interactive evaluation platform we built and the dataset we collected to allow researchers to efficiently deploy and evaluate dialog models.
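A minimal sketch of the self-play scoring loop is below, with a canned stand-in for the dialog model and crude lexicon and bag-of-words proxies in place of the paper's sentiment and semantic coherence measures; the combination rule is also a placeholder.

```python
# Self-play evaluation sketch: roll out a conversation of the model with itself
# and average simple sentiment/coherence proxies over the trajectory.
import math
from collections import Counter

def generate_reply(history):
    # Placeholder: in practice this is the dialog model responding to its own turns.
    canned = ["How are you doing today?", "I am doing well, thanks for asking!",
              "That is great to hear.", "What do you like to do for fun?"]
    return canned[len(history) % len(canned)]

def sentiment(utterance):
    positive = {"well", "great", "thanks", "fun", "good"}
    negative = {"bad", "hate", "boring", "awful"}
    words = [w.strip("?!.,").lower() for w in utterance.split()]
    return (sum(w in positive for w in words) - sum(w in negative for w in words)) / max(len(words), 1)

def coherence(a, b):
    """Cosine similarity of bag-of-words vectors between consecutive turns."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

history = []
for _ in range(6):
    history.append(generate_reply(history))

score = (sum(sentiment(u) for u in history) / len(history)
         + sum(coherence(a, b) for a, b in zip(history, history[1:])) / (len(history) - 1))
print(f"self-play quality proxy: {score:.3f}")
```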
Despite the transformational success of machine learning across various applications, examples of deployed models failing to recognize and support human-centered (HC) criteria are abundant. In this thesis, I conceptualize the space of human-machine collaboration with respect to two components: interpretation of people by machines and interpretation of machines by people. I develop several tools that make improvements along these axes. First, I develop a pipeline that predicts depressive symptoms rated by clinicians from real-world longitudinal data outperforming several baselines. Second, I introduce a novel, model-agnostic, and dataset-agnostic method to approximate interactive human evaluation in open-domain dialog through self-play that is more strongly correlated with human evaluations than other automated metrics commonly used today. While dialog quality evaluation metrics predominantly use word-level overlap or distance metrics based on embedding resemblance to each turn of the conversation, I show the significance of taking into account the conversation’s trajectory and using proxies such as sentiment, semantics, and user engagement that are psychologically motivated. Third, I demonstrate an uncertainty measurement technique that helps disambiguate annotator disagreement and data bias. I show that this characterization also improves model performance. Finally, I present a novel method that allows humans to investigate a predictor’s decision-making process to gain better insight into how it works. The method jointly trains a generator, a discriminator, and a concept disentangler, allowing the human to ask "what-if" questions. I evaluate it on several challenging synthetic and realistic datasets where previous methods fall short of satisfying desirable criteria for interpretability and show that our method performs consistently well across all. I discuss its applications to detect potential biases of a classifier and identify spurious artifacts that impact predictions using simulated experiments. Together, these novel techniques and insights provide a more comprehensive interpretation of people by machines and more powerful tools for interpretation of machines by people that can move us closer to HC optimality.