
Self-Explanation in Social AI Agents: A Hybrid Knowledge-Generative AI Approach

Analysis of a computational technique enabling AI social assistants to introspect and explain their reasoning using self-models and generative AI, enhancing transparency in online learning.
agi-friend.com | PDF Size: 2.2 MB

1. Introduction & Overview

This paper addresses a critical challenge in the deployment of Social AI agents, particularly in sensitive domains like online education. The authors focus on SAMI (Social Agent Mediated Interaction), an AI assistant designed to foster social connections among learners in large-scale online classes. While such agents can mitigate the well-documented issue of low social presence, they introduce a new problem: opacity. Students interacting with SAMI naturally question how and why it makes specific recommendations (e.g., connecting two learners). The core research question is: How can an AI social assistant provide transparent, understandable explanations of its internal reasoning to build user trust?

The proposed solution is a novel self-explanation technique. This is framed as a natural language question-answering process where the agent introspects on a structured self-model of its own goals, knowledge, and methods. The key innovation is a hybrid architecture that marries the structured, interpretable representations of knowledge-based AI with the flexible, natural language generation capabilities of generative AI (specifically, ChatGPT).

2. Core Methodology & Architecture

The self-explanation pipeline is a multi-stage process designed to translate internal agent logic into user-friendly narratives.

2.1. The Self-Model: Task, Method, Knowledge (TMK) Framework

The foundation of self-explanation is a computable self-model. The authors adapt the TMK framework, where an agent's functionality is decomposed into:

  • Tasks (T): High-level objectives (e.g., "Increase social connectedness").
  • Methods (M): Procedures or algorithms to achieve tasks (e.g., "Find learners with shared interests").
  • Knowledge (K): Data or beliefs used by methods (e.g., "Learner A's interest: Machine Learning").

A critical adaptation is the representation of TMK elements not as formal logical propositions but as short natural language descriptions. This bridges the gap between the agent's symbolic structure and the generative model's language space.
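As a sketch, TMK elements represented this way might be held as plain records whose `description` fields are the natural-language snippets used later for retrieval. The class and field names below are illustrative, not taken from the paper:

```python
from dataclasses import dataclass

@dataclass
class TMKElement:
    """One element of the agent's self-model: a Task, Method, or Knowledge item."""
    kind: str         # "task", "method", or "knowledge"
    ident: str        # e.g. "T1", "M1", "K1"
    description: str  # short natural-language description, not a logical formula

# Illustrative self-model fragment (wording follows the examples above)
self_model = [
    TMKElement("task", "T1", "Increase social connectedness among learners."),
    TMKElement("method", "M1", "Find learners with shared interests."),
    TMKElement("knowledge", "K1", "Learner A's interest: Machine Learning."),
]

def describe(model):
    """Render the self-model as the plain-English snippets used for retrieval."""
    return [f"{e.ident} ({e.kind}): {e.description}" for e in model]
```

Because each element is just text plus a type tag, the same records can feed both a similarity search and a prompt template without any symbolic-to-language translation layer.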

2.2. Hybrid Explanation Generation: Combining Knowledge-Based and Generative AI

The explanation generation process involves five key steps:

  1. Input: User poses a natural language question (e.g., "Why did you connect me with Alex?").
  2. Retrieval: A similarity search is performed between the question and the English descriptions in the TMK self-model to identify the most relevant snippets of self-knowledge.
  3. Introspection: A Chain of Thought (CoT) process is employed to "walk through" the relevant parts of the TMK model, reconstructing the logical steps the agent took.
  4. Generation: The structured CoT output and retrieved knowledge snippets are formatted into a prompt for a large language model (ChatGPT).
  5. Output: ChatGPT generates a coherent, natural language explanation delivered back to the user.

This hybrid approach leverages the precision and verifiability of the knowledge-based self-model to ground the explanation, while using generative AI for the fluency and adaptability of the final narrative.
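The five steps above can be sketched as glue code, with stand-in components where the real system would call an embedding model and ChatGPT. All names and the stub behaviors here are hypothetical:

```python
def explain(question, self_model, retrieve, build_cot, llm):
    """End-to-end self-explanation pipeline (hypothetical glue code).

    retrieve(question, self_model) -> relevant TMK snippets      (step 2)
    build_cot(snippets)            -> structured reasoning trace (step 3)
    llm(prompt)                    -> natural-language answer    (steps 4-5)
    """
    snippets = retrieve(question, self_model)   # similarity search over descriptions
    trace = build_cot(snippets)                 # introspective walk-through
    prompt = (trace + "\n\nBased on the reasoning above, "
              "explain this to the student: " + question)
    return llm(prompt)

# Example with trivial stand-ins (a real system would call an LLM API here)
model = ["T1: Increase social connectedness.", "M1: Match shared interests."]
answer = explain(
    "Why did you connect me with Alex?",
    model,
    retrieve=lambda q, m: m,                    # toy: return the whole model
    build_cot=lambda s: "The agent's goal was: " + s[0],
    llm=lambda p: "Echo: " + p[:30],            # toy: echo the prompt prefix
)
```

The point of the decomposition is that the generative model only ever sees text that was first grounded in the self-model, which is what keeps the final narrative verifiable.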

3. Technical Implementation & Details

3.1. Mathematical Formulation of Similarity Search

The retrieval step is crucial for efficiency. Given a user query $q$ and a set of $N$ TMK description vectors $\{d_1, d_2, ..., d_N\}$ (e.g., from a sentence embedding model like Sentence-BERT), the system retrieves the top-$k$ most relevant descriptions. The relevance score is typically computed using cosine similarity:

$\text{similarity}(q, d_i) = \frac{q \cdot d_i}{\|q\| \|d_i\|}$

where $q$ and $d_i$ are vector representations in a shared semantic space. The top-$k$ descriptions with the highest similarity scores are passed to the next stage. This ensures the explanation is focused on the agent's reasoning relevant to the query, not its entire model.
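A minimal retrieval sketch, assuming embeddings are already computed. A real system would obtain them from a sentence encoder such as Sentence-BERT; the toy 2-D vectors below are illustrative only:

```python
import math

def cosine(q, d):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in d))
    return dot / norm if norm else 0.0

def top_k(query_vec, descriptions, k=2):
    """Return the k TMK descriptions most similar to the query vector.

    descriptions: list of (text, embedding) pairs.
    """
    ranked = sorted(descriptions, key=lambda td: cosine(query_vec, td[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy 2-D "embeddings" (illustrative only)
docs = [
    ("T1: Foster social connections.", [1.0, 0.1]),
    ("M1: Match on shared interests.", [0.9, 0.3]),
    ("K9: Unrelated bookkeeping.",     [0.0, 1.0]),
]
relevant = top_k([1.0, 0.0], docs)  # the two connection-related snippets rank first
```

Restricting the prompt to these top-$k$ snippets is what keeps the explanation focused on the query-relevant slice of the self-model rather than the agent's entire TMK structure.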

3.2. Chain of Thought Prompting for Introspection

The CoT process transforms the retrieved TMK snippets into a structured reasoning trace. For a retrieved task $T_1$, method $M_1$, and knowledge items $K_1, K_2$, the CoT prompt might be engineered as:

"The agent's goal (Task) was: [T_1 description].
To achieve this, it used a method: [M_1 description].
This method required knowing: [K_1 description] and [K_2 description].
Therefore, the agent's decision was based on..."

This structured trace is then fed to ChatGPT with an instruction like: "Based on the following structured reasoning steps, generate a clear, concise explanation for a student."
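Assembling that trace is essentially template filling over the retrieved snippets. A minimal sketch, using the prompt wording quoted above (the function name and argument layout are assumptions):

```python
def build_cot_prompt(task, method, knowledge_items):
    """Assemble the structured reasoning trace from retrieved TMK snippets,
    followed by the generation instruction for the LLM."""
    knowing = " and ".join(knowledge_items)
    return (
        f"The agent's goal (Task) was: {task}\n"
        f"To achieve this, it used a method: {method}\n"
        f"This method required knowing: {knowing}\n"
        "Therefore, the agent's decision was based on the steps above.\n\n"
        "Based on the following structured reasoning steps, "
        "generate a clear, concise explanation for a student."
    )

prompt = build_cot_prompt(
    "Foster social connections based on profile similarity.",
    "Calculate interest overlap using Jaccard similarity.",
    ["Bob's interests: Jazz Music, Python Programming.",
     "Alice's interests: Blues Music, Data Science."],
)
```

Because every slot in the template is filled from a retrieved TMK description, the LLM's input is constrained to the agent's actual self-knowledge, which is the grounding mechanism behind the correctness results reported below.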

4. Experimental Evaluation & Results

4.1. Evaluation Metrics: Completeness & Correctness

The authors evaluated the self-explanations along two primary dimensions:

  • Completeness: Does the explanation cover all relevant steps in the agent's decision process as defined by the TMK model? This was assessed by mapping explanation content back to the TMK elements.
  • Correctness: Does the explanation accurately reflect the agent's actual process, without introducing hallucinations or contradictions? This required expert verification against the agent's code/logs.
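The paper's completeness assessment maps explanation content back to TMK elements by hand. A crude automated proxy for that mapping might score the fraction of relevant elements signalled in the explanation; the keyword choices below are a hypothetical stand-in for the manual judgment:

```python
def completeness(explanation, element_keywords):
    """Fraction of relevant TMK elements signalled in the explanation.

    element_keywords maps a TMK element id to a phrase whose presence we
    treat as evidence the element was covered (a rough proxy, not the
    paper's actual procedure).
    """
    text = explanation.lower()
    covered = [eid for eid, kw in element_keywords.items() if kw.lower() in text]
    return len(covered) / len(element_keywords)

explanation = ("I connected you with Alice because you share interests "
               "in music and programming topics.")
keywords = {"T1": "connected", "M1": "share interests", "K1": "music"}
score = completeness(explanation, keywords)  # 1.0: all three elements covered
```

Correctness, by contrast, cannot be scored this way, since it requires checking the explanation against the agent's actual code and logs rather than against the self-model's text.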

Key Evaluation Insight

The hybrid approach showed high scores in correctness because the generative model was tightly constrained by the retrieved TMK data. Completeness was more variable, depending on the quality of the similarity search and the prompt engineering for CoT.

4.2. Results from Live Class Deployment

The system was deployed in a live online class. While specific quantitative results are not detailed in the provided excerpt, the paper reports on this deployment, suggesting a focus on qualitative or preliminary real-world validation. The deployment itself is a significant result, demonstrating the practical feasibility of the approach in a dynamic educational environment. Future work would benefit from A/B testing that measures trust metrics (e.g., user surveys on perceived transparency and reliability) between groups that receive explanations and groups that do not.

Hypothetical Chart Description: A bar chart comparing "Explanation Quality" scores (Completeness and Correctness on a 1-5 scale) for the Hybrid TMK+ChatGPT method versus a baseline of using ChatGPT alone with only the user query. The hybrid method would show significantly higher bars for Correctness, demonstrating the grounding effect of the self-model.

5. Analysis Framework & Example Case

Scenario: SAMI connects Student Bob (interested in "Jazz Music" and "Python Programming") with Student Alice (interested in "Blues Music" and "Data Science").

User Query: "Why was I connected to Alice?"

Internal TMK Self-Model Snippets (Retrieved):

  • T1: "Foster social connections based on profile similarity."
  • M1: "Calculate interest overlap using Jaccard similarity on interest tags."
  • K1 (Bob): "Interests: Jazz Music, Python Programming."
  • K2 (Alice): "Interests: Blues Music, Data Science."
  • K3 (Calculation): "Shared tags: 'Music' (from Jazz/Blues) and 'Programming' (from Python/Data Science). Similarity score: 0.4."
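The Jaccard computation named in M1 is $|A \cap B| / |A \cup B|$ over tag sets. A minimal sketch follows; note that raw tags here do not overlap at all, so the 0.4 score in K3 presumes some category mapping. The mapping below is hypothetical and chosen only to show the mechanism:

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two tag sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

bob = {"Jazz Music", "Python Programming"}
alice = {"Blues Music", "Data Science"}
raw_score = jaccard(bob, alice)  # 0.0: no raw interest tags match exactly

# A hypothetical mapping to broader categories lets related tags overlap.
categories = {"Jazz Music": "Music", "Blues Music": "Music",
              "Python Programming": "Technical", "Data Science": "Technical"}
cat_score = jaccard({categories[t] for t in bob},
                    {categories[t] for t in alice})  # 1.0 under this mapping
```

The gap between `raw_score` and `cat_score` illustrates why the granularity of the category mapping, not the Jaccard formula itself, determines whether two learners clear the connection threshold.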

Chain of Thought Prompt for ChatGPT:
"The agent's task was to foster connections based on profile similarity. Its method was to calculate interest overlap. For you (Bob), it knew your interests: Jazz Music, Python Programming. For Alice, it knew her interests: Blues Music, Data Science. It calculated a similarity score based on overlapping categories (Music, Programming). The score was 0.4, which met its threshold for suggesting a connection. Therefore..."

Final Generated Explanation:
"I connected you with Alice because our system identified shared interests that could be a good basis for a conversation. Both of you have interests in music (Jazz/Blues) and technical topics related to programming (Python/Data Science). Based on this overlap, I thought you might have common ground to start a discussion."

6. Critical Analysis & Expert Insights

Core Insight: This paper isn't just about making AI explainable; it's a strategic blueprint for engineering trust in socially embedded agents. The authors correctly identify that in domains like education, the agent's influence comes not from raw task performance but from its role as a credible social actor. Their hybrid approach, using a symbolic self-model as a "source of truth" to rein in generative AI's tendency to confabulate, is a pragmatic and necessary move in the current LLM era. It speaks directly to what researchers like Cynthia Rudin argue: we need inherently interpretable models, not post-hoc explanations. Here, the TMK model provides that inherent structure.

Logical Flow & Contribution: The logic is compelling: 1) Social agents need trust, 2) Trust requires transparency, 3) Transparency requires self-explanation, 4) Reliable self-explanation requires a grounded self-model, 5) Usable explanations require natural language, 6) Therefore, combine a grounded model (TMK) with a language generator (LLM). The key contribution is the specific architecture that operationalizes this flow, particularly the use of similarity search over naturalized TMK descriptions as the retrieval mechanism. This is more elegant than hard-coded rule triggers.

Strengths & Flaws: The major strength is its practical hybrid design, avoiding the opacity of pure deep learning and the brittleness of pure symbolic systems. It's a clever application of retrieval-augmented generation (RAG) principles, but applied to self-knowledge rather than external documents—a concept with legs. However, the flaws are significant. First, the self-model is static and handcrafted. It doesn't learn or update from interactions, creating a maintenance burden and risk of drift from the actual agent code. Second, the evaluation is thin. Where are the hard numbers on user trust, comprehension, or behavioral change? Without these, it's an engineering proof-of-concept, not a validated trust-building tool. Third, it assumes the TMK model is a perfect representation of the agent's "true" reasoning, which may not hold for complex, adaptive agents.

Actionable Insights: For practitioners, the takeaway is clear: Start architecting your AI systems with a queryable self-model from day one. This paper provides a viable template. The next step is to automate the creation and updating of this self-model, perhaps using techniques from neuro-symbolic AI or mechanistic interpretability. For researchers, the challenge is to move beyond static self-models to dynamic, learnable self-representations. Can an agent learn its own TMK structure from its experiences and code? Furthermore, the field must develop standardized benchmarks for evaluating the socio-cognitive impact of explanations, not just their technical completeness. Does an explanation like the one generated actually increase a learner's willingness to engage with a peer suggested by the AI? That's the ultimate metric that matters.

7. Future Applications & Research Directions

  • Automated Self-Model Learning: Integrating techniques from program synthesis or LLM-based code analysis to automatically generate and update the TMK self-model from the agent's source code and runtime logs, reducing manual engineering.
  • Explainable Multi-Agent Systems: Extending the framework to explain the behavior of agent collectives or swarms, where explanations may involve coordination protocols and emergent behavior.
  • Personalized Explanation Styles: Adapting the generative component to tailor explanation complexity, tone, and focus based on individual user profiles (e.g., novice vs. expert, skeptical vs. trusting).
  • Proactive & Contrastive Explanations: Moving beyond reactive QA to having the agent proactively offer explanations for unexpected actions or provide contrastive explanations ("I connected you with Alice instead of Charlie because...").
  • Application in High-Stakes Domains: Deploying similar self-explanation architectures in healthcare AI (explaining treatment recommendations), fintech (explaining loan denials), or autonomous systems (explaining navigation decisions), where transparency is legally or ethically mandated.
  • Trust Calibration Research: Longitudinal studies to measure how exposure to such explanations over time affects user trust, reliance, and overall system efficacy in achieving its social goals.

8. References

  1. Goel, A. K., & Joyner, D. A. (2017). Using AI to teach AI: Lessons from an online AI class. AI Magazine.
  2. Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence.
  3. Wei, J., et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems.
  4. Muller, M., et al. (2019). Principles for Explainable AI. Communications of the ACM.
  5. Confalonieri, R., et al. (2021). A historical perspective of explainable AI. WIREs Data Mining and Knowledge Discovery.
  6. Goodfellow, I., et al. (2014). Generative Adversarial Nets. Advances in Neural Information Processing Systems. (As an example of a foundational, yet often opaque, AI technique that necessitates post-hoc explanation methods).
  7. Georgia Institute of Technology, Interactive Computing - Design & Intelligence Lab. (https://dilab.gatech.edu/) – For context on the research environment producing this work.
  8. OpenAI. (2023). ChatGPT. (https://openai.com/chatgpt) – The generative AI component referenced in the paper.