Commonsense Reasoning for Conversational AI: A Survey of the State of the Art

1. Introduction

This survey paper addresses the critical challenge of integrating commonsense reasoning into state-of-the-art conversational AI systems. While transformer-based models like BERT, GPT, and T5 have achieved remarkable success in understanding language syntax and contextual semantics, they still struggle with tasks requiring commonsense knowledge—knowledge about the world that humans typically take for granted. The paper argues that this gap significantly hinders the development of truly natural and coherent dialogue systems.

The importance of commonsense for machine intelligence has long been recognized, yet a universal scheme for codifying and integrating this knowledge remains elusive. This survey focuses on the intersection of commonsense reasoning and conversational AI, reviewing relevant datasets, methodologies, and evaluation benchmarks.

2. Commonsense Reasoning in Conversational AI Problems

Commonsense reasoning is crucial across various facets of conversational AI. The paper identifies several key problem areas where its absence is most apparent.

2.1 Dialogue Understanding

Models must infer unstated intentions, resolve ambiguities, and understand implicit context. For example, understanding that "I'm running to the store" implies a mode of transportation and an intent to purchase, not just physical movement.

2.2 Response Generation

Generating coherent, relevant, and socially appropriate responses requires knowledge of social norms, physical laws, and typical human behavior. A model lacking commonsense might generate physically impossible or socially awkward replies.

2.3 Task-Oriented Dialogue

Assisting users with tasks (e.g., booking travel, troubleshooting) requires reasoning about sequences of actions, cause-and-effect relationships, and the properties of objects in the world.

3. Methods for Integrating Commonsense

The survey categorizes primary approaches into three main strategies for incorporating commonsense into conversational AI models.

3.1 Model Fine-Tuning

This approach involves further training (fine-tuning) large pre-trained language models on datasets specifically curated for commonsense reasoning tasks. Datasets like SocialIQA, CommonsenseQA, and PIQA are used to adapt models to reason about social interactions, conceptual properties, and physical intuition.

3.2 Knowledge-Graph Grounding

This method explicitly incorporates structured external knowledge sources. The paper highlights two prominent knowledge graphs (KGs):

ConceptNet: A semantic network containing general world knowledge about words and phrases.
ATOMIC: A KG focused on inferential knowledge about everyday events, capturing "if-then" relations regarding causes, effects, and mental states of participants.

Models are designed to retrieve and reason over information from these KGs during dialogue processing. The COMET model, a transformer-based neural network trained on ConceptNet and ATOMIC, is cited as a key example capable of generating novel commonsense inferences.

3.3 Natural Language Explanations

An emerging approach involves training models not only to produce an answer but also to generate a natural language explanation that justifies the answer using commonsense. This aims to make the model's reasoning process more transparent and potentially more robust.

4. Benchmarks and Evaluation Metrics

Evaluating commonsense reasoning in dialogue is complex. The paper discusses several benchmarks:

Task-Specific Benchmarks: Dedicated datasets for evaluating specific reasoning skills (e.g., physical reasoning in PIQA, social reasoning in SocialIQA).
Integrated Dialogue Benchmarks: Evaluations within broader dialogue tasks, such as the Commonsense Dialogue dataset which tests if a model's responses are consistent with commonsense facts.
Human Evaluation: Ultimately, the naturalness and coherence of a dialogue, judged by humans, remains a critical, though subjective, metric.

Common automatic metrics include accuracy on multiple-choice questions, BLEU/ROUGE for response quality, and novel metrics designed to measure factual consistency or reasoning plausibility.

5. Preliminary Observations on SOTA Models

The paper presents preliminary analysis of two leading open-dialogue models: BlenderBot 3 and LaMDA. Despite their advanced capabilities, both models exhibit significant failures in commonsense reasoning. Examples include:

Generating responses that violate basic physical laws (e.g., suggesting an object can be in two places at once).
Failing to understand implicit social cues or norms.
Producing factually inconsistent statements within a single conversation turn.

These observations strongly motivate the need for focused research in this area, as such failures directly undermine user trust and the perceived naturalness of interactions.

Key Insight

Even the most advanced conversational models (BlenderBot3, LaMDA) demonstrate critical gaps in commonsense, highlighting it as a fundamental frontier, not a peripheral challenge.

6. Technical Details and Mathematical Formulation

The integration of knowledge graphs often involves a retrieval-augmented generation framework. Given a dialogue context $C$ and a knowledge graph $\mathcal{K}$, the model's objective can be framed as generating a response $R$ that maximizes:

$P(R | C, \mathcal{K}) = \sum_{k \in \mathcal{K}_C} P(k | C) \cdot P(R | C, k)$

Where $\mathcal{K}_C$ is a subset of relevant knowledge triples retrieved from $\mathcal{K}$ based on the context $C$. The term $P(k | C)$ represents the retrieval model's probability of selecting knowledge triple $k$, and $P(R | C, k)$ is the probability of the response given the context and the selected knowledge. Models like COMET implement this by fine-tuning a transformer (e.g., GPT-2) on knowledge graph triples formatted as $(head, relation, tail)$, enabling it to generate plausible $tail$ completions for novel $(head, relation)$ queries.

7. Analysis Framework: A Case Study

Scenario: Evaluating a chatbot's understanding of a simple narrative.

User Input: "I poured myself a glass of orange juice, but then the phone rang. When I came back, the glass was empty."

Analysis Framework:

Knowledge Retrieval: The system should retrieve relevant commonsense facts: Liquids can be consumed. Pets (like cats) can drink liquids. People answer phones.
Inference Generation: Using a model like COMET, generate possible inferences for the event "glass of juice left unattended": "If X leaves a drink unattended, then a pet might drink it" (ATOMIC relation: xEffect).
Hypothesis Scoring: Evaluate which inferred explanation ("someone drank it", "it evaporated", "a pet drank it") best fits the context and physical plausibility. The correct inference relies on unstated world knowledge about typical household events.
Response Formulation: Generate a coherent follow-up question or statement: "Oh no, did your cat get to it?" versus an implausible one: "Did it turn into gas?"

This framework highlights the multi-step reasoning required, moving from retrieval to inference to contextual integration.

8. Future Applications and Research Directions

The path forward for commonsense-aware conversational AI involves several key directions:

Multimodal Commonsense: Integrating visual, auditory, and sensory knowledge with language, as pioneered by models like OpenAI's CLIP and DALL-E, which link text with visual concepts. Future dialogue agents may need to reason about scenes described in conversation.
Dynamic Knowledge Graphs: Moving beyond static KGs to systems that can learn and update commonsense knowledge continuously from interactions, similar to how humans do.
Causal Reasoning: Deepening models' understanding of cause-and-effect, a core component of commonsense. Research from Judea Pearl's causal hierarchy suggests moving from association to intervention and counterfactual reasoning is crucial for robust AI.
Personalized and Cultural Commonsense: Developing models that understand commonsense norms that vary across individuals, communities, and cultures.
Neuro-Symbolic Integration: Combining the pattern recognition strength of neural networks (like transformers) with the explicit, logical reasoning capabilities of symbolic AI systems. This hybrid approach, as explored by MIT's Probabilistic Symbolic (PS) models, is a promising avenue for tractable and interpretable commonsense reasoning.

9. References

Richardson, C., & Heck, L. (2023). Commonsense Reasoning for Conversational AI: A Survey of the State of the Art. Workshop on Knowledge Augmented Methods for NLP, AAAI 2023.
Speer, R., Chin, J., & Havasi, C. (2017). ConceptNet 5.5: An Open Multilingual Graph of General Knowledge. Proceedings of AAAI.
Sap, M., et al. (2019). ATOMIC: An Atlas of Machine Commonsense for If-Then Reasoning. Proceedings of AAAI.
Bosselut, A., et al. (2019). COMET: Commonsense Transformers for Automatic Knowledge Graph Construction. Proceedings of ACL.
Gao, J., et al. (2018). Neural Approaches to Conversational AI. Foundations and Trends® in Information Retrieval.
Pearl, J., & Mackenzie, D. (2018). The Book of Why: The New Science of Cause and Effect. Basic Books.
Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision. Proceedings of ICML (CLIP).

Analyst's Perspective: The Commonsense Chasm

Core Insight: The survey by Richardson and Heck exposes a fundamental, yet often understated, truth in modern AI: our most sophisticated language models are brilliant pattern matchers operating in a semantic vacuum. They have mastered the "how" of language but lack the "why"—the foundational world model that grounds meaning. This isn't a minor technical bug; it's an architectural flaw that limits AI's utility and trustworthiness in real-world applications. As the authors note, even flagship models like LaMDA and BlenderBot3 fail on trivial human reasoning tasks, a gap that echoes the limitations observed in other AI domains, such as computer vision models that lack physical understanding despite their perceptual prowess.

Logical Flow & Strengths & Flaws: The paper's strength lies in its clear taxonomy—categorizing approaches into Fine-Tuning, KG-Grounding, and Explanations. This framework usefully segments a chaotic research landscape. The emphasis on Knowledge Graphs like ConceptNet and ATOMIC is apt; they represent the most concrete attempt to bottle the lightning of commonsense. However, the survey also inadvertently highlights the field's central weakness: a reliance on brittle, static, and inevitably incomplete knowledge bases. ConceptNet, while valuable, is a snapshot of consensus reality, lacking the dynamic, contextual, and often contradictory nature of real-world knowledge. The COMET model's approach of generating knowledge is a clever workaround, but it risks hallucinating plausible-sounding but incorrect "facts," trading one problem for another. The benchmarking discussion further reveals a meta-problem: we lack robust, automatic metrics for evaluating reasoning depth, often falling back on multiple-choice accuracy or shallow similarity scores, which are poor proxies for true understanding.

Actionable Insights: The path forward isn't merely scaling existing paradigms. First, the field must prioritize causal and counterfactual reasoning, moving beyond correlation. As Judea Pearl's work argues, understanding "what if" and "why" is the bedrock of robust intelligence. Second, we need a shift towards neuro-symbolic integration. Pure neural approaches are data-hungry and opaque; pure symbolic systems are brittle. Hybrid models, which leverage neural networks for perception and pattern matching alongside symbolic engines for logical deduction, offer a promising, though computationally challenging, path. Institutions like MIT's CSAIL are making strides here. Finally, evaluation must evolve. We need benchmarks that stress-test reasoning chains, require justification, and penalize contradictions, moving beyond single-turn tasks to multi-step dialogue narratives that expose logical inconsistencies. The future of conversational AI isn't just about better chat; it's about building machines that share our understanding of the world, a goal that remains tantalizingly out of reach but is now more clearly defined thanks to surveys like this one.