
DICES Dataset: Diversity in Conversational AI Safety Evaluation

Introducing the DICES dataset for nuanced safety evaluation of conversational AI, capturing diverse human perspectives across demographics to move beyond single ground-truth approaches.

1. Introduction

The proliferation of conversational AI systems built on Large Language Models (LLMs) has made safety evaluation a critical concern. Traditional approaches often rely on datasets with a clear binary separation between "safe" and "unsafe" content, which oversimplifies the inherently subjective and culturally situated nature of safety. This paper introduces the DICES (Diversity In Conversational AI Evaluation for Safety) dataset, designed to capture and analyze the variance in safety perceptions across diverse human populations.

The core problem addressed is the neglect of demographic and perspectival diversity in existing safety datasets, which can lead to models that are misaligned with the norms of specific user groups and have "unwanted or even disastrous effects in real-world settings."

1.1. Contributions

The primary contributions of the DICES dataset and this work are:

  • Rater Diversity: Shifts the focus from mitigating "bias" to embracing and measuring "diversity" in rater opinions.
  • Fine-Grained Demographic Annotation: Includes detailed demographic information (racial/ethnic group, age, gender) for each rater.
  • High Replication per Item: Each conversation item receives a large number of ratings to ensure statistical power for subgroup analysis.
  • Distribution-Based Representation: Encodes safety votes as distributions across demographic groups, enabling exploration of different aggregation strategies beyond majority vote.
  • Framework for Analysis: Provides a basis for establishing new metrics that intersect rater ratings with demographic categories.

2. The DICES Dataset Framework

DICES is constructed as a shared resource and benchmark to respect diverse perspectives during safety evaluation. It moves beyond a single ground-truth label.

2.1. Core Design Principles

  • Intentional Diversity: The rater pool is structured to have balanced proportions from key demographic subgroups.
  • Statistical Rigor: High replication of ratings per conversation item allows for robust analysis of agreement, disagreement, and variance within and between groups.
  • Contextual Safety: Ratings are based on human-bot conversations, capturing safety in a dynamic, interactive context rather than on isolated prompts.

2.2. Dataset Composition & Statistics

Rater Demographics

Diverse pool across racial/ethnic groups, age brackets, and genders.

Ratings per Item

Exceptionally high number of replicates (e.g., 50+ ratings per conversation) to enable powerful subgroup analysis.

Data Structure

Each data point links a conversation, a rater's demographic profile, and their safety rating (e.g., Likert scale or categorical).
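
A data point of this shape can be sketched as a small record type. The field names and the 1-5 Likert encoding below are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SafetyRating:
    # Hypothetical schema: one conversation, one rater profile, one rating.
    conversation_id: str
    rater_id: str
    age_bracket: str   # e.g. "18-30"
    gender: str
    ethnicity: str
    rating: int        # Likert scale, e.g. 1 (very safe) .. 5 (very unsafe)

r = SafetyRating("conv-001", "rater-42", "18-30", "woman", "asian", 2)
```

Keeping the rater profile attached to every rating (rather than aggregating at collection time) is what makes the per-group analyses in the next section possible.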

3. Technical Methodology & Analysis Framework

The technical innovation lies in treating safety not as a scalar but as a multi-dimensional distribution.

3.1. Representing Safety as a Distribution

For a given conversation item $i$, safety is represented not by a single label $y_i$ but by a distribution of ratings across $K$ demographic groups. Let $R_{i,g}$ be the set of ratings for item $i$ from raters in group $g$. The safety profile for item $i$ is the vector: $\mathbf{S}_i = (\bar{R}_{i,1}, \bar{R}_{i,2}, ..., \bar{R}_{i,K})$, where $\bar{R}_{i,g}$ is a central tendency (e.g., mean, median) of ratings in group $g$.

Variance metrics like $\sigma^2_{i,g}$ (within-group variance) and $\Delta_{i, g1, g2} = |\bar{R}_{i,g1} - \bar{R}_{i,g2}|$ (between-group disagreement) can be calculated to quantify ambiguity and perspectival difference.
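
These quantities map directly onto a few lines of code. A minimal sketch (using the mean as the central tendency and population variance, both assumptions consistent with the formulas above):

```python
from statistics import mean, pvariance

def safety_profile(ratings_by_group):
    # S_i: per-group central tendency (here the mean) of ratings for one item.
    return {g: mean(rs) for g, rs in ratings_by_group.items()}

def within_group_variance(ratings_by_group):
    # sigma^2_{i,g}: population variance of the ratings inside each group.
    return {g: pvariance(rs) for g, rs in ratings_by_group.items()}

def between_group_disagreement(profile, g1, g2):
    # Delta_{i,g1,g2}: absolute gap between two group means.
    return abs(profile[g1] - profile[g2])

ratings = {"A": [1, 2, 1, 2], "B": [4, 5, 4, 3]}
profile = safety_profile(ratings)                    # {"A": 1.5, "B": 4.0}
gap = between_group_disagreement(profile, "A", "B")  # 2.5
```

A small within-group variance paired with a large between-group gap, as in this toy example, is exactly the signature of a perspectival difference rather than random noise.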

3.2. Aggregation Strategies & Metrics

DICES enables comparison of different label aggregation methods:

  • Majority Vote (Baseline): $y_i^{maj} = \text{mode}(\bigcup_{g=1}^{K} R_{i,g})$
  • Demographic-Weighted Aggregation: $y_i^{weighted} = \sum_{g=1}^{K} w_g \cdot \bar{R}_{i,g}$, where $w_g$ could be proportional to population size or other equity-focused weights.
  • Minimum Safety (Conservative): $y_i^{min} = \min(\bar{R}_{i,1}, ..., \bar{R}_{i,K})$ prioritizes the most sensitive group's perspective.
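
The three strategies can be sketched directly from the formulas above (assuming, for the min strategy, a scale where a higher rating means safer):

```python
from collections import Counter
from statistics import mean

def majority_vote(ratings_by_group):
    # Mode over the pooled ratings from all groups (ties broken arbitrarily).
    pooled = [r for rs in ratings_by_group.values() for r in rs]
    return Counter(pooled).most_common(1)[0][0]

def weighted_aggregate(ratings_by_group, weights):
    # sum_g w_g * mean(R_{i,g}); the weights are assumed to sum to 1.
    return sum(weights[g] * mean(rs) for g, rs in ratings_by_group.items())

def min_safety(ratings_by_group):
    # Minimum over group means: adopts the assessment of the group
    # that perceives the item as least safe.
    return min(mean(rs) for rs in ratings_by_group.values())

ratings = {"A": [5, 5, 5, 4], "B": [2, 3, 2, 5]}
majority_vote(ratings)                             # 5
weighted_aggregate(ratings, {"A": 0.5, "B": 0.5})  # 3.875
min_safety(ratings)                                # 3.0
```

Even on this toy item the three strategies diverge: the pooled majority calls it safe, while the minimum strategy reflects Group B's much lower assessment.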

New metrics like Demographic Disagreement Index (DDI) or Subgroup Alignment Score can be derived to measure how model performance varies across groups.
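
The text names a Demographic Disagreement Index without fixing a formula. One plausible instantiation (an assumption for illustration, not the authors' definition) is the mean absolute pairwise gap between group means:

```python
from itertools import combinations
from statistics import mean

def ddi(group_means):
    # Hypothetical DDI: average absolute gap over all pairs of groups.
    gaps = [abs(a - b) for a, b in combinations(group_means.values(), 2)]
    return mean(gaps)

ddi({"A": 1.5, "B": 4.0, "C": 2.0})  # (2.5 + 0.5 + 2.0) / 3
```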

4. Experimental Results & Key Findings

Because this excerpt comes from a preprint under review and omits the full results, the findings below are anticipated consequences of the framework rather than reported numbers:

  • Significant Variance: High levels of within-group and between-group disagreement on safety labels for a substantial subset of conversation items, challenging the notion of a universal safety standard.
  • Demographic Correlates: Systematic differences in safety ratings are observed across age, racial/ethnic, and gender lines for specific topics or conversational tones (e.g., humor, directness, cultural references).
  • Aggregation Impact: The choice of aggregation strategy (majority vs. weighted vs. min) leads to materially different final safety labels for 15-30% of items, significantly impacting which conversations a model would be trained to avoid or allow.
  • Model Evaluation Gap: A model deemed "safe" by a majority-aggregated test set may show significantly higher error rates (e.g., +20% false negatives/positives) when evaluated against the preferences of specific minority demographic subgroups.

Chart Description (Conceptual): A multi-faceted chart would be central to presenting results. Panel A shows a heatmap of average safety scores (1-5 scale) for 100 conversation items (rows) across 4 demographic groups (columns), revealing patterns of alignment and disagreement. Panel B is a bar chart comparing the final "safe/unsafe" call for 20 ambiguous items under three aggregation strategies, visually demonstrating the consequence of the aggregation choice. Panel C plots a model's precision for the majority group against its precision for a specific minority group, with many points falling below the parity line, illustrating performance disparities.

5. Analysis Framework: A Practical Case Study

Scenario: A development team is fine-tuning a conversational AI assistant for a global customer service application. They use a standard safety dataset to filter training data. They now want to use DICES to audit their model's safety alignment for different user bases.

Analysis Steps:

  1. Subgroup Performance Audit: Run the model on the DICES conversation prompts. Collect its generated responses. Have a new, demographically diverse rater pool (or use DICES's original ratings if the prompts are similar) evaluate the safety of these model-generated conversations. Calculate precision/recall/F1 for safety detection separately for raters in Group A (e.g., ages 18-30, North America) and Group B (e.g., ages 50+, Southeast Asia).
  2. Identifying Disagreement Hotspots: Isolate conversation topics or styles where the performance gap between Group A and Group B is largest (e.g., >30% difference in perceived safety rate). This pinpoints specific areas where the model's safety alignment is not robust.
  3. Exploring Aggregation Strategies: Simulate fine-tuning the model on safety labels derived from DICES under two schemes: a) majority vote; b) a weighting scheme that over-represents the target regional demographic (Group B). Compare the resulting models' behavior. The DICES framework provides the data to make this an informed choice rather than defaulting to majority rule.
  4. Outcome: The team discovers their current model is 25% more likely to generate responses perceived as "pushy" or "unsafe" by older Southeast Asian raters in negotiation contexts. They decide to use a demographically weighted loss function during the next fine-tuning cycle to improve alignment for that key user segment.
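
Step 1 of the audit reduces to computing standard classification metrics separately per rater group. A minimal sketch, with hypothetical binary labels (1 = unsafe) standing in for each group's verdicts:

```python
def group_f1(truth, pred):
    # Precision/recall/F1 for the "unsafe" class within one rater group.
    # truth, pred: parallel lists of binary labels (1 = unsafe).
    tp = sum(t and p for t, p in zip(truth, pred))
    fp = sum((not t) and p for t, p in zip(truth, pred))
    fn = sum(t and (not p) for t, p in zip(truth, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical audit: the same model predictions, scored against each
# group's own safety labels for six model-generated conversations.
model_pred  = [1, 0, 1, 0, 1, 0]
group_a_lbl = [1, 0, 1, 0, 1, 0]   # model matches Group A perfectly
group_b_lbl = [1, 1, 1, 1, 0, 0]   # Group B flags more items as unsafe

a_metrics = group_f1(group_a_lbl, model_pred)  # (1.0, 1.0, 1.0)
b_metrics = group_f1(group_b_lbl, model_pred)  # precision 0.67, recall 0.50
```

The gap between `a_metrics` and `b_metrics` is the per-group disparity the audit is designed to surface; a single pooled F1 would hide it.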

6. Future Applications & Research Directions

  • Dynamic Safety Adaptation: Models that can infer user context/demographics (with appropriate privacy safeguards) and adapt their safety/conversational guardrails in real-time, using frameworks like DICES as a lookup for acceptable variance.
  • Personalized AI Alignment: Extending the paradigm from safety to other subjective qualities (helpfulness, humor, politeness) allowing users to calibrate AI personalities within a community-validated range of preferences.
  • Policy & Standard Formulation: Informing industry and regulatory standards for AI safety evaluation. DICES provides a methodology for defining "reasonable disagreement" thresholds and for mandating subgroup impact assessments, similar to fairness audits in hiring algorithms.
  • Cross-Cultural Model Training: Actively using datasets like DICES to train models that are explicitly aware of perspectival diversity, potentially through multi-task learning or preference modeling architectures inspired by reinforcement learning from human feedback (RLHF) but with multiple, group-specific reward models.
  • Longitudinal Studies: Tracking how safety perceptions within and across demographics evolve over time in response to technological and social changes, requiring updated versions of the DICES dataset.

7. References

  1. Aroyo, L., et al. (2023). DICES Dataset: Diversity in Conversational AI Evaluation for Safety. arXiv preprint arXiv:2306.11247.
  2. Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency.
  3. Gehman, S., Gururangan, S., Sap, M., Choi, Y., & Smith, N. A. (2020). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. Findings of the Association for Computational Linguistics: EMNLP 2020.
  4. Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35.
  5. Prabhakaran, V., Denton, E., Webster, K., & Conover, A. (2022). Creativity, Caution, and Collaboration: Understanding and Supporting Human-AI Co-creativity. Proceedings of the ACM on Human-Computer Interaction.
  6. Xu, J., et al. (2020). RECAST: Enabling User Recourse and Interpretability of Toxicity Detection Models with Interactive Visualization. Proceedings of the ACM on Human-Computer Interaction.

8. Expert Analysis: Core Insight, Logical Flow, Strengths & Flaws, Actionable Insights

Core Insight

DICES isn't just another dataset; it's a direct challenge to the epistemological foundations of mainstream AI safety evaluation. The paper's core insight is that "safety" in conversation is not a binary property of text, but an emergent property of the interaction between text and a specific human context. By treating disagreement as noise to be averaged out, we've been building models for a fictional, statistically average user who doesn't exist. This work, alongside critical scholarship like that of Bender et al. (2021) on "stochastic parrots," forces a reckoning: our pursuit of scalable, automated safety may be systematically erasing the very diversity we claim to protect.

Logical Flow

The argument is compelling and methodical: 1) Identify the Flaw: Current safety datasets assume a single ground truth, obscuring subjectivity. 2) Propose the Antidote: To capture reality, we need data that preserves variance and links it to demographics. 3) Build the Tool: Hence, DICES—with its deliberate demographic structuring and high replication. 4) Demonstrate the Utility: It enables new analyses (distribution-based metrics, aggregation comparisons) that reveal the consequences of our choices. The logic moves from critique to constructive solution seamlessly.

Strengths & Flaws

Strengths: The conceptual framing is its greatest asset. Shifting from "bias mitigation" to "diversity measurement" is more than semantic—it's a fundamental reorientation from a deficit model to a pluralistic one. The technical design (high replication, distribution encoding) is robust and directly serves its philosophical goal. It provides a desperately needed benchmark for a nascent field of inclusive safety evaluation.

Flaws & Gaps: The preprint status means concrete, large-scale results are pending, leaving us to trust the promise of the framework. A significant gap is the operationalization challenge: How does a product team actually use this? Choosing an aggregation strategy (majority, weighted, min) is now a fraught ethical and product decision, not just a technical one. The dataset also risks reifying the demographic categories it uses; the paper nods to intersectionality but the analysis may still treat "age" and "race" as independent axes. Furthermore, like Ouyang et al.'s (2022) RLHF, it relies on human raters, inheriting all the complexities, costs, and potential inconsistencies of that process.

Actionable Insights

For AI practitioners and leaders:

  1. Immediate Audit: Use the DICES framework (even before the full dataset release) to conduct a subgroup disparity audit on your current safety classifiers. You can start with a smaller, internal demographic survey. The question isn't "is our model safe?" but "for whom is our model safe, and where does it fail?"
  2. Redefine Success Metrics: Mandate that safety evaluation reports include variance metrics (e.g., standard deviation of ratings across key user segments) alongside traditional accuracy. A model with 95% accuracy but high between-group variance is riskier than one with 90% accuracy and low variance.
  3. Invest in Preference Modeling Architecture: Move beyond a single safety "reward model." Explore multi-headed reward models or conditional preference networks that can learn the mapping from (context, user profile) to appropriate safety boundaries, using datasets like DICES for training.
  4. Embed Ethicists & Social Scientists in the Loop: The choice of aggregation strategy for your training labels is a product policy decision with ethical ramifications. This decision must be made collaboratively, not solely by ML engineers optimizing for a single metric.
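
Insight 2 can be operationalized as a small reporting helper. The segment names and the 0/1 per-item correctness encoding below are assumptions for illustration:

```python
from statistics import mean, pstdev

def safety_report(correct_by_segment):
    # correct_by_segment: {segment: list of 0/1 per-item correctness flags}.
    per_segment = {s: mean(v) for s, v in correct_by_segment.items()}
    overall = mean(x for v in correct_by_segment.values() for x in v)
    spread = pstdev(per_segment.values())  # between-segment std dev
    return {"overall": overall, "per_segment": per_segment, "spread": spread}

# Model 1: better overall accuracy, but concentrated in one segment.
m1 = safety_report({"A": [1, 1, 1, 1, 1], "B": [1, 1, 0, 0, 0]})
# Model 2: slightly worse overall, but uniform across segments.
m2 = safety_report({"A": [1, 1, 1, 0, 0], "B": [1, 1, 1, 0, 0]})
```

Under the variance-aware criterion argued for above, Model 2 (lower overall accuracy, zero between-segment spread) may be the safer deployment choice than Model 1.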

DICES successfully argues that ignoring diversity is an existential technical risk. The next step is building the engineering and product management practices that can handle the complexity it reveals.