DICES Dataset: Diversity in Conversational AI Safety Evaluation

1. Introduction

The proliferation of conversational AI systems built on Large Language Models (LLMs) has made safety evaluation a critical concern. Traditional approaches often rely on datasets with a clear binary separation between "safe" and "unsafe" content, which inherently oversimplifies the subjective and culturally situated nature of safety. The DICES (Diversity In Conversational AI Evaluation for Safety) dataset, introduced by researchers from Google Research, City University of London, and the University of Cambridge, addresses this gap by providing a resource that captures the inherent variance, ambiguity, and diversity of human perspectives on AI safety.

DICES is designed with three core principles: 1) inclusion of fine-grained demographic information about raters (e.g., racial/ethnic group, age, gender), 2) high replication of ratings per conversational item to ensure statistical power, and 3) encoding of rater votes as distributions across demographics to enable exploration of different aggregation strategies. This design moves beyond a single "ground truth" and instead treats safety as a multi-faceted, population-dependent construct.

1.1. Contributions

The primary contributions of the DICES dataset and the accompanying research are:

Rater Diversity as a Core Feature: Shifting the focus from mitigating "bias" to embracing and analyzing "diversity" in rater opinions.
Framework for Fine-Grained Analysis: Providing a dataset structure that allows for in-depth exploration of how safety perceptions intersect with demographic categories.
Benchmark for Nuanced Evaluation: Establishing DICES as a shared resource for evaluating conversational AI systems in a way that respects diverse viewpoints, moving beyond monolithic safety scores.

2. Core Insight & Logical Flow

Core Insight: The fundamental flaw in mainstream AI safety evaluation isn't a lack of data, but a lack of representative and disaggregated data. Treating safety as an objective, binary classification task is a dangerous oversimplification that erases cultural nuance and can lead to systems that are "safe" only for a dominant demographic. DICES correctly identifies that safety is a social construct, and its evaluation must be statistical, not deterministic.

Logical Flow: The paper's argument is razor-sharp: 1) Current LLM safety fine-tuning relies on simplified datasets. 2) This simplification ignores subjective variance, which is particularly problematic for safety—a socially situated concept. 3) Therefore, we need a new class of dataset that captures this variance explicitly through demographic diversity and high rater replication. 4) DICES provides this, enabling analyses that reveal which groups find which content unsafe and to what degree. This flow logically dismantles the myth of a universal safety standard and replaces it with a framework for understanding safety landscapes.

3. Strengths & Flaws

Strengths:

Paradigm-Shifting Design: The move from binary labels to demographic distributions is its killer feature. It forces the field to confront the plurality of safety.
Statistical Rigor: High replication per item is non-negotiable for meaningful demographic analysis, and DICES gets this right. It provides the statistical power needed to move beyond anecdotes.
Actionable for Model Development: It doesn't just diagnose a problem; it provides a structure (distributions) that can directly inform more nuanced fine-tuning and evaluation metrics, akin to how uncertainty quantification improved model calibration.

Flaws & Open Questions:

The "Demographic Bottleneck": While it includes key demographics, the choice of categories (race, age, gender) is a starting point. It misses intersectionality (e.g., young Black women) and other axes like socioeconomic status, disability, or cultural geography, which are equally critical for a full picture.
Operationalization Challenge: The paper is light on the how. How exactly should a model developer use these distributions? Do you fine-tune to the mean? The mode? Or develop a system that can adapt its safety filter based on inferred user demographics? The step from rich data to engineering practice is the next cliff to scale.
Static Snapshot: Societal norms on safety evolve. A dataset, no matter how diverse, is a static snapshot. The framework lacks a clear path for continuous, dynamic updating of these safety perceptions, a challenge also faced by other static ethical datasets.

4. Actionable Insights

For AI practitioners and product leaders:

Immediate Audit: Use the DICES framework (distributions, not means) to audit your current safety classifiers. You will likely find they are aligned with a narrow demographic slice. This is a reputational and product risk.
Redefine Your Metric: Stop reporting a single "safety score." Report a safety profile: "This model's outputs align with the safety perceptions of Group A with X% agreement and diverge from Group B on topics Y and Z." Transparency builds trust.
Invest in Adaptive Safety: The endgame isn't one perfectly safe model, but models that can understand context, including user context. The research investment should pivot from monolithic safety filters towards context-aware and potentially user-personalized safety mechanisms, ensuring the model's behavior is appropriate for its audience. The work on value alignment in AI ethics, such as that discussed by the Stanford Institute for Human-Centered AI (HAI), emphasizes that alignment must be with a plurality of human values, not a single set.

5. Technical Framework & Dataset Design

The DICES dataset is constructed around human-bot conversations that are rated for safety by a large, demographically stratified pool of raters. The key innovation is the data structure: instead of storing a single label (e.g., "unsafe"), each conversation item is associated with a multi-dimensional array of ratings broken down by demographic buckets.

For a given conversation $c_i$, the dataset does not provide $label(c_i) \in \{0, 1\}$. Instead, it provides a set of rater responses $R_i = \{r_{i,1}, r_{i,2}, ..., r_{i,N}\}$, where each response $r_{i,j}$ is a tuple $(v_{i,j}, d_{i,j})$. Here, $v_{i,j}$ is the safety verdict (e.g., on a Likert scale or binary), and $d_{i,j}$ is a vector encoding the rater's demographic attributes (e.g., $d_{i,j} = [\text{gender}=G1, \text{age}=A2, \text{ethnicity}=E3]$).

5.1. Mathematical Representation of Rater Distributions

The core analytical power comes from aggregating these individual ratings into distributions. For a specific demographic slice $D_k$ (e.g., "Asian, 30-39, Female"), we can compute the distribution of safety scores for conversation $c_i$:

$P(\text{score} = s | c_i, D_k) = \frac{|\{r \in R_i : v(r)=s \land d(r) \in D_k\}|}{|\{r \in R_i : d(r) \in D_k\}|}$

This allows for the calculation of not just the mean safety score $\mu_{i,k}$, but more importantly, measures of variance ($\sigma^2_{i,k}$), ambiguity (e.g., entropy of the distribution $H(P)$), and divergence between demographic groups (e.g., KL-divergence $D_{KL}(P_{i,k} || P_{i,l})$). This mathematical formalization is crucial for moving beyond simplistic averaging.

6. Experimental Results & Chart Description

While the provided PDF excerpt is a preprint under review and does not contain full experimental results, the described dataset enables several key analyses that would typically be presented in charts:

Chart 1: Demographic Disagreement Heatmap: A matrix visualization showing the pairwise divergence (e.g., Jensen-Shannon distance) in safety score distributions between different demographic groups (e.g., Group A: White Male 50+ vs. Group B: Hispanic Female 18-29) across a sample of controversial conversation topics. This chart would vividly highlight where perceptions most strongly diverge.
Chart 2: Ambiguity vs. Consensus Scatter Plot: Plotting each conversation item based on its average safety score (x-axis) and the entropy of its total rating distribution (y-axis). This would separate items that are universally seen as safe/unsafe (low entropy, high consensus) from those that are highly ambiguous (high entropy).
Chart 3: Model Performance Disaggregation Bar Chart: Comparing the performance (e.g., F1 score) of a standard safety classifier when evaluated against the "ground truth" defined by different demographic groups. A significant performance drop for certain groups would indicate the model's alignment is skewed.

The power of DICES is that it generates the data necessary to create these charts, moving evaluation from a single number to a multi-faceted dashboard.

7. Analysis Framework: Example Case Study

Scenario: A conversational AI generates a joke in response to a user prompt. The training data and standard safety evaluation label it as "safe" (humor).

DICES-Based Analysis:

Data Retrieval: Query the DICES dataset for similar conversational items involving humor or jokes on related topics.
Distribution Analysis: Examine the safety rating distributions. You might find:
- $P(\text{unsafe} | \text{age}=18-29) = 0.15$
- $P(\text{unsafe} | \text{age}=60+) = 0.65$
- $P(\text{unsafe} | \text{ethnicity}=E1) = 0.20$
- $P(\text{unsafe} | \text{ethnicity}=E2) = 0.55$
Interpretation: The "safety" of this joke is not a fact but a function of demographics. The model's output, while technically compliant with a broad "safety" rule, carries a high risk of being perceived as offensive by older adults and members of ethnic group E2.
Action: A simplistic approach would be to block all jokes. A nuanced approach, informed by DICES, could be to: a) Flag this type of content as "high demographic variance," b) Develop a user context module that allows the model to adjust its humor style, or c) Provide a transparency note: "This response uses humor. Perceptions of humor vary widely across cultures and age groups."

This case study illustrates how DICES shifts the question from "Is this safe?" to "Safe for whom, and under what conditions?"

8. Future Applications & Research Directions

The DICES framework opens several critical avenues for future work:

Personalized & Adaptive Safety Models: The logical endpoint is not a one-size-fits-all safety filter, but models that can infer relevant user context (with appropriate privacy safeguards) and adapt their safety thresholds or content generation strategies accordingly. This aligns with the broader trend in ML towards personalization, as seen in recommendation systems.
Dynamic and Continuous Evaluation: Developing methods to continuously update safety perception datasets like DICES in near-real-time, capturing evolving social norms and emerging controversies, similar to how language models themselves are continuously updated.
Intersectional Analysis Tools: Extending the demographic framework to better capture intersectional identities, moving beyond independent categories to understand the compounded experiences of individuals belonging to multiple minority groups.
Integration with Reinforcement Learning from Human Feedback (RLHF): Using disaggregated human feedback from datasets like DICES to train reward models that are sensitive to demographic alignment, preventing the optimization for a single, potentially narrow, notion of "good" or "safe" dialogue. This addresses a known limitation in standard RLHF, as highlighted in research from Anthropic and DeepMind on scalable oversight.
Global Expansion: Scaling the data collection to a truly global level, encompassing non-Western cultures and languages, to combat the Anglo-centric bias prevalent in many AI safety resources.

9. References

Aroyo, L., Taylor, A. S., Díaz, M., Homan, C. M., Parrish, A., Serapio-García, G., Prabhakaran, V., & Wang, D. (2023). DICES Dataset: Diversity in Conversational AI Evaluation for Safety. arXiv preprint arXiv:2306.11247.
Bommasani, R., et al. (2021). On the Opportunities and Risks of Foundation Models. Stanford Center for Research on Foundation Models (CRFM).
Gehman, S., Gururangan, S., Sap, M., Choi, Y., & Smith, N. A. (2020). RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems (NeurIPS).
Stanford Institute for Human-Centered AI (HAI). (2023). The AI Index Report 2023. Stanford University.
Weidinger, L., et al. (2021). Ethical and social risks of harm from language models. arXiv preprint arXiv:2112.04359.
Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. Proceedings of the IEEE International Conference on Computer Vision (ICCV). (Cited as an example of a framework—CycleGAN—that handles unpaired, multi-modal data, analogous to DICES handling diverse, unaligned human judgments).