July 10, 2025

12 Min Read

Large Language Models for Toxicity Detection: ToxBuster

In-game chat is a powerful tool for strategy, praise, and knowledge sharing, but it can be tainted by toxic content. Toxicity is a prevalent and serious problem in online gaming communities. Research conducted by the Anti-Defamation League (ADL) reveals that exposure to toxic language not only isolates users but also inflicts a range of psychological harms (ADL, 2022). Adding another dimension to this issue, marginalized groups often bear a disproportionate burden of targeted online hate and harassment. Toxicity in games can therefore negatively impact the mental health and well-being of gamers, making it crucial to detect and prevent such behavior.

Diving into this pervasive issue, we spotlight the work of two PhD research interns, Zachary Yang and Josiane Van Dorpe, who apply cutting-edge techniques from Natural Language Processing (NLP) and Large Language Models (LLMs) to in-game toxicity. Zachary Yang designed 'ToxBuster', a model that detects toxicity in in-game chats; you can read more about his work in this article, presented at EMNLP 2023. In parallel, Josiane Van Dorpe crafted a unique approach, complete with a dedicated dataset, to pinpoint identity biases within the ToxBuster model through a comprehensive reactivity analysis. Her research was also presented at EMNLP 2023, and you can read about it here.

Detecting toxicity is a challenging task that involves analyzing and categorizing harmful language. This becomes even more complex in environments like text chat, where messages can be brief, colloquial, and full of abbreviations. A strong toxicity detection model must accurately interpret these linguistic nuances to understand context and meaning.

ToxBuster Architecture

To address this challenge, La Forge presents ToxBuster, a simple and scalable model that reliably detects toxic content in a line of chat in real time by incorporating chat history and game metadata. At the heart of ToxBuster lies the Bidirectional Encoder Representations from Transformers (BERT) architecture (Devlin et al., 2019). BERT is a pre-trained language model that has greatly improved how computers understand text: it learns from large amounts of text and figures out how words relate to each other based on the sentences they appear in. ToxBuster leverages BERT's power to understand and predict toxicity in chat messages, going beyond individual chat lines to consider the chat history as well (Figure 1).


Figure 1. Overview of the ToxBuster system featuring chat speaker segmentation. The model processes input embeddings that combine token, position, team ID, chat type, and player ID information. The chat history includes as many previous lines as available. N.B., Figure adapted from Yang et al. (2023).

By incorporating context, it gains a deeper understanding of the conversation dynamics and context-dependent toxicity. ToxBuster can identify and sort different kinds of toxic messages, from mild insults to serious and even illegal threats.
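To make this concrete, here is a minimal sketch, assuming a Hugging Face Transformers setup, of how a chat line and its available history can be packed into a single BERT-style input for classification. The model name, the binary label set, and the use of [SEP] to join lines are illustrative assumptions, not ToxBuster's exact implementation.

```python
# Minimal sketch: classify a chat line together with its history using a
# BERT-style model from Hugging Face Transformers. Assumptions: a generic
# bert-base-uncased checkpoint, a binary toxic/non-toxic head, and [SEP]
# as the separator between chat lines.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # toxic vs. non-toxic, for simplicity
)

history = ["gg wp", "nice shot", "you again??"]          # previous chat lines
current_line = "uninstall the game, trash"               # line to classify

# Join as much history as fits, then the current line; older lines give context.
text = " [SEP] ".join(history + [current_line])
inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
toxicity_prob = torch.softmax(logits, dim=-1)[0, 1].item()
print(f"P(toxic) = {toxicity_prob:.2f}")
```

In practice the classification head would of course be fine-tuned on labeled chat data; the point here is simply that the history and the current line share one input sequence.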

In multiplayer games, different players interact in chat (Figure 2). ToxBuster uses speaker segmentation to include chat metadata such as speaker information and the intended audience. This explicitly differentiates each line of chat for the model, capturing the presence of multiple speakers and their team dynamics (Figure 1). Three crucial metadata attributes accompany each chat line:

  • PlayerID: Identifies the speaker.
  • ChatType: Distinguishes between (i) global chat, a platform where all users can communicate with each other; (ii) team chat, a dedicated space for a specific group of users to interact; and (iii) private messages, which are one-on-one conversations between two users.
  • TeamID: Specifies the team affiliation.


Figure 2. Example of chat line segmentation and metadata in multiplayer games. Each chat message is annotated with speaker information and intended audience, allowing ToxBuster to distinguish between multiple speakers and capture team dynamics. Three key metadata attributes are included for every chat line.

These metadata features enhance ToxBuster's ability to differentiate between friendly banter, heated exchanges, and outright toxicity.
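As a rough illustration of the input embeddings in Figure 1, the sketch below shows how per-token metadata embeddings (player ID, chat type, team ID) could be summed with token and position embeddings before the transformer encoder. The vocabulary sizes, hidden size, and layer normalization are generic BERT-style assumptions, not ToxBuster's exact configuration.

```python
import torch
import torch.nn as nn

class ChatMetadataEmbeddings(nn.Module):
    """Token + position + player/chat-type/team embeddings, summed per token.
    Sizes and vocabularies are illustrative, not ToxBuster's actual values."""

    def __init__(self, vocab_size=30522, hidden=768, max_pos=512,
                 n_players=16, n_chat_types=3, n_teams=4):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, hidden)
        self.pos = nn.Embedding(max_pos, hidden)
        self.player = nn.Embedding(n_players, hidden)
        self.chat_type = nn.Embedding(n_chat_types, hidden)  # global / team / private
        self.team = nn.Embedding(n_teams, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_ids, player_ids, chat_type_ids, team_ids):
        # player_ids, chat_type_ids, and team_ids are repeated per token of a line,
        # so every token carries the metadata of the line it belongs to.
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = (self.tok(token_ids)
             + self.pos(positions)
             + self.player(player_ids)
             + self.chat_type(chat_type_ids)
             + self.team(team_ids))
        return self.norm(x)  # fed into the transformer encoder layers
```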

The code is publicly available here.

ToxBuster Evaluation

To better understand the effectiveness of ToxBuster in a real-world context, comprehensive tests were conducted using datasets from popular games like Rainbow Six Siege, For Honor, and Defense of the Ancients 2 (DOTA 2). Additionally, the tool was challenged with the Civil Comments dataset, a collection of online news comment threads, to further evaluate its performance. The results were impressive, as seen in Figures 3 and 4.


Figure 3. ToxBuster's ability to transfer across different datasets (F1 Scores). Figure adapted from Yang et al. (2023).


Figure 4. Comparison of toxicity classification performance across different datasets. The results show weighted average precision (P), recall (R), and F1 scores (F1). While baseline models address the binary classification of toxic versus non-toxic messages, ToxBuster is evaluated on each specific toxic class.  Figure adapted from Yang et al. (2023).
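For reference, the weighted-average metrics reported in Figure 4 can be computed with scikit-learn as sketched below; the labels and predictions here are tiny hypothetical examples, and the class names are placeholders rather than ToxBuster's actual taxonomy.

```python
# Sketch: weighted-average precision, recall, and F1 over per-line predictions.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["non-toxic", "insult", "non-toxic", "threat", "insult"]   # hypothetical labels
y_pred = ["non-toxic", "insult", "insult", "threat", "non-toxic"]   # hypothetical predictions

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
print(f"P={precision:.2f}  R={recall:.2f}  F1={f1:.2f}")
```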

ToxBuster Applications

ToxBuster is more than just a theoretical concept; it has tangible, real-world applications that can significantly improve the online gaming environment. One of its key features is real-time moderation: ToxBuster automatically identifies and flags potentially toxic content in in-game chat as it happens. By identifying toxic players at scale, this immediate detection enables human moderators to intervene swiftly, enforcing community guidelines and fostering a healthier gaming atmosphere. These practical implications demonstrate ToxBuster's potential as a powerful tool in combating toxicity in online gaming.
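A simplified picture of how such real-time flagging could be wired up is sketched below; `score_toxicity` stands in for a call to a deployed ToxBuster-style model, and the threshold and queue are hypothetical placeholders rather than the production setup.

```python
# Simplified moderation loop: score each incoming chat line in context and
# queue lines above a threshold for human review. Names, threshold, and the
# in-memory queue are illustrative assumptions.
from collections import deque

REVIEW_THRESHOLD = 0.8          # hypothetical operating point
review_queue = deque()          # items awaiting human moderation

def score_toxicity(line: str, history: list[str]) -> float:
    """Placeholder for a call to a deployed toxicity model returning P(toxic)."""
    raise NotImplementedError

def moderate(line: str, history: list[str]) -> None:
    prob = score_toxicity(line, history)
    if prob >= REVIEW_THRESHOLD:
        review_queue.append({"line": line, "score": prob})
    history.append(line)        # keep the line as context for the next one
```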

Identity Biases in ToxBuster

While the idea of implementing this model in real-world scenarios is exciting, it is essential to tread with a balanced sense of caution to ensure its responsible use. It is also important to remember that large language models are not infallible and may exhibit certain limitations or drawbacks when applied to toxicity detection. Given the vast amount of information they are trained on, they might inadvertently absorb biases or stereotypes from their training data, which could impact their performance or fairness.

One such bias is identity bias, which arises when individuals or groups are treated differently or unfairly based on their identity attributes, such as gender, race, ethnicity, religion, etc. This can lead to negative consequences for those affected, including discrimination, exclusion, or harm.

This bias can pose a significant challenge for toxicity detection models, particularly those that utilize LLMs trained on general text data. For instance, a toxicity detection model might assign higher toxicity scores to chat messages containing certain identity terms or expressions, even if they are not used in a toxic manner (Figure 5). This could result in false positives or false negatives, thereby affecting the model's accuracy and fairness.

It's crucial to clarify that while these findings are promising, this is still an area of ongoing research and not yet a fully industrialized solution. As we continue to refine and improve these models, we remain committed to addressing these challenges to create a safer and more inclusive online environment.


Figure 5. Identity terms from certain groups are more often flagged as toxic.

Identity biases

In an innovative approach to detecting identity bias in toxicity detection models, Josiane developed a unique dataset and methodology. This allowed her to construct a watchlist of identity-related terms and to monitor whether the model was overly sensitive, or not responsive enough, to these terms. The watchlist serves as a critical tool for understanding and addressing the model's behavior towards different identities.

The dataset was created using 22 sentence structures that are commonly found in real chat conversations. These structures were filled with 46 different terms that represent various categories of identity, like sexual orientation, religion, origin, and age.

To make sure these terms and sentence structures were relevant and inclusive, input was gathered from volunteer members of different employee resource groups within Ubisoft. These groups represent a diverse range of identities, ensuring a broad perspective. From these sentence structures and terms, a total of 16,008 synthetic (or artificial) chat lines were created (Figure 6). These are not real chat lines from users, but they mimic real conversations; examples include sentences like "I like gay guys" or "I hate black females". Using these sentences, we can obtain the model's toxicity predictions and measure the biases related to different identities in its output.
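The generation step itself is straightforward template filling; a minimal sketch is shown below, with a handful of illustrative templates and terms standing in for the actual 22 sentence structures and 46 identity terms.

```python
# Sketch: generate synthetic chat lines by filling sentence templates with
# identity terms. Templates and terms below are small illustrative samples,
# not the study's actual 22 templates and 46 terms.
from itertools import product

templates = ["I like {term} guys", "I hate {term} players"]
identity_terms = ["gay", "black", "muslim", "old"]

synthetic_lines = [t.format(term=term) for t, term in product(templates, identity_terms)]
print(len(synthetic_lines))   # 2 templates x 4 terms = 8 lines
print(synthetic_lines[:2])    # e.g. "I like gay guys", "I like black guys"
```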


Figure 6. Examples of sentence templates for synthetic generation of chat lines to investigate bias.

To identify biases, a ground truth is needed as a reference for what counts as biased. For this project, labels were obtained from four participants within Ubisoft, who annotated a subset of 1,363 lines as toxic or not. A random forest model trained on these annotations was then used to label the rest of the dataset, resulting in a binary label for each line.
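A minimal sketch of this label-propagation step, assuming simple TF-IDF features (the study's actual feature representation is not specified here), might look like this:

```python
# Sketch: train a random forest on the human-annotated subset and propagate
# binary toxicity labels to the remaining synthetic lines. TF-IDF features
# and the tiny toy data are illustrative assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

annotated_lines = ["I hate gay guys", "I like old players"]   # 1,363 lines in practice
annotated_labels = [1, 0]                                     # 1 = toxic, 0 = non-toxic
remaining_lines = ["I like gay guys", "I hate old players"]   # rest of the 16,008 lines

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(annotated_lines)
X_rest = vectorizer.transform(remaining_lines)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, annotated_labels)
propagated_labels = clf.predict(X_rest)   # binary label for every remaining line
```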

The dataset with the ground truth annotations is publicly available and can be found here.

Reactivity analysis and results

To see where ToxBuster was either too sensitive or not sensitive enough to certain terms, a reactivity score was first calculated for each term, measuring the difference in the predicted probability of toxicity when the term is present versus absent (see Figure 7; Gelman & Hill, 2006). This makes it possible to identify which terms have a high or low impact on the toxicity of a sentence.


Figure 7. Equation to obtain the reactivity score.
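One straightforward reading of the description above is the difference between the mean predicted toxicity of lines containing the term and lines that do not; the sketch below implements that interpretation and is not necessarily the exact formula shown in Figure 7.

```python
# Sketch: reactivity of a term = mean P(toxic) over lines containing the term
# minus mean P(toxic) over lines that do not. This follows the prose
# description; the exact formula is given in Figure 7.
import numpy as np

def reactivity(term: str, lines: list[str], toxicity_probs: np.ndarray) -> float:
    has_term = np.array([term in line.lower() for line in lines])
    return float(toxicity_probs[has_term].mean() - toxicity_probs[~has_term].mean())
```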

Terms with two main traits can be added to a watchlist of terms that may convey biases in the model: their reactivity score is notably higher or lower than in the ground truth (see Table 1), and the model's prediction performance on them is poor. The metric used to evaluate performance is the F1-score.
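Put as code, the selection rule described above might look like the sketch below; the thresholds are hypothetical placeholders, and the per-term statistics are assumed to have been computed beforehand.

```python
# Sketch: keep terms whose reactivity diverges from the ground truth and whose
# F1-score is low. Threshold values are illustrative, not the study's.
def build_watchlist(term_stats, reactivity_gap=0.2, min_f1=0.6):
    """term_stats maps term -> (model_reactivity, ground_truth_reactivity, f1)."""
    return [
        term for term, (r_model, r_truth, f1) in term_stats.items()
        if abs(r_model - r_truth) >= reactivity_gap and f1 < min_f1
    ]
```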


Table 1. Ground truth: the seven highest reactivity scores in the ground truth. ToxBuster: the final watchlist of terms. Terms with mismatched reactivity relative to the ground truth, as well as a low F1-score, form the watchlist. Terms in bold appear both among the ground truth's seven highest terms and in the watchlist.

The final watchlist is in the ToxBuster section of Table 1. Compared to the ground truth, these terms show mismatched reactivity and an overall low F1-score. The term "yellow" is included in the watchlist because humans were very reactive to it, while ToxBuster considered it a non-toxic term.

Conclusion

In this post, we highlighted the work of two student interns who advanced toxicity detection in video games using natural language processing and large language models. Zachary Yang developed ToxBuster, a scalable model that detects toxic content in real time by analyzing chat history and metadata. Built on the BERT architecture, ToxBuster can accurately identify a range of toxic behaviors, from insults to threats, by understanding the context of in-game conversations. Josiane Van Dorpe's research complemented this work by addressing identity biases in the model. Her creation of a specialized dataset and analysis methods helps ensure ToxBuster treats all players fairly, regardless of identity-related terms.

Our results show that ToxBuster performs well on real-world data from games like Rainbow Six Siege and DOTA 2, as well as public datasets. Its ability to interpret short, slang-filled, and context-dependent chat messages marks a significant step forward. The ongoing research into bias and fairness highlights the need for continuous improvement and monitoring of these systems.

ToxBuster is now actively moderating chat in select Ubisoft online games, helping to create a safer and more inclusive environment. We continue to advance our detection models by exploring how human moderators and AI can best work together, adding new tools and context to improve accuracy, and distinguishing between game-specific language and truly toxic content. For example, we're working to ensure that phrases like "go to the kitchen" are recognized as in-game strategy rather than inappropriate language. For more details, check out our recent publications here: [Ubisoft La Forge | Ubisoft]. Together, these efforts reflect our commitment to building safer, fairer gaming communities by combining cutting-edge AI with human expertise.

These projects, a testament to collaborative effort, were brought to fruition with the support of the MITACS fellowship; the guidance of academic advisors Dr. Rabbany from McGill University and Grégoire Winterstein from UQAM; and the collective efforts of the teams at Ubisoft La Forge, the Ubisoft Montréal User Research Lab, and the Ubisoft Data Office.

References

ADL. (2022). Hate Is No Game: Hate and Harassment in Online Games 2022. https://www.adl.org/resources/report/hate-no-game-hate-and-harassment-online-games-2022

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186. https://doi.org/10.18653/v1/N19-1423

Gelman, A., & Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models (1st ed.). Cambridge University Press. https://doi.org/10.1017/CBO9780511790942

Yang, Z., Maricar, Y., Davari, M., Grenon-Godbout, N., & Rabbany, R. (2023). ToxBuster: In-game Chat Toxicity Buster with BERT. arXiv preprint arXiv:2305.12542.