September 5, 2025

8 Min Read

Impact of an LLM-based Code Review Assistant in practice

Code review is a standard practice in modern software development, aiming to improve code quality and facilitate knowledge exchange among developers. As providing constructive reviews on a submitted patch is challenging, Ubisoft has invested in developing and evaluating a customized code review assistant, RevMate, to support our developers in their code review tasks. This initiative was carried out in collaboration with Mozilla, who developed the plugin for their own needs as well.

Details about our study can be found in our paper: arXiv:2411.07091

Why do we need to evaluate review assistance in practice?

Software code review is a core practice for modern software quality assurance, widely adopted in industrial and open-source projects. While code review was initially largely synonymous with code inspections of a submitted patch (structured as a set of chunks, i.e., successive lines of modified code), the field gradually adopted a more dynamic approach commonly known as modern code review, which embraces social dimensions such as facilitating knowledge transfer between developers and strengthening synergy within teams.

Despite such benefits, code review also brings additional costs, notably the delay between a patch's submission and its final approval for integration, caused by the back-and-forth between its author and its reviewers. Additionally, providing valid and effective reviews requires non-trivial effort from reviewers in terms of technical, social, and personal aspects. Reviewers need to understand and reason about the overall impact of the changes under analysis while communicating effectively. Even qualified reviewers focused on a patch might miss issues due to fatigue, distraction, or pressure from other deadlines.

As providing constructive reviews on a submitted patch is a challenging and error-prone task, the advances of Large Language Models (LLMs) in performing natural language processing (NLP) tasks in a human-like fashion have prompted researchers to evaluate LLMs' ability to automate the code review process. However, outside lab settings, it is still unclear how willing reviewers are to accept comments generated by LLMs in a real development workflow.

To fill this gap, we conducted a large-scale empirical case study in a live setup to evaluate the acceptance of LLM-generated comments and their impact on the review process. The case study was performed in two organizations, as this work was done in collaboration with Mozilla; however, in this blog post we report observations from Ubisoft only.

How does our review assistant tool, RevMate, work?

For this evaluation, we built RevMate, an LLM-based review assistant tool that generates review comments and is easy to integrate into modern review environments. RevMate builds on GPT-4o and uses both (i) Retrieval Augmented Generation (RAG) to incorporate relevant context and ground the model in the project under analysis, and (ii) LLM-as-a-Judge to leverage LLMs' capacity to evaluate generated content and discard irrelevant review comments. As we wanted to gather as many review evaluations as possible despite the limitations of human evaluation, we settled on two variants (a simplified sketch of the overall pipeline follows the list below):

  • RevMate with extra code context (Code): the model can request a code retrieval tool to provide function definitions and additional code lines from the codebase under analysis.
  • RevMate with related comment examples (Example): for each patch, the model dynamically selects few-shot examples of review comments similar to its chunks.
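To make this pipeline more concrete, the sketch below shows, in Python, how a RAG-plus-judge loop of this kind can be wired together. It is a simplified illustration only: the prompts, the `Chunk` and `codebase_index` abstractions, and all function names are placeholders we chose for this post, not RevMate's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

# Any text-in/text-out LLM call (e.g., a thin wrapper around a chat API).
LLM = Callable[[str], str]


@dataclass
class Chunk:
    """A set of successive modified lines from the patch under review."""
    file_path: str
    diff_text: str


def retrieve_context(chunk: Chunk, codebase_index) -> str:
    # Hypothetical RAG step: the "Code" variant would fetch function
    # definitions and extra code lines, while the "Example" variant would
    # fetch similar past review comments to use as few-shot examples.
    return "\n".join(codebase_index.search(chunk.diff_text, top_k=3))


def generate_candidates(chunk: Chunk, context: str, llm: LLM) -> List[str]:
    prompt = (
        "You are a code reviewer. Given the diff and the related context, "
        "write up to three concise review comments, one per line.\n\n"
        f"Context:\n{context}\n\nDiff ({chunk.file_path}):\n{chunk.diff_text}"
    )
    # Naive parsing: one candidate comment per non-empty output line.
    return [line.strip("- ").strip() for line in llm(prompt).splitlines() if line.strip()]


def judge(comment: str, chunk: Chunk, llm: LLM) -> bool:
    # LLM-as-a-Judge filter: a second prompt decides whether the candidate
    # is relevant and specific enough to be shown to a reviewer.
    verdict = llm(
        "Answer YES or NO. Is this review comment relevant, correct and "
        f"specific for the diff below?\n\nDiff:\n{chunk.diff_text}\n\nComment:\n{comment}"
    )
    return verdict.strip().upper().startswith("YES")


def review_patch(chunks: List[Chunk], codebase_index, llm: LLM) -> List[str]:
    suggestions: List[str] = []
    for chunk in chunks:
        context = retrieve_context(chunk, codebase_index)
        for candidate in generate_candidates(chunk, context, llm):
            if judge(candidate, chunk, llm):
                suggestions.append(candidate)
    return suggestions
```

In this sketch, only candidates that pass the judge step are surfaced as suggestions; everything else is silently dropped, which is how irrelevant comments are filtered out before reaching the evaluation box described below.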

We integrated RevMate into participants' review environment, making it available to them when performing their day-to-day code review tasks. During the case study, reviewers could evaluate each generated comment, presented as a suggestion in the evaluation box shown in Figure 1: "add to comment" to accept it or "ignore" to reject it. In case of rejection, we additionally asked reviewers to provide a reason for ignoring the comment, choosing from the options shown in Figure 2. Before evaluating a comment, reviewers could also edit its content. Once evaluated, the evaluation box disappears, leaving no trace in the case of rejection and a publicly published comment in the case of acceptance. Each generated comment can only be evaluated once.


Figure 1: RevMate's UI for generated comments' evaluation boxes


Figure 2: RevMate's UI for choosing a reason for ignoring a generated comment
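As a recap of this single-evaluation workflow, here is a minimal sketch of how a generated comment's lifecycle could be modelled. The class, its method names, and the state values are hypothetical illustrations of the behaviour described above, not RevMate's actual code; the concrete rejection reasons are those shown in Figure 2.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Evaluation(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"  # "add to comment": published on the review
    IGNORED = "ignored"    # "ignore": discarded, only the reason is recorded


@dataclass
class GeneratedComment:
    text: str
    evaluation: Evaluation = Evaluation.PENDING
    ignore_reason: Optional[str] = None

    def edit(self, new_text: str) -> None:
        # Reviewers may edit the content, but only before the single evaluation.
        self._require_pending()
        self.text = new_text

    def accept(self) -> str:
        # "add to comment": the (possibly edited) text becomes a public comment.
        self._require_pending()
        self.evaluation = Evaluation.ACCEPTED
        return self.text

    def ignore(self, reason: str) -> None:
        # "ignore": no trace is left on the review; only the chosen reason
        # (one of the options shown in Figure 2) is kept for the study.
        self._require_pending()
        self.evaluation = Evaluation.IGNORED
        self.ignore_reason = reason

    def _require_pending(self) -> None:
        # Each generated comment can be evaluated only once.
        if self.evaluation is not Evaluation.PENDING:
            raise ValueError("comment already evaluated")
```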

Spanning 6 weeks, our case study at Ubisoft involved 31 reviewers, covered 422 patch reviews, and led to the evaluation of 1.2k generated comments. During this study, we monitored the reviewers' interactions with the LLM-based review assistant, assessed the impact of generated comments on the review flow, and finally conducted a survey of the participating reviewers, answered by 22 of the 31 participants, to gain insight into their perspectives.

How frequently do reviewers use comments generated by RevMate?

During this study at Ubisoft, generated review comments achieved a promising appreciation ratio of 28.3%. The appreciation ratio reflects three positive outcomes: comments marked as accepted (7.2%), as useful for developers (12.8%), and as useful as a tip for reviewers (7.7%).

"Useful for developer" comments reveal that 12.8% of generated comments would actually be more valuable to developers before they submit their changes; these generated comments thus arrive too late in the development process. One participant reported in the survey: "It detects perfectly possible flaws in the code, but most of the suggestions are more in the development realm than the review", with 4 participants making similar comments. Providing these suggestions to developers as a preview before the peer review step could be a valuable improvement to development pipelines, as such "Useful for developer" comments could improve the quality of the code shown to reviewers.

Users value "Useful as tip for reviewers" comments as they guide the review. Although some comments are not actionable or do not mention specific problems, they can still raise interesting questions. One participant stated: "there's more value in highlighting potential issues with a more concise explanation and let the reviewer write the comment", with similar feedback provided by 10 other participants. Thus, we suggest that these review tips be presented separately from publishable comments, as they serve a different purpose.

How do comment categories correlate with RevMate's acceptance ratios?

Our study shows that Functional and Refactoring comments are the most commonly generated comments, representing 84.8% and 14.5% respectively at Ubisoft. We found that this distribution matches that of human-written comments: a sample of 1,211 human-written comments contained 81.4% Functional and 15.8% Refactoring comments.

Furthermore, refactoring comments are accepted more often than functional comments (18.6% vs 5.2% at Ubisoft). However, when not accepted, refactoring comments are also more often marked as "incorrect" (25% vs 15.6%). As functional comments are more difficult to generate accurately from patches, the assistant tends to take fewer risks for that category and generates more obvious and trivial comments, which reviewers tend not to publish. Through manual inspection of a sample of 244 generated comments, we determined their actionability (i.e., whether authors could understand from the comment which change was needed). Among these 244 labelled comments, 79% of "Accepted" and 62% of "Rejected" comments were actionable, but only 36% of "Useful" comments were, which confirms that the model takes fewer risks with those comments' claims.

How does the adoption of RevMate impact the patch's review process?

Generated comments, when accepted, impact the code similarly to human-written comments. At Ubisoft, accepted generated comments lead to as many revisions at the line level (62.3% vs 64.3%) and at the chunk level (73.9% vs 73.2%) as human-written comments do.

Takeaways

Code review plays an essential role in modern collaborative software development, representing a way to improve the quality of evolving systems while improving developers' social and technical skills. Recent studies have introduced LLM-based review comment generation into the review process; however, they have not evaluated the impact of such approaches on the review process in practice.

Through a large-scale case study, we report an appreciation ratio of 28.3% at Ubisoft. Refactoring comments, despite being the second most common type of generated comments (14.5%) after functional comments (84.8%), have a significantly better acceptance level (18.6% vs 5.2%). Regarding accepted comments, we find that their impact on patches' review processes is similar to that of human-written comments, with 74% vs 73% of comments having at least one follow-up revision at the chunk level. Following these promising results for the future adoption of LLM-generated review comments, we integrated our solution into our productions to assist our developers in their review tasks. The integration was designed to be usable in both self-review (i.e., reviewing one's own code) and peer review settings.

We hypothesize that the gap between acceptance and appreciation of generated review comments stems from the task not being properly defined for the LLM. Indeed, the LLM addresses multiple tasks at once, i.e., generating publishable comments, tips for developers, and tips for reviewers, all of which are valuable to the review process. Future work should explore improving overall comment acceptance while treating these distinct comment types (publishable comments, comments for developers, and review tips) separately. Furthermore, we could explore the categories of review comments, treating the generation of each category as a separate task and focusing generation on specific categories depending on reviewers' preferences. The task of functional comment generation should be improved, as it shows a lower acceptance rate even though it is the type of comment that humans write the most.