@radekmie

On AI Code Reviews

By Radosław Miernik

Intro

For a person who’s about to finish their PhD in computer science with a focus on artificial intelligence, I’m extremely sceptical about using LLMs for anything. You may see me as an old man yelling at the cloud (though I’m not even thirty), but I’ll take certainty over quantity every single day. There’s a certain level of quality I expect of myself that I just can’t force myself into giving up[1].

To this day, I have never willingly[2] used an LLM except for two things: experimenting with some local models for analyzing SMTP messages[3], and getting my code reviewed using GitHub Copilot. In this text, I want to discuss the latter.

It may come as a surprise, but I will not be strictly negative about it, so if you’re looking for some arguments for using it, keep reading. But don’t worry, I will be mostly negative anyway.

Setup and costs

Setting up Copilot pull request reviews is as easy as configuring a ruleset. If you have never used rulesets before, do check them out – there’s stuff like blocking force pushes or certain file extensions (like .env).

However, it only runs for users with an active license. Not a problem, but I wish it had been clearer from the start, as only a few people in my team had one, so… Now we pay $19/month/person for AI reviews (the business version is more expensive than the individual one). That’s not a lot compared to other costs (including salary), but I wish there were some review-only pricing tiers.

Note that it’s bundled with the “standard Copilot”, i.e., the coding agent. I won’t talk about it today, but if you’re already using it (and paying for it), pull request reviews are included.

Good parts

The first pull request it ever reviewed included a few new Playwright test scenarios, and it left exactly one comment… All it found was a typo in a test step name; “visbile” -> “visible”, to be precise.

Was it critical? Of course not. Would a manual review point it out? Most likely yes (we usually do). But was it worth $19 each month? I’d say yes: since the manual review passed on the first round, we merged it straight away; that’s one human-in-the-loop iteration less. Nothing crazy, but a nice improvement.

In the following days, we got more reviews coming in every day (also because I realized I had to buy licenses for everyone). After a while, it was clear that while most of the comments are rather nitpicks (it often suggests adding a lot of comments, or documenting usage of some less-common CSS properties), it’s generally worth keeping.

So far, the suggestions I liked the most were test-related (additional assertions, etc.) and simply typos. The former is a matter of taste (I like having exhaustive assertions); the latter is simply a time-saver. (Please note that some “typos” are critical, e.g., an incorrect enum variant used.)

Meh parts

We enabled the automatic reviews and re-reviews on code pushes. Sure, the documentation mentions it, but I really dislike the fact that it comments on the same thing over and over again. I wouldn’t mind if it were one comment repeated three or four times, but we have a pull request with over 100 Copilot comments.

Such duplication goes even further – if you repeat a certain pattern a few times in a single commit, you will get a few extremely similar suggestions. At the same time, we also got multiple different comments for the same piece of code. I believe it will be improved at some point, but there’s nothing about it on the public GitHub roadmap yet.
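To get a feel for how much of the noise is plain repetition, near-identical comments can be grouped by text similarity. The following is only a rough sketch: the function name `group_similar`, the 0.85 threshold, and the sample comments are all made up for illustration, and it works on plain strings rather than on anything fetched from GitHub.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase the text and collapse whitespace so trivial differences vanish."""
    return " ".join(text.lower().split())


def group_similar(comments: list[str], threshold: float = 0.85) -> list[list[str]]:
    """Greedily group comments whose text is nearly identical.

    A comment joins the first group whose first member is at least
    `threshold` similar to it; otherwise it starts a new group.
    The threshold is an arbitrary assumption, not a tuned value.
    """
    groups: list[list[str]] = []
    for comment in comments:
        for group in groups:
            ratio = SequenceMatcher(None, normalize(comment), normalize(group[0])).ratio()
            if ratio >= threshold:
                group.append(comment)
                break
        else:
            # No existing group was similar enough.
            groups.append([comment])
    return groups


# Hypothetical review comments, invented for this example.
comments = [
    "Consider adding a comment explaining this CSS property.",
    "Consider adding a comment explaining this CSS property",
    "Typo in the test step name: 'visbile' should be 'visible'.",
]
groups = group_similar(comments)
print([len(group) for group in groups])
```

Groups with more than one member are the repeated suggestions; everything else is a distinct comment worth reading once.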

Also, while I like when the comments mention some less-popular features (e.g., quirks around value_format in LookML), it often results in another comment after applying the suggested change, stating that you’re using it. What’s worse, it can take about five more rounds to “make it happy”. Of course, you don’t have to apply these suggestions, but the proposed changes often sound interesting.

Bad parts

There’s a lot of noise. Reviewing the comments is not a problem – it takes a few minutes. The problem is that due to the way GitHub UI works, we end up with massive discussions. These clutter the code diffs (resolved comments are still displayed there) as well as the pull request itself. In an extreme case, we have over 150 comments, with more than two-thirds coming from Copilot…

…which makes GitHub crawl. That’s 1.17MB of HTML (246KB gzipped). And it takes between 3 and 8 seconds to load[4]! Here are sample server timings:

[Image: server timings]

Once it burns some of Microsoft’s CPU, it eats up mine, as it takes the next few seconds to render it (I tried all kinds of browsers, including Chrome, Firefox, and Safari). I was surprised to scroll through the HTML and see the comments there, as all of the conversations are already marked as resolved, i.e., not visible until explicitly opened. There were also a lot (>100) of the following blocks:

<include-fragment src="/ORGANIZATION/REPOSITORY/security/overall-count" accept="text/fragment+html" data-nonce="v2:UUID" data-view-component="true">
    <div data-show-on-forbidden-error hidden>
        <div class="Box">
            <div class="blankslate-container">
                <div data-view-component="true" class="blankslate blankslate-spacious color-bg-default rounded-2">
                    <h3 data-view-component="true" class="blankslate-heading">        Uh oh!
                    </h3>
                    <p data-view-component="true">
                    <p class="color-fg-muted my-2 mb-2 ws-normal">
                        There was an error while loading.
                        <a class="Link--inTextBlock" data-turbo="false" href="" aria-label="Please reload this page">Please reload this page</a>
                        .
                    </p>
                    </p>
                </div>
            </div>
        </div>
    </div>
</include-fragment>

…which is somewhat funny.
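Numbers like these are easy to check on a saved copy of the page. Below is a minimal sketch, assuming the HTML is already available as a string; the helper name `page_stats` and the tiny stand-in page are hypothetical, and the gzip size will not exactly match what GitHub serves, since the server’s compression settings are unknown.

```python
import gzip


def page_stats(html: str) -> dict[str, int]:
    """Return the raw size, gzipped size, and <include-fragment> count of a page."""
    raw = html.encode("utf-8")
    return {
        "raw_bytes": len(raw),
        # Python's default compression level; a real server may use another one.
        "gzip_bytes": len(gzip.compress(raw)),
        # Count opening tags only, so each block is counted once.
        "include_fragments": html.count("<include-fragment"),
    }


# Tiny stand-in for a saved pull request page (made up for this example).
sample = '<include-fragment src="/a"></include-fragment>\n' * 100
stats = page_stats(sample)
print(stats)
```

For a real page, read the saved file into a string first and pass it to the same function.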

Another problem is that it doesn’t really understand static analysis comments, like eslint-disable-next-line or @ts-expect-error. I know these are a last resort, but we frequently use them while interacting with third-party APIs (e.g., in the files with generated types). This not only leads to duplicated comments, but also to really long ones, as these often include code suggestions.

And finally, it gives you a false sense of security. Sure, most people will find it better than no review, but there’s a problem – if it found a “real bug” once or twice, you start to treat no comments as a green light. It’s not a problem of Copilot or any other tool; it’s something you have to account for in your team.

Closing thoughts

Our experiment with the AI code reviews is still ongoing. So far, the team sentiment is “such comments are worth the noise”; from the budget perspective, it seems to be worth it. Will we give the code generation a try…? Definitely not until I see it actually works – I value the mental health of my colleagues more than their performance[5].

I’m done here; now let’s write some proper, human review.

[1]

Sure, it’s the “bad” side of being a perfectionist. But so far it has only ever turned out well for me – both in my professional and private life. Or maybe I’m not a perfectionist, but rather a quality-oriented freak. (And most likely not autistic; I got checked twice.)

[2]

I was, however, forced to use one as a customer reaching out to support. In my experience, it’s surprisingly common for a question to immediately redirect you to a human agent, at least among delivery companies, e-commerce, and banking. Otherwise, e.g., with GitHub, I had to paraphrase my request four or five times in order for the bot to give up.

[3]

It’s normal to analyze emails with LLMs, right? Unfortunately, yes, and even some of my former colleagues don’t really care about copy-pasting their emails into models with questionable data privacy policies.

As I had to extract some information from ~500 messages, I decided to do it using an offline model, and quickly learned that while LLMs are fine at dealing with text, they can’t really handle encoded text (example). It was fun: deepseek-r1 started speaking Chinese, llama3.1 “detected” some LaTeX equations and tried to solve them, and other models just generated random text. Luckily, once I extracted the actual message from an encoded email, it was much better.

[4]

If someone from GitHub is reading this, please let me know where I should reach out with such bug reports. I’m discouraged from reporting anything to you, since I never got a single reply to any of my past reports.

[5]

I love this Reddit thread about Microsoft employees going insane while trying to guide Copilot to work on a few rather complex codebases. It’s a few months old now, but from what I hear from my friends at different companies, it’s not really better today, even with the latest models.