Creating our Review Assistant

Our journey toward creating an AI-powered review assistant: the challenges and pitfalls we hit, and how we addressed them to build a user-friendly product.

Overview
With the launch of new, sophisticated LLMs, we wanted to explore the idea of a Review Assistant: a bot that automatically parses a diff and gives key improvement suggestions. If successful, this could significantly reduce a developer's mental overhead, since it would automatically detect bugs and suggest changes that accurately reflect the business logic.
 
Leveraging existing LLMs, we were able to create a basic proof of concept that provided insightful review suggestions for changesets. It could identify anti-patterns, point out areas where the code could be more readable, suggest performance improvements, and much more.
 
Let’s uncover how we developed it.
 
Requirements
The review bot should be able to consume the whole PR and provide suggestions to improve the code. We set the bar for the product high: we wanted it to provide core value to the end user, not just surface-level feedback. The suggestions should not be basic ones, such as adding a comment or renaming a variable, but suggestions about the actual logic. Our litmus test for a basic suggestion was whether a human reviewer would lead the comment with a nit directive.
 
Initial Solution
We initially went with a very simple solution: an endpoint on the backend that loads the changeset and passes it to the LLM. For simplicity's sake, we used OpenAI's GPT-4 API. The initial solution showed some promise; it gave relevant suggestions for the changeset. However, there were a few drawbacks (a rough sketch of this first endpoint follows the list below):
  • The model would often provide basic suggestions. Most of the suggestions on a PR were around adding comments to variables and functions, which failed our litmus test of not warranting a nit on the review.
  • The response was often not in the correct format. It would try to suggest code changes and then return the entire changeset with the suggestions applied, rather than just the new implementation. While this would be great for a system that automatically makes code changes, it was not very user friendly. We wanted responses in plain English or easily readable code.
  • It would often exceed the context window. GPT-4's context window was restricted to 4k tokens (as of our testing). Because of this, a lot of PRs could not be processed: passing both the changeset and the source files would easily exceed the limit. Some prompt engineering would be needed to stay within it.
  • The response time was too long. GPT-4 is a very sophisticated model, and generating the output would often take a minute or more. The product experience felt laggy because it waited for the entire output rather than showing the partial response as it was generated.
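For context, here is a rough sketch of what that first endpoint looked like. This is a simplified reconstruction rather than our actual code: the route, the getChangeset helper, and the prompt wording are illustrative assumptions, and the GPT-4 call uses the OpenAI Node SDK.

import express from "express";
import OpenAI from "openai";

const app = express();
const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Hypothetical helper that fetches the PR diff and the touched source files.
declare function getChangeset(prId: string): Promise<{ diff: string; sources: string }>;

app.get("/review/:prId", async (req, res) => {
  const { diff, sources } = await getChangeset(req.params.prId);

  // Naive approach: dump the entire diff and source files into a single prompt.
  const completion = await openai.chat.completions.create({
    model: "gpt-4",
    messages: [
      { role: "system", content: "You are a code review assistant." },
      {
        role: "user",
        content: `Suggest improvements for this changeset:\n${diff}\n\nSource files:\n${sources}`,
      },
    ],
  });

  // Blocks until the entire response is generated, which is why the UX felt slow.
  res.json({ suggestions: completion.choices[0].message.content });
});

The drawbacks above fall straight out of this shape: everything goes into one prompt (so large PRs blow past the token limit), nothing constrains the output format, and the handler waits for the full completion before responding.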
 
Here is an example of a sub-optimal output that was generated for this React PR:
The changeset seems to be modifying the React Fiber Hooks implementation and the React Internal Types. Here are some suggestions for improving the changeset:
- Use more descriptive variable names: Instead of using names like filterCallback, consider using more descriptive names that indicate the purpose of the variable.
- Provide comments or documentation: Add comments or documentation to explain the purpose and functionality of the modified code. This will make it easier for other developers to understand the changes and maintain the code in the future.
- Ensure consistent formatting: Check for consistent formatting throughout the changeset, including indentation, line breaks, and spacing. Consistent formatting improves code readability and makes it easier to spot any potential errors.
- Consider separating unrelated changes: If possible, separate unrelated changes into different commits or changesets. This helps in better tracking and reviewing of individual changes.
- Verify the use of filterCallback parameter: Review the use of the filterCallback parameter in the useContext and readContext functions. Ensure that it is used correctly and consistently across all relevant parts of the code.
Remember to thoroughly test the modified code to ensure that it functions as expected and doesn't introduce any regressions or bugs.
 
Notice how most of the suggestions relate to formatting, comments, and variable names. This would not realistically be of significant value for the author or reviewer.
 
Based on the MVP, we decided to iterate on the product to #1 improve the response (suggestion quality as well as format) and #2 render the output incrementally as it's generated, similar to ChatGPT.
 
MVP 2
The iterations came down to two changes: prompt engineering to address #1, and response streaming to address #2.
  • Prompt Engineering
    • Apart from the model architecture, there are two main variables that affect LLM inference: the training data and the prompt. Even small changes in the prompt can trigger significant differences in the output of the model.
       
      In our case, we went through several iterations of prompts to specify exactly what we were looking for in the output: the types of suggestions, but also the format in which they are provided. LLMs rely on key words to understand the intent. For example, to prevent the model from spitting out the changeset, we had to specify "Do not provide the improved changeset." Even close variants such as "Do not return the updated code" or "Do not provide the updated changeset" produced incorrect results. A lot of trial and error was involved to get there. I can imagine a world in which an LLM takes in the intent, generates a set of candidate prompts for a different model, and identifies which one performs best.

      We also iterated on what information is passed in the context window to stay within the token limit. The initial version's input context included the changeset (code diff), the source files, and the PR description, and would often breach the token limit. By naively reducing the amount of information passed (e.g. providing just the changed lines), we could stay under the token limit, but the quality of the results dropped significantly. Through several iterations on the input as well as the prompt, we landed on a strategy with a small context and no significant drop in output quality. A sketch of this prompt-construction approach appears after this list.
       
  • Streaming output
    • Luckily, the ChatGPT API supports streaming partial responses out of the box. With this, we were easily able to create a ChatGPT-like experience for the Review Assistant; a minimal streaming sketch also follows this list.
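To make the prompt engineering concrete, here is a hedged sketch of where the prompt construction landed. The exact production prompt isn't reproduced here; the instruction wording and the extractChangedLines helper are illustrative, but they show the two levers we iterated on: explicit output directives and a trimmed-down context.

// Illustrative system prompt: explicit directives about suggestion type and output format.
const SYSTEM_PROMPT = [
  "You are a code review assistant.",
  "Focus on logic, correctness, and performance issues, not style nits.",
  "Do not provide the improved changeset.", // the exact phrasing mattered; close variants failed
  "Return each suggestion in plain English, optionally with a short code snippet.",
].join("\n");

// Hypothetical helper: keep only the changed hunks plus a few surrounding lines,
// instead of the whole source files, to stay within the token limit.
function extractChangedLines(diff: string, contextLines = 3): string {
  const lines = diff.split("\n");
  return lines
    .filter((_, i) =>
      lines
        .slice(Math.max(0, i - contextLines), i + contextLines + 1)
        .some((l) => l.startsWith("+") || l.startsWith("-")),
    )
    .join("\n");
}

function buildMessages(diff: string, prDescription: string) {
  return [
    { role: "system" as const, content: SYSTEM_PROMPT },
    {
      role: "user" as const,
      content: `PR description:\n${prDescription}\n\nChanged lines with context:\n${extractChangedLines(diff)}`,
    },
  ];
}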
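On the streaming side, here is a minimal sketch using the OpenAI Node SDK's streaming support. How we forward the chunks to the frontend isn't shown (writing to an HTTP response stream below is a stand-in); the key part is passing stream: true and iterating over the partial chunks as they arrive.

import OpenAI from "openai";

const openai = new OpenAI();

// Stream partial tokens to the client as GPT-4 generates them.
async function streamReview(
  messages: OpenAI.Chat.Completions.ChatCompletionMessageParam[],
  res: NodeJS.WritableStream,
) {
  const stream = await openai.chat.completions.create({
    model: "gpt-4",
    messages,
    stream: true, // yields incremental chunks instead of one final completion
  });

  for await (const chunk of stream) {
    const token = chunk.choices[0]?.delta?.content ?? "";
    res.write(token); // forward each partial token as soon as it arrives
  }
  res.end();
}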
 
Final Product
We now have the final product: clicking a button runs the LLM and provides output that feels almost instantaneous. While the output still takes a long time to fully generate, seeing the partial output as it streams in makes it feel a lot snappier. Moreover, the suggestions actually help make the code more maintainable and help reduce bugs. Here is a much better output for the same PR:
Here are some suggestions to improve the changeset:
  1. In ReactFiberHooks.js, you can simplify the type definition for filterCallback by using a single type instead of a union type. You can define a default filter function that always returns true and use it as the default value for filterCallback. This way, you can avoid checking for the existence of filterCallback in multiple places.
// Define a default filter function that always returns true
const defaultFilterCallback = (prevState, nextState) => true;

// Update the type definition for filterCallback
type FilterCallback = (prevState: any, nextState: any) => boolean;

// Use the default filter function as the default value for filterCallback
useContext<T>(
  context: ReactContext<T>,
  filterCallback: FilterCallback = defaultFilterCallback,
): T {
  // ...
  return readContext(context, filterCallback);
}
  2. In ReactFiberNewContext.js, you can simplify the condition for checking if the context matches by using the filterCallback directly, since it will always be a function.
if (
  dependency.context === context &&
  dependency.filterCallback(
    dependency.memoizedValue,
    readContextValue(dependency.context),
  )
) {
  // ...
}
  3. In ReactInternalTypes.js, you can update the type definition for ContextDependency to use the FilterCallback type.
export type ContextDependency<T> = {
  context: ReactContext<T>,
  next: ContextDependency<mixed> | null,
  memoizedValue: T,
  filterCallback: FilterCallback,
  ...
};
  4. In ReactHooks.js, you can update the useContext function signature to use the FilterCallback type and the default filter function.
export function useContext<T>(
  Context: ReactContext<T>,
  filterCallback: FilterCallback = defaultFilterCallback,
): T {
  // ...
  return dispatcher.useContext(Context, filterCallback);
}
These suggestions should help make the code more maintainable and easier to understand.
 
 


Hope you enjoyed the blog post, and stay tuned for more!
 
 
