Question-Answering on Source Code Repositories by Combining Local and Cloud Processing

Question-Answering on Source Code Repositories by Combining Local and Cloud Processing

Question-Answering on Source Code Repositories.

Understanding new code repositories is a challenging task. It is so challenging that developers often spend weeks or months getting to know the codebase of a company, along with intensive help from colleagues. And if no-one is available who deeply knows the code repository, it can quickly become frustrating.

LLM chatbots like ChatGPT have become everyday tools for people trying to learn new programming languages or understand specific frameworks. Their knowledge, however, only extends to public code repositories that have either made it into the training set or can be supplied via external source code repository tools.

To allow our users to ask questions about any code base that they are interested in, we at Pieces have developed a contextual question-answering system for code that can ingest any locally saved source code repository. Once pointed at a code directory, it can help answer questions about the code base, or even generate additional helpful code, such as unit tests.

In this blog post, I want to describe how our system works and what makes it run fast as well as perform well on any personal code repository.

Part 1: Indexing Code Repositories

To be able to work with a source code repository, we first need to gain an overview of the code within the repository. We do this through a divide and conquer approach: first, we segment the code of each file by topic. Then, we feed each segment through a custom-trained encoder model and save the indexes in a vector database.

Since the indexing step is run locally on users’ machines and on potentially large code repositories, we opted for a method that can run fast and reliably on large and small repository sizes while being light on memory usage. And since the landscape of programming languages is very diverse, we steered away from parsing tools such as tree-sitter that would limit the usage to a fixed number of programming languages.

For segmentation, we use a greedy algorithm that tries to find line breaks that define topic and flow changes within the code. Once the paragraph breaks are found, a post-processing script is run over the file which moves the line breaks up or down to account for unclosed breaks and indentation.

For the encoder, we use a custom trained model which generates sentence embeddings using a shallow network and word re-weighing. This allows phrase search to remain accurate while also encoding semantic similarity. Through this method, we allow users to index their code repositories once, and then ask any number of questions about it without having to wait for re-indexing. And since the indexing is performed locally, it can be run in the background without requiring network access.

Part 2: Retrieval and Question-Answering

To allow our generative language model to answer users’ questions accurately, it needs access to as much relevant information as possible. On the flip side, the maximum amount of context the large language model (LLM) can ingest is limited, so we need to find only the most relevant parts of a script that can help answer the question.

To retrieve the relevant code snippets from the private source code repository, we encode the user query using the same method we also used for code snippet encoding. Using our vector database, we then retrieve those segment embeddings that are closest to our query embedding. We then adapt the length of the retrieved results dynamically to be able to fit the prompt window of our LLM.

Given the user’s question and a concise selection of the relevant source code segments associated with the question, the cloud LLM can then generate a suitable answer. And if the question follows a longer conversation, a summary of the conversation is passed to the model as well. To make the conversation flow even easier for the user, we also display suggested follow-up questions for the user to ask.

Summary & Outlook

This combination of local, fast indexing and retrieval, and accurate answer generation using a cloud LLM allows for large code repositories to be digested and queried easily. Currently, we access OpenAI cloud models to answer user’s questions and generate custom code. In the future, we plan to allow users to choose between a variety of language models, including locally running, privacy-preserving models and company internal server hosted models.

Try the Pieces Copilot completely free, and start asking questions about your local source code repository.