How to integrate a GitHub Repository with Large Language Models (LLMs)

In this guide, we will index a GitHub repository with LlamaIndex so that you can query your codebase in natural language through an LLM.

Prerequisites:

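Ensure you have the llama_index and llama_hub libraries installed. The pickle and os modules ship with Python's standard library and need no installation.
Have an OpenAI API key and a GitHub personal access token ready.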

Step-by-Step Guide:

1. Environment and Library Initialization:

Start by importing the standard-library modules used for caching (pickle) and for environment and file access (os).

import pickle
import os

2. Configuring OpenAI:

Before diving deep, you need to set the API key for OpenAI.

os.environ["OPENAI_API_KEY"] = "<YOUR_OPENAI_API_KEY>"
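Hard-coding a key in a source file makes it easy to commit by accident. A minimal sketch of an alternative, assuming the key was already exported in your shell before the script runs; require_env is a hypothetical helper name, not part of any library:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


# Usage (assumes OPENAI_API_KEY was exported in the shell beforehand):
# openai_api_key = require_env("OPENAI_API_KEY")
```

The same helper could read the GitHub key used in step 5, so that neither secret appears in the script.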

3. Load Llama Index:

LlamaIndex handles fetching and indexing the data. Make sure the loader for GitHub repositories is downloaded before you use it.

from llama_index import download_loader, GPTVectorStoreIndex
download_loader("GithubRepositoryReader")

4. Setting Up the GitHub Client:

To connect to your GitHub repository, import the GitHub client and the repository reader. The client itself is created in the next step.

from llama_hub.github_repo import GithubClient, GithubRepositoryReader

5. Fetching Repository Data:

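Use the GitHub client to fetch the data from your repository. Here, we focus on the sec-insights repository owned by the LlamaIndex team (run-llama on GitHub) and extract only the Python files from the backend directory. The fetched documents are cached in docs.pkl so that later runs skip the GitHub requests.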

# File path and parameters
file_path = "docs.pkl"
github_personal_key = "<YOUR_GITHUB_PERSONAL_KEY>"
repository_owner = "run-llama"
repository_name = "sec-insights"
directories_to_include = ["backend"]
file_extensions_to_include = [".py"]
github_branch = "main"

# Load existing data if file exists
docs = None
if os.path.exists(file_path):
    with open(file_path, "rb") as f:
        docs = pickle.load(f)

# If data not loaded, fetch from GitHub and save
if docs is None:
    github_client = GithubClient(github_personal_key)
    loader = GithubRepositoryReader(
        github_client,
        owner=repository_owner,
        repo=repository_name,
        filter_directories=(directories_to_include, GithubRepositoryReader.FilterType.INCLUDE),
        filter_file_extensions=(file_extensions_to_include, GithubRepositoryReader.FilterType.INCLUDE),
        verbose=True,
        concurrent_requests=10,
    )
    docs = loader.load_data(branch=github_branch)

    # Save fetched data for future use
    with open(file_path, "wb") as f:
        pickle.dump(docs, f)
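The load-or-fetch pattern above can be factored into a small reusable function. This is a sketch under the same assumptions as the code in this step; load_or_fetch is a hypothetical name introduced here for illustration:

```python
import os
import pickle
from typing import Any, Callable


def load_or_fetch(cache_path: str, fetch: Callable[[], Any]) -> Any:
    """Return the pickled object at cache_path, calling fetch() only on a cache miss."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    data = fetch()
    # Save the fetched data so the next run can skip the fetch
    with open(cache_path, "wb") as f:
        pickle.dump(data, f)
    return data


# Usage, assuming loader has already been constructed as shown above:
# docs = load_or_fetch(file_path, lambda: loader.load_data(branch=github_branch))
```

Two caveats apply to either version: the cache never refreshes on its own, so delete docs.pkl after the repository changes, and pickle files should only be loaded when they come from a source you trust.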

6. Indexing Data:

Once your data is fetched, the next step is to index it. GPTVectorStoreIndex converts the documents into embedding vectors that can be searched efficiently later. With the default settings, this step calls the OpenAI API, so the key from step 2 must be set.

index = GPTVectorStoreIndex.from_documents(docs)
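To make the idea of searching documents as vectors concrete, here is a toy sketch of similarity search. It is not how GPTVectorStoreIndex works internally: a real index uses learned embeddings from a model, while this example assumes plain word-count vectors and cosine similarity. All function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import Sequence


def to_vector(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts, a stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query: str, documents: Sequence[str]) -> str:
    """Return the document whose vector is closest to the query vector."""
    query_vec = to_vector(query)
    return max(documents, key=lambda doc: cosine_similarity(query_vec, to_vector(doc)))


snippets = [
    "def get_user(user_id): return db.query(User).get(user_id)",
    "def send_email(address, body): smtp.send(address, body)",
]
# The query shares the word "send" only with the second snippet
best = most_similar("how do we send an email", snippets)
```

Word counts only match exact words, which is why real indexes rely on model embeddings that also capture similar meaning.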

7. Querying Your Repository:

With everything in place, you can now query your indexed data. For example, to gain insights into your API endpoints, simply execute:

query_engine = index.as_query_engine()
response = query_engine.query("Retrieve all API endpoints")

print(response)
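If you want to ask several questions in one run, a small loop over the same query call is enough. A sketch; ask_all is a hypothetical helper, and it assumes only that the engine exposes the query method used above:

```python
from typing import Any, Dict, Iterable


def ask_all(query_engine: Any, questions: Iterable[str]) -> Dict[str, str]:
    """Run each question through query_engine.query and collect the answers as text."""
    answers = {}
    for question in questions:
        response = query_engine.query(question)
        answers[question] = str(response)  # same text form that print(response) shows
    return answers


# Usage with the engine built in this step:
# for question, answer in ask_all(query_engine, ["Retrieve all API endpoints"]).items():
#     print(f"{question}\n{answer}\n")
```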

Security

When you use the OpenAI APIs, the contents of your repository are sent to OpenAI's servers and processed there. If that is a security concern, consider switching to local models such as llama.cpp and Hugging Face embeddings.

Local models keep your data within your own infrastructure. Just make sure your hardware can run them efficiently.

Conclusion:

By integrating AI capabilities with your GitHub repository, you can derive insights, query specific parts of your codebase, and enhance your development workflow. It's a leap towards smarter code management and comprehension.

Remember to keep your API keys confidential and adjust the repository specifics to your needs. Happy coding!

References

https://github.com/EmanuelCampos/gh-index
https://llamahub.ai/l/github_repo
https://gpt-index.readthedocs.io/
