Semantic Search On GitHub (prototype)

Sergey Nikolaev

Posted on August 26, 2024

Introduction

GitHub's search function can sometimes struggle, especially when you try to search by asking a direct question. This approach often leads to unrelated results, which can be frustrating. This issue is more noticeable when searching through issues or pull requests, where the details really matter.

Let's look at an example:

GitHub Search vs Manticore Semantic Search on GitHub

GitHub's search has some limits, but the world of search technology is improving fast. Semantic search, which understands the context and meaning behind words, not just the words themselves, is becoming more popular. While GitHub hasn't added this feature yet, it could really help make searches better and more relevant.

With this in mind, we've built a project that uses semantic search to help developers find things more easily in their repositories. We use Manticore Search, which supports Vector Search, to offer a customizable semantic search option that fits different needs. It shows how useful this technology can be in practice.

FYI, Manticore Search is a powerful open-source search engine that has stood the test of time, with roots stretching back to 2001. Originally known as Sphinx, it served as a full-text search solution for MySQL and PostgreSQL databases. The project took on a life of its own when, in 2017, it was forked and reborn as Manticore Search, continuing to evolve as an independent, fully open-source search engine.

What is Semantic Search?

Semantic search is a search technique that goes beyond just matching keywords. It tries to understand the meaning and context of words to improve search accuracy, considering both the user's intent and the contextual meaning of terms.

Benefits

  • Context Understanding: It interprets the context of queries to deliver precise results.
  • Improved Accuracy: It reduces irrelevant results by understanding user intent.
  • Enhanced User Experience: It saves time and effort by quickly providing relevant information.

The Problem with Traditional Search on GitHub

When you search using simple keywords on GitHub, you often don't get what you're really looking for. Say you type in "bug fix" to find some help. The search might show you pages that mention "bug fix" exactly, but it could miss related topics like "error resolution" or "problem solving."

This type of searching can lead to lots of results that aren't helpful. Since these searches don't get the subtleties of how we talk, you can end up feeling pretty frustrated. Developers end up wasting too much time looking through these unrelated results, which can slow down their projects and cut down on how much they get done.

Keyword searches still have their place, though. For quick, specific searches where you know exactly what you need, like finding a particular error code, they can be super quick and straight to the point.

Here we searched for "integration bugfix" on GitHub and found nothing in the repository:

Integration Bug Fix on GitHub

The same search for "integration bugfix" in Manticore Semantic Search returns a far more useful result:

Integration Bug Fix at Manticore Semantic Search

We've made a prototype of what semantic search on GitHub could look like. Check out our GitHub Issue search demo powered by Manticore Vector Search. It allows you to search through GitHub issues, PRs, and comments in a way that understands context. This is especially useful when you can't remember the exact words of an issue but know its context. You can also add your own repository here and run the project locally by following the instructions on GitHub.

Let's look at how this approach can improve the relevance and accuracy of your search results.

Success Story: Adding Vector Search to Manticore GitHub Demo

When we integrated Vector Search into our GitHub issue search demo, which showcases the capabilities of Manticore Search, the results were impressive. Traditional keyword search remains highly effective when the specific terms are known and matching them exactly is critical. Semantic search complements it by finding what users are looking for when the intent or meaning of the query matters as much as the specific words used.

Using pre-trained models from Hugging Face, we turned text into high-dimensional vectors. These vectors capture the meaning behind the words, allowing us to run more accurate searches.

Here are a few examples of how it can improve the quality of search in the Manticore Search repository:

Example: Finding open bugs more easily

Memory Leak Example

Imagine you're a developer looking for issues related to a specific bug. A traditional search for "memory leak" might miss issues titled "limit the memory usage" or "index out of memory". With Vector Search, the engine knows these terms are similar. This means you get all the relevant results without guessing all the possible keywords.

Example: Checking if a feature request exists before opening a new one

User Authentication Example

Think about users searching for feature requests related to "user authentication." Keyword searches might only show issues with the exact phrase, but semantic search understands related terms like "login system", "Access denied", and "Session-level user variables". This way, no valuable feedback is missed.

Example: Easier collaboration

API Rate Limits Example

Contributors working on different parts of a project can really benefit from semantic search. For example, a search for "API rate limits" brings up relevant discussions about "throttling", the "250 results" limit, and "rate limiting". This helps team members connect related issues even if they use different terms.

Example: Security audits

SQL Injection Example

Security audits need thoroughness, often requiring searches for different security vulnerabilities. A search for "SQL injection" with traditional keyword methods might miss issues under "database infiltration" or "SQL vulnerability". Semantic search makes sure all related security concerns are found, helping with more complete security audits.

How to Get Started with Semantic Search on GitHub Using Manticore Search

To implement semantic search in our GitHub demo project, we followed these steps:

  • Setting Up Manticore: We installed Manticore Search along with the Columnar library, which implements the vector search functionality.

  • Creating the Database Structure: We set up a real-time table in Manticore Search to store GitHub issues and their semantic representations. These representations, known as embeddings, are stored as arrays of numbers (vectors). The table includes fields for the issue text, a unique identifier, and a vector that captures the semantic meaning of the text.

    Here's an example of the schema we used:

    CREATE TABLE issues (
        id BIGINT,
        body TEXT,
        vector FLOAT_VECTOR knn_type='hnsw' knn_dims='4' hnsw_similarity='l2'
    );
    

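    In this setup, body holds the text of the issue, id is a unique identifier, and vector is a text embedding for the body. The value of knn_dims has to match the number of dimensions your embedding model produces; 4 is used here only to keep the example short.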

    You might be asking: What are text embeddings?

    Text embeddings are a way to turn words or phrases into numbers that show their meaning and how they relate to each other. Think of it as a method to convert text into a format that computers can understand. These number representations help machines analyze text better, making it easier to compare different texts and find similarities.

    In our example, the vector is a series of numbers that capture the main idea of the text in the body field. This allows us to do things like finding similar issues or grouping related topics, even if they use different words to describe the same idea.

    We used an AI model from Sentence Transformers. If you want an easy way to get started, we suggest checking out the HuggingFace Text Embedding API service. It lets you run your own API and create embeddings tailored to your needs.

  • Insert Data: Fill your table with vector data.

    Let's take a look at how an insert statement would work with the schema we just talked about. This will give us a clear picture of how data is added to our database structure.

    INSERT INTO issues VALUES (
      1,
      'Hello World',
      (0.653448, 0.192478, 0.017971, 0.339821)
    ), (
      2,
      'This is a bug',
      (-0.148894, 0.748278, 0.091892, -0.095406)
    );
    
  • Query Data: Retrieve contextually relevant information using vector-based queries.

    To fetch documents using vector queries, we follow these steps:

    1. Get the search query.
    2. Generate a text embedding for the query.
    3. Use the resulting vector in a query like this:

    SELECT id, body
    FROM issues
    WHERE knn ( vector, 10, (0.286569, -0.031816, 0.066684, 0.032926) );
    

    Note that 10 is the parameter K, which represents the number of nearest neighbors (closest vectors) to retain in the result set. By default, the results are sorted by vector distance, with the closest ones appearing first.
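To make the role of K and the distance ordering concrete, here is a minimal Python sketch that ranks the two sample rows against the sample query vector by plain L2 distance. It is a brute-force illustration only: `ROWS`, `l2_distance`, and `knn` are hypothetical names introduced for this example, and an HNSW index is an approximate structure that does not scan every row like this.

```python
import math

# Sample rows copied from the INSERT statement above: id -> (body, vector).
ROWS = {
    1: ("Hello World", (0.653448, 0.192478, 0.017971, 0.339821)),
    2: ("This is a bug", (-0.148894, 0.748278, 0.091892, -0.095406)),
}


def l2_distance(a, b):
    """Euclidean (L2) distance between two equal-length vectors."""
    return math.dist(a, b)


def knn(query, k):
    """Return up to k (id, body) pairs, closest vector first (exact scan)."""
    ranked = sorted(
        ROWS.items(),
        key=lambda item: l2_distance(item[1][1], query),
    )
    return [(doc_id, body) for doc_id, (body, _vector) in ranked[:k]]


# Query vector copied from the SELECT statement above.
QUERY = (0.286569, -0.031816, 0.066684, 0.032926)
RESULT = knn(QUERY, 10)
print(RESULT)
```

With only two rows in the table, K = 10 simply returns both of them, with the row whose vector is closest to the query listed first.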

That's a wrap! In just a few steps, we've built semantic search using text embeddings and the Vector Search feature of Manticore Search. It's as simple as that! 🙌
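If you generate embeddings in application code, the insert and query steps above come down to rendering a list of floats into the SQL shown earlier. The sketch below assumes the embedding already exists as a Python sequence of floats (however your model produced it). The helper names `format_vector`, `escape_text`, `build_insert`, and `build_knn_query` are made up for this example, and the hand-rolled escaping is a simplification; a real client should rely on its driver's escaping or parameter binding.

```python
def format_vector(vector):
    """Render floats as a tuple literal such as (0.100000, 0.200000)."""
    return "(" + ", ".join(f"{x:.6f}" for x in vector) + ")"


def escape_text(text):
    """Naive escaping for a single-quoted SQL string (illustration only)."""
    return text.replace("\\", "\\\\").replace("'", "\\'")


def build_insert(doc_id, body, vector):
    """Build an INSERT for the issues table defined in the schema above."""
    return (
        f"INSERT INTO issues VALUES "
        f"({int(doc_id)}, '{escape_text(body)}', {format_vector(vector)})"
    )


def build_knn_query(vector, k=10):
    """Build the KNN SELECT for a query embedding; k is the neighbor count."""
    return (
        f"SELECT id, body FROM issues "
        f"WHERE knn ( vector, {int(k)}, {format_vector(vector)} )"
    )


INSERT_SQL = build_insert(2, "This is a bug", (-0.148894, 0.748278, 0.091892, -0.095406))
SELECT_SQL = build_knn_query((0.286569, -0.031816, 0.066684, 0.032926))
print(INSERT_SQL)
print(SELECT_SQL)
```

The resulting strings can then be sent to Manticore Search with whichever client you already use.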

Keyword search vs Semantic search

When it comes to enhancing code search on GitHub, semantic search offers some distinct advantages:

  • Better search results: Semantic search understands the meaning behind your queries, allowing you to find relevant code, even if it doesn't match the exact keywords you use.
  • Context-aware code exploration: It can navigate through large codebases more intelligently, helping you understand how different pieces of code relate to each other.
  • Efficient troubleshooting: By understanding the context, semantic search can quickly surface relevant issues, solutions, and code snippets that help resolve bugs faster.
  • Easier discovery of relevant implementations and ideas: It can identify similar implementations or suggest alternative approaches based on the code's intent, not just its wording.

However, it's important to consider the limitations of semantic search compared to traditional keyword search:

  • Computational complexity: Running semantic searches can require significant processing power and may take longer than keyword searches, especially in large repositories.
  • Potential for misinterpretation: The AI might not always get the context or intent of a query right, leading to less relevant results.
  • Lack of precise control: Developers might find it harder to locate exact phrases or specific code snippets when the search engine is interpreting meaning rather than matching keywords.
  • Dependency on training data: The quality of semantic search results is closely tied to the AI model's training data, which means it may not always align perfectly with the latest code patterns or terminology.

Given these considerations, the future of GitHub search likely lies in a hybrid approach. By blending the strengths of both semantic and keyword search, GitHub can offer a more powerful tool for developers:

  • User interface options: Allowing users to toggle between semantic and keyword search depending on their immediate needs.
  • Hybrid search algorithms: Combining the deep understanding of semantic search with the precision of keyword matching to provide the most relevant results.
  • Contextual switching: Automatically choosing the best search method based on the type of query and user behavior, ensuring the best possible outcomes.

Incorporating both methods into GitHub's search capabilities will help developers find the right code, faster and more efficiently, balancing the nuanced understanding of semantic search with the reliability and speed of keyword search.
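As a rough illustration of what a hybrid search algorithm could look like, the sketch below merges a keyword ranking and a semantic ranking with reciprocal rank fusion, one common way to combine two result lists. Nothing here describes what GitHub or our demo actually does: the function name, the document ids, and the constant 60 are assumptions chosen for the example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked id lists: each list adds 1 / (k + rank) to an id's score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical issue ids, best match first in each list.
KEYWORD_RESULTS = [3, 1, 2]
SEMANTIC_RESULTS = [1, 4, 3]

FUSED = reciprocal_rank_fusion([KEYWORD_RESULTS, SEMANTIC_RESULTS])
print(FUSED)
```

Issues that appear near the top of both lists rise to the top of the merged list, while issues found by only one method are still kept further down.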

The Future of GitHub Search: Smarter with Semantic Search

GitHub's traditional keyword-based search is likely to evolve toward a smarter, more intuitive approach: semantic search. This technology could change how developers interact with repositories, boosting productivity and making the development process smoother, especially when searching through pull requests, issues, and comments.

Semantic search for pull requests, issues, and comments offers several key advantages:

  1. Context-aware results: Unlike traditional search that relies on exact keyword matches, semantic search understands the context and intent behind your query. This means you're more likely to find the relevant pull requests, issues, and comments, even if they don't use the exact words you searched for.
  2. Natural language processing: You can search using everyday language, without needing to remember specific syntax or keywords. This makes it easier to find what you need.
  3. Improved relevance ranking: Semantic search can prioritize results based on how closely they match the meaning of your query, saving you time when navigating through numerous pull requests, issues, or comments.
  4. Understanding of synonyms and related concepts: The search engine can recognize related terms and concepts, widening the scope of relevant results without the need for multiple searches.
  5. Enhanced collaboration: By making it easier to find related discussions and contributions, semantic search can improve team collaboration and knowledge sharing within projects.
  6. Historical context: Semantic search has the potential to understand the evolution of discussions in issues and pull requests, offering more comprehensive results that include relevant historical context.
  7. Cross-repository insights: Advanced semantic search could potentially provide insights across multiple repositories, helping developers discover related discussions or solutions in other projects.

The implementation of semantic search is becoming more achievable thanks to advanced databases like Manticore Search. With its built-in vector search capabilities, Manticore Search is paving the way for platforms like GitHub to adopt this cutting-edge technology.

While GitHub hasn't fully integrated semantic search yet, developers can experience its power through our demo project. This open-source initiative showcases the potential of semantic search in a GitHub-like environment.
