Demo: GitHub search with Manticore Search
Sergey Nikolaev
Posted on February 29, 2024
TL;DR: In this blog post, we show how we used Manticore Search to build a search app very similar to the one GitHub uses to find issues.
- Try out the demo:
- Crawl your repo - https://github.manticoresearch.com/. Crawling takes some time, so you'll have to wait.
- Search in a crawled repo - https://github.manticoresearch.com/manticoresoftware/manticoresearch.
- GitHub project - https://github.com/manticoresoftware/manticore-github-issue-search.
- Set it up for yourself to look through your GitHub issues, pull requests, and comments in a new way.
Introduction
In our journey to effectively highlight the capabilities and performance of Manticore Search, we realized the importance of selecting a real-world application that could act as a persuasive showcase. We considered several options — common choices like:
- e-commerce sites
- directory listings
- movie databases etc.
While these are familiar and easily understood examples and Manticore is a perfect fit for them, they fell short in offering practical value.
That's when inspiration struck: why not create a search tool for GitHub issues? Not only would this give us a powerful demo, it would also let us enhance the search experience with Manticore Search's advanced features, which would be useful at least to us, the Manticore core team.
We embraced this challenge and are proud to present our creation: a specialized search engine tailored for GitHub issues. This isn't just a demonstration; it's a practical tool that we hope will be of great use to the GitHub community.
We invite you to explore and interact with our GitHub issue search at https://github.manticoresearch.com. Discover the full potential of Manticore Search through this hands-on experience. Enjoy the benefits of our enhanced search capabilities and see for yourself how Manticore Search can transform data exploration.
In some cases, our searches run up to 30 times faster than GitHub's. Curious about how it works and how we built it? Let's dive in.
Prerequisites
The concept was straightforward: fetch the data from selected GitHub repositories into Manticore Search. With the data full-text indexed, we could offer effective search on top of it.
Our decision was to maintain a design that closely resembled GitHub's, but with subtle enhancements in the user interface to accommodate not just the tech-savvy users but also those with less technical expertise.
Moreover, we aimed to introduce additional features such as:
- combined search across issues and comments
- advanced filtering options
- ability to sort results based on reactions
- infinite scroll pagination
Let's delve into the details, examine the challenges at hand, and explore how Manticore can address these through a practical example.
Choosing the right tool for the MVP
Developing a demo can be quite the challenge, but when you're racing against the clock, you need all the help you can get. That's exactly why we leaned into the tried-and-true combination of PHP for the backend and JavaScript for the client side, with a dash of SEO-friendly hybrid magic. Why PHP, you ask? Well, it's like strapping a jetpack to your project's back! It's quick to start, simple to validate, and a breeze to test. And, of course, our team has more experience with PHP than with other beautiful and modern programming languages. (BTW, read about how you can build a PHP plugin for Manticore Search written in C++.)
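Manticore Search also comes with a PHP client, which we use in the demo. Using it is as simple as this:
<?php
// Load the Composer autoloader (package manticoresoftware/manticoresearch-php)
require_once __DIR__ . '/vendor/autoload.php';

use Manticoresearch\Client;

// Connect to Manticore Search over its HTTP port
$client = new Client(['host' => 'localhost', 'port' => 9308]);
// Pick the table to query
$index = $client->index('repo');
// Run a full-text search and iterate over the hits
$docs = $index->search('bug')->get();
foreach ($docs as $doc) {
    var_dump($doc->getId(), $doc->getData());
}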
Just like that, you create a Manticore client, pick the table you want to chat with, send off a search request, and voilà, the results come pouring in.
We won't plunge into the deep end of the Manticore Search Client for PHP here, but if you're itching to give it a whirl, check out its repository: Manticore Search PHP Client.
The demo comprises various components, as it:
- Fetches data from GitHub
- Maintains a queue of repositories to process
- Can send notifications via email
- And more
For your convenience, all the code interacting with Manticore Search is located in Manticore.php. This might also be useful for anyone considering a comparison of different storage engines in the future.
Interesting challenges we had to overcome
While working on the demo, besides implementing the routine pieces mentioned above, we ran into a few interesting challenges that you may also stumble upon in your own projects.
Relevance in Search Results when combining two tables
A critical aspect of any search system is the relevance of its results. When implementing a GitHub issue demo with Manticore Search as the backend, it's noteworthy that relevance is efficiently managed right out of the box. Manticore Search employs classical BM25-based ranking, which orders results by how frequent and how important the query keywords are within each document, with field length normalization (accounting for the length of the text field where the matching term is found). This means that there's no need for elaborate configurations or complex algorithms to begin with a highly effective search experience. For more details, you can refer to the documentation: Ranking Overview.
The challenge we faced involved performing a combined search across GitHub issues and comments. Technically, we split the data into two separate tables at the Manticore level: one for issues and another for comments. After researching ranking mechanisms, we decided to implement the Rank-Biased Precision (RBP) algorithm, which allows us to amalgamate results from two distinct sources. Additionally, Manticore Search provides a 'score' field that can be retrieved using the $doc->getScore() method from the PHP client. You can examine the code here: Manticore.php Code.
As a result, we not only achieve relevance 'out of the box' but also leverage RBP to combine two sources, maximizing the effectiveness of the search results!
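To make the idea of combining two ranked lists more concrete, here is a minimal sketch of an RBP-style merge. It is not the code from Manticore.php: the function names, the persistence value p = 0.8, the tie-breaking by raw score, and the sample hits are all assumptions made for illustration. Each hit gets the weight (1 - p) * p^(rank - 1) based on its position in its own list, and the two lists are then sorted together by that weight.

```php
<?php
declare(strict_types=1);

/**
 * RBP weight for a 1-based rank: (1 - p) * p^(rank - 1).
 * Hypothetical helper; p is the "persistence" parameter.
 */
function rbpWeight(int $rank, float $p): float
{
    return (1.0 - $p) * ($p ** ($rank - 1));
}

/**
 * Merge two ranked lists (best hit first) into one list.
 * Each hit is ['id' => int, 'score' => float]; the score is what
 * $doc->getScore() would return. The shape is an assumption.
 */
function mergeByRbp(array $issues, array $comments, float $p = 0.8): array
{
    $merged = [];
    foreach (['issue' => $issues, 'comment' => $comments] as $type => $hits) {
        foreach (array_values($hits) as $i => $hit) {
            $merged[] = [
                'type'   => $type,
                'id'     => $hit['id'],
                'score'  => $hit['score'],
                'weight' => rbpWeight($i + 1, $p),
            ];
        }
    }
    usort($merged, static function (array $a, array $b): int {
        // Higher weight first; equal weights fall back to the raw score
        return [$b['weight'], $b['score']] <=> [$a['weight'], $a['score']];
    });
    return $merged;
}

// Made-up sample hits, only to show the call
$issues   = [['id' => 1, 'score' => 2500.0], ['id' => 2, 'score' => 1800.0]];
$comments = [['id' => 10, 'score' => 2100.0]];
$merged   = mergeByRbp($issues, $comments);
```

Because the weight depends only on the position within each source list, neither table can crowd out the other: the top comment competes with the top issue, the second comment with the second issue, and so on.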
Advanced filtering of issues and comments
Step 1: Rendering the Ranges
In the realm of search functionality, a mere basic search often doesn't suffice. Users frequently need to employ filters to refine their results. Implementing simple filters, such as those based on range or equality, is straightforward in Manticore Search and many other search engines. However, when it comes to grouping results within certain ranges, the task might appear daunting — but in reality, it's quite manageable with Manticore Search.
Our goal is to enable users to select predefined ranges and apply filters accordingly, all while avoiding the need for storing or caching any additional data. For instance, we aim to filter issues by the number of comments: fewer than 5, from 5 to 9, and 10 or more. Manticore Search simplifies this process with its INTERVAL function, which returns the index of the range a value falls into. Let's see how it's implemented in the demo.
We devised a special method that generates our desired ranges along with the count of items falling within each range. Here is pseudo code to understand how easy it is:
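// $values holds the range boundaries, e.g. [5, 10]
$client = static::client();
$index = $client->index('issue');
$search = $index->search('');
// INTERVAL() expects the boundaries as a comma-separated list
$range = implode(',', $values);
$facets = $search
    ->limit(0) // only counts are needed, not documents
    ->filter('repo_id', $repoId)
    // Map each issue to the index of the range its comment count falls into
    ->expression('range', "INTERVAL(comments, $range)")
    // N boundaries produce N + 1 ranges
    ->facet('range', 'counters', count($values) + 1)
    ->get()
    ->getFacets();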
You can review the complete code in Manticore.php in the project repository.
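If the bucket numbers produced by INTERVAL feel abstract, the following pure-PHP sketch mimics the rule as we understand it: the result is 0 when the value is below the first boundary, 1 when it sits between the first and second, and so on. Both function names are hypothetical and exist only for this illustration; the real work happens inside Manticore Search.

```php
<?php
declare(strict_types=1);

/**
 * Illustration of how INTERVAL(value, p1, p2, ...) picks a bucket:
 * 0 when value < p1, 1 when p1 <= value < p2, ..., N when value >= pN.
 * Boundaries must be in ascending order.
 */
function intervalIndex(int $value, array $points): int
{
    $index = 0;
    foreach ($points as $point) {
        if ($value < $point) {
            break;
        }
        $index++;
    }
    return $index;
}

/** Human-readable label for every bucket (hypothetical helper). */
function rangeLabels(array $points): array
{
    $labels = ['< ' . $points[0]];
    for ($i = 1, $n = count($points); $i < $n; $i++) {
        $labels[] = $points[$i - 1] . ' to ' . ($points[$i] - 1);
    }
    $labels[] = '>= ' . $points[count($points) - 1];
    return $labels;
}
```

With the boundaries 5 and 10, an issue with 7 comments lands in bucket 1, and the labels line up with the facet keys, so the UI can print a count next to each range.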
Step 2: Applying filters
The next step involves filtering the results. This is accomplished by employing the gt (greater than) filter combined with an or condition. Below is a simplified representation of the code:
$search->filter('comments', 'gt', 0, Search::FILTER_AND);
$search->filter('comments', 'lte', 3, Search::FILTER_OR);
You can inspect the full filtering code in Manticore.php in the project repository.
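One piece connects the two steps: turning the range a user clicked on back into filter bounds. The sketch below shows one way to do that, assuming the same ascending list of boundaries used for the facet. The function name and the returned array shape are hypothetical, not taken from the demo.

```php
<?php
declare(strict_types=1);

/**
 * Translate a selected bucket index into filter bounds.
 * Returns ['gte' => int|null, 'lt' => int|null]; null means "no bound".
 */
function bucketBounds(int $bucket, array $points): array
{
    $count = count($points);
    if ($bucket < 0 || $bucket > $count) {
        throw new InvalidArgumentException("Unknown bucket: $bucket");
    }
    return [
        // Lower bound is the boundary just below the bucket, if any
        'gte' => $bucket > 0 ? $points[$bucket - 1] : null,
        // Upper bound is the boundary just above the bucket, if any
        'lt'  => $bucket < $count ? $points[$bucket] : null,
    ];
}
```

Each non-null bound can then be passed to a filter call on the comments attribute, in the same way as the gt and lte calls shown above.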
Sorting by reactions
When conducting a search on GitHub, you might notice that it doesn't display or allow filtering by reactions. However, there are times when identifying the most reacted-to issues can be particularly insightful — for instance, to gauge the most desired features or anticipate upcoming concerns. This is where sorting by reactions becomes invaluable.
To begin with, we need to capture the reaction data. The GitHub API conveniently provides this in the form of a simple JSON object:
{
"url": "https://api.github.com/repos/ClickHouse/ClickHouse/issues/35407/reactions",
"total_count": 0,
"+1": 0,
"-1": 0,
"laugh": 0,
"hooray": 0,
"confused": 0,
"heart": 0,
"rocket": 0,
"eyes": 0
}
This is excellent news because Manticore Search offers native JSON support!
Next, we must consider our requirement for sorting. Do we need to sort by individual JSON fields or by the sum of multiple fields? Fortunately, Manticore Search enables us to do both. It perfectly aligns with our needs! We can directly store the JSON in the table and employ the following code snippet to enable sorting:
$search->expression(
'positive_reactions',
'integer(reactions.`+1`) + integer(reactions.hooray) + integer(reactions.heart) + integer(reactions.rocket)'
);
For a comprehensive view of the sorting implementation, refer to the full code snippet here: Manticore PHP Client Sorting Example
As demonstrated, we utilize the expression function of the Manticore PHP client to access JSON fields using the . notation. This approach eliminates the need for caching counters or performing additional calculations. You can create a JSON field, access it with expressions, maintain high speed, and avoid the overhead of caching mechanisms!
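If you want to sort by different combinations of reactions, writing each expression by hand gets tedious. Here is a small sketch of a helper that builds the expression string from a list of reaction keys. The helper is hypothetical and not part of the demo or the PHP client; it only produces the string you would pass to the expression call, quoting keys such as +1 with backticks as in the snippet above.

```php
<?php
declare(strict_types=1);

/**
 * Build a sum-of-JSON-fields expression, for example:
 * integer(reactions.`+1`) + integer(reactions.heart)
 * Keys that are not plain identifiers are wrapped in backticks.
 */
function sumJsonFieldsExpression(string $attribute, array $keys): string
{
    $terms = array_map(static function (string $key) use ($attribute): string {
        $isPlain = preg_match('/^[A-Za-z_][A-Za-z0-9_]*$/', $key) === 1;
        $name = $isPlain ? $key : "`{$key}`";
        return "integer({$attribute}.{$name})";
    }, $keys);

    return implode(' + ', $terms);
}

// Same expression as in the snippet above, built from a list of keys
$positive = sumJsonFieldsExpression('reactions', ['+1', 'hooray', 'heart', 'rocket']);
```

Keeping the list of keys in one place makes it easy to add, say, a negative_reactions expression later without touching the query-building code.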
Faceted search
Searching and filtering capabilities are essential components of any robust search functionality. However, a common challenge arises when dealing with the speed of obtaining counts. It's widely acknowledged that achieving rapid count operations in MySQL necessitates the use of indexes. These indexes not only expand the database size but also add complexity to heavily loaded applications, which often resort to caching and subsequently adjusting these counts as necessary.
The good news is that Manticore Search sidesteps these issues entirely! With Manticore Search, retrieving counts from the database is both straightforward and swift, eliminating the need for additional caching layers.
To display real-time counts that reflect the filters applied on a page, we utilize the same filters used for the search. However, we introduce an extra query for facets, which takes just a few milliseconds. This approach allows us to obtain current counts for specified groups with virtually no overhead. Below is a concise PHP code snippet demonstrating how to accomplish this:
$facets = $search
->limit(0) // We're only interested in counts, hence no results needed
->filter('repo_id', $repoId) // Filter by repository ID
->expression('open', 'if(closed_at=0,1,0)') // Evaluate whether issues are open
->facet('open', 'counters', 2) // Get facet counts for open and closed issues
->get() // Execute the search query and retrieve the results
->getFacets(); // Extract the facets data from the results
Let's break it down: we set the limit to zero because our goal is to obtain counters, not search results. We filter by the repository ID and apply an expression on the closed_at field that yields 1 for open issues and 0 for closed ones. Grouping by this expression provides us with counters for both open and closed issues.
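Once the facets come back, they still need to be turned into numbers the page can display. The sketch below assumes the facet data is shaped like the aggregations part of a Manticore JSON response, that is, a list of buckets with key and doc_count under the facet alias (counters here). That shape and both function names are assumptions for illustration, so check them against what getFacets() returns in your client version.

```php
<?php
declare(strict_types=1);

/**
 * Turn one facet's buckets into a key => count map.
 * Assumed shape: ['buckets' => [['key' => ..., 'doc_count' => ...], ...]]
 */
function facetCounts(array $facet): array
{
    $counts = [];
    foreach ($facet['buckets'] ?? [] as $bucket) {
        $counts[(int) $bucket['key']] = (int) $bucket['doc_count'];
    }
    return $counts;
}

/** Map the 'open' expression values (1 = open, 0 = closed) to labels. */
function openClosedCounts(array $facets): array
{
    $counts = facetCounts($facets['counters'] ?? []);
    return [
        'open'   => $counts[1] ?? 0,
        'closed' => $counts[0] ?? 0,
    ];
}
```

A missing bucket simply becomes a zero, so a repository with no closed issues still renders both counters.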
For those interested in the full implementation, the complete code snippet is available on GitHub: Manticore GitHub Issue Search - Manticore.php
With Manticore Search, the challenge of efficiently obtaining counts is addressed with an almost out-of-the-box solution. What could be more efficient and user-friendly? 😊
Conclusion and further plans
In developing our demo project, we aimed to showcase the capabilities and efficiency of Manticore Search. The result has not only met our expectations but also given us a tool that improves the way we navigate our GitHub repositories. Through this initiative, we've been able to demonstrate the potential of Manticore Search and have added a number of improvements and features on top of what GitHub currently offers:
- We've achieved search speeds that are noticeably faster, with searches typically completed in about 5-10ms, compared to GitHub's search times of over 200ms.
- Our demo project allows for the inclusion of comments within search results, providing a broader scope of information than what is currently available on GitHub.
- We've introduced the ability for users to sort issues based on the number of reactions, offering an additional dimension of user interaction.
- Advanced filtering options are available, allowing for more precise searches, such as displaying issues within a specific range of comments or focusing searches exclusively within comments.
We encourage you to explore these enhancements by visiting: https://github.manticoresearch.com
Additionally, for those interested in the open-source code or in running the project locally, it is accessible here: https://github.com/manticoresoftware/manticore-github-issue-search
We're also excited to announce plans to incorporate vector search (available in Manticore dev packages, preparing for release) into our demo. Combined with full-text search, this upcoming feature aims to further refine the quality of results and to showcase how new capabilities in Manticore can enhance search functionality. Stay tuned and follow us on Twitter.
We welcome your feedback on this practical demonstration of Manticore Search's features and capabilities, whether through issues or discussions, and look forward to sharing more updates with you.