Handling synonyms in Neo4j Full Text Search

ikwattro

Christophe Willemsen

Posted on November 11, 2019

So you have followed the Deep Dive into Neo4j's Full Text Search tutorial, even learned how to create custom analyzers, and finally watched the Full Text Search tips and tricks talk at the Neo4j Nodes19 online conference?

Still, searching for boat does not yield results containing yacht or ship, and you're wondering how to make your search engine a bit more relevant for your users?

Look no further, you'll learn how to do it right now!

Synonyms

A synonym is a word or phrase that means exactly or nearly the same as another word or phrase.

Why synonyms?

It's all about recall! In other words, to help your users find the content they're interested in without them having to know specific terms.

A user searching for coffee should probably be seeing results containing latte macchiato, espresso or even ristretto.

Lists of synonyms

You can find third-party word lists for synonyms, such as WordNet or ConceptNet5. However, appropriate word lists depend on the domain, application and use case, and the best fit is generally a self-curated synonyms list.

How to use them?

The first thing to do is to create a word list in the following format, one group of equivalent terms per line:

coffee,latte macchiato,espresso,ristretto
boat,yacht,sailing vessel,ship
fts,full text search, fulltext search
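To make the format concrete, here is a minimal plain-Java sketch that reads such lines into groups of equivalent terms. It is not Lucene's synonym parser: the class name SynonymListSketch and its parse method are made up for illustration, and it assumes each line is one comma-separated group with whitespace around terms ignored.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only: a hypothetical helper, not Lucene's synonym parser.
final class SynonymListSketch {

    // Maps every term of a line to the whole group of terms on that line.
    static Map<String, Set<String>> parse(List<String> lines) {
        Map<String, Set<String>> groups = new LinkedHashMap<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            Set<String> group = new LinkedHashSet<>();
            for (String term : line.split(",")) {
                group.add(term.trim()); // " fulltext search" becomes "fulltext search"
            }
            for (String term : group) {
                groups.put(term, group);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> groups = parse(Arrays.asList(
                "coffee,latte macchiato,espresso,ristretto",
                "boat,yacht,sailing vessel,ship",
                "fts,full text search, fulltext search"));
        // prints [boat, yacht, sailing vessel, ship]
        System.out.println(groups.get("yacht"));
    }
}
```

Looking up any term of a line returns the whole line, which is the behaviour you want from an equivalence list: boat, yacht, sailing vessel and ship all lead to the same group.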

The next step is to create a custom analyzer using the synonym filter. Since we're using an analyzer, the first question that might come to mind is:

Do I have to reindex all the documents when my synonyms list changes?

The answer is yes, because the synonyms are applied at index time. The alternative, a query-time synonym filter, is very bad(TM) for the following reasons:

  • The QueryParser tokenizes before giving the text to the analyzer, so if a user searches for sailing vessel, the analyzer will be given the words sailing and vessel separately, and will not know they match a synonym

  • Multi-word synonyms will also not work in phrase queries

  • The IDF of rare synonyms will be boosted, which skews the scoring in unintuitive ways

More information can be found in the Solr documentation.
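The first point can be illustrated with a small plain-Java sketch. It does not use Lucene: the class and method names are hypothetical, and splitting on whitespace is only a stand-in for what the query parser does before handing text to the analyzer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Illustration only: mimics the order of operations, not Lucene's classes.
final class QueryTimeSynonymSketch {

    // One synonym group from the word list above.
    static final Set<String> BOAT_GROUP =
            new HashSet<>(Arrays.asList("boat", "yacht", "sailing vessel", "ship"));

    // Query time: the text is split first, so each chunk is looked up alone.
    static boolean matchesWhenSplitFirst(String query) {
        for (String chunk : query.trim().split("\\s+")) {
            if (BOAT_GROUP.contains(chunk)) {
                return true;
            }
        }
        return false;
    }

    // Index time: the filter sees the whole token stream and can join
    // consecutive tokens into a multi-word entry.
    static boolean matchesWhenSeenWhole(String text) {
        String[] tokens = text.trim().split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            StringBuilder phrase = new StringBuilder();
            for (int j = i; j < tokens.length; j++) {
                if (j > i) {
                    phrase.append(' ');
                }
                phrase.append(tokens[j]);
                if (BOAT_GROUP.contains(phrase.toString())) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matchesWhenSplitFirst("sailing vessel")); // prints false
        System.out.println(matchesWhenSeenWhole("sailing vessel"));  // prints true
    }
}
```

Single-word synonyms such as yacht survive the split, which is why the problem only shows up once the list contains multi-word entries.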

Let's create our custom analyzer for synonyms then:

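// Package names below match Neo4j 3.5 and its bundled Lucene; adjust them for other versions.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.LowerCaseFilterFactory;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
import org.apache.lucene.analysis.standard.StandardFilterFactory;
import org.apache.lucene.analysis.standard.StandardTokenizerFactory;
import org.apache.lucene.analysis.synonym.SynonymFilterFactory;
import org.neo4j.graphdb.index.fulltext.AnalyzerProvider;
import org.neo4j.helpers.Service;

@Service.Implementation(AnalyzerProvider.class)
public class SynonymAnalyzer extends AnalyzerProvider {

    // Name to pass as the 'analyzer' option when creating the index
    public static final String ANALYZER_NAME = "synonym-custom";

    public SynonymAnalyzer() {
        super(ANALYZER_NAME);
    }

    @Override
    public Analyzer createAnalyzer() {
        try {
            String synFile = "synonyms.txt";
            // Filter order matters: synonyms are expanded before lowercasing
            Analyzer analyzer = CustomAnalyzer.builder()
                    .withTokenizer(StandardTokenizerFactory.class)
                    .addTokenFilter(StandardFilterFactory.class)
                    .addTokenFilter(SynonymFilterFactory.class, "synonyms", synFile)
                    .addTokenFilter(LowerCaseFilterFactory.class)
                    .build();

            return analyzer;
        } catch (Exception e) {
            throw new RuntimeException("Unable to create analyzer", e);
        }
    }

    @Override
    public String description() {
        return "The default, standard analyzer with a synonyms file. This is an example analyzer for educational purposes.";
    }
}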

A very important note: the LowerCaseFilter comes after the SynonymFilter. In some use cases the reverse order causes synonyms not to be recognized, for example with the following list:

GB,gigabyte

If the lowercase filter is applied before the synonym filter, GB becomes gb and no longer matches the GB entry in the list.
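The effect of the filter order can be reproduced with a few lines of plain Java. This is a simplified sketch with hypothetical names: the lookup is case-sensitive, and it replaces the token with its synonym instead of emitting both, which is enough to show the ordering problem.

```java
import java.util.Collections;
import java.util.Locale;
import java.util.Map;

// Illustration only: two tiny "filter chains" applied in opposite orders.
final class FilterOrderSketch {

    // Case-sensitive synonym table with the single entry from the list above.
    static final Map<String, String> SYNONYMS = Collections.singletonMap("GB", "gigabyte");

    // Same order as the analyzer in this post: synonyms first, lowercase second.
    static String synonymThenLowercase(String token) {
        String expanded = SYNONYMS.getOrDefault(token, token);
        return expanded.toLowerCase(Locale.ROOT);
    }

    // Reverse order: the token is already "gb" when the synonym lookup runs.
    static String lowercaseThenSynonym(String token) {
        String lowered = token.toLowerCase(Locale.ROOT);
        return SYNONYMS.getOrDefault(lowered, lowered);
    }

    public static void main(String[] args) {
        System.out.println(synonymThenLowercase("GB")); // prints gigabyte
        System.out.println(lowercaseThenSynonym("GB")); // prints gb
    }
}
```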

Create a synonyms.txt file with your synonyms list in the conf/ directory of your Neo4j instance:

conf/synonyms.txt

coffee,latte macchiato,espresso,ristretto
boat,yacht,sailing vessel,ship
fts,full text search, fulltext search

Build your analyzer jar, put it in the plugins directory of Neo4j, and restart the database if it is already running.

Create the Index

CALL db.index.fulltext.createNodeIndex(
  'syndemo', 
  ['Article'], 
  ['text'], 
  {analyzer:'synonym-custom'}
)

Create an Article node with some text:

CREATE (n:Article {text: "This is an article about Full Text Search and Neo4j, let's go !"})

Query the index:

CALL db.index.fulltext.queryNodes('syndemo', 'fts')
╒══════════════════════════════════════════════════════════════════════╤══════════════════╕
│"node"                                                                │"score"           │
╞══════════════════════════════════════════════════════════════════════╪══════════════════╡
│{"text":"This is an article about Full Text Search and Neo4j, let's go│1.2616268396377563│
│ !"}                                                                  │                  │
└──────────────────────────────────────────────────────────────────────┴──────────────────┘

Similarly, a search for fulltext will return the same result. But let's get fancy, er, fuzzy, and try a prefix query:

CALL db.index.fulltext.queryNodes('syndemo', 'fullt*')

No results, no records

Prefix and synonyms?

There is one limitation: prefix, fuzzy and similar queries do not use the analyzer; they produce term or multi-term queries instead.

But there is a trick you can use: add an NGramFilter to your analyzer and use a phrase query. That way, fts and its synonyms have their n-grams tokenized and stored in the index, where the phrase query can retrieve them:

Analyzer analyzer = CustomAnalyzer.builder()
        //...
        .addTokenFilter(NGramFilterFactory.class, "minGramSize", "3", "maxGramSize", "5")
        .build();

return analyzer;

The NGramFilter will break each token into n-grams of the given sizes, here min 3 and max 5. So for the following input:

fulltext search

The index will contain n-grams such as ful, full, fullt, ull, ullt, ullte, llt, llte, lltex, lte, ltex, ltext, and so on through the rest of fulltext and then search.

You can also use the EdgeNGramFilter, which will produce n-grams only from the beginning of the token. For the token fulltext from the example above, the n-grams will be ful, full, fullt.
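If you want to see exactly which n-grams a token yields, the following plain-Java sketch generates them. It only imitates what the Lucene filters produce for a single token; the class and method names are hypothetical and no Lucene classes are involved.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: generates n-grams for one token, without Lucene.
final class NGramSketch {

    // All n-grams with a length between min and max, grouped by start offset.
    static List<String> ngrams(String token, int min, int max) {
        List<String> grams = new ArrayList<>();
        for (int start = 0; start < token.length(); start++) {
            for (int len = min; len <= max && start + len <= token.length(); len++) {
                grams.add(token.substring(start, start + len));
            }
        }
        return grams;
    }

    // Only the n-grams anchored at the beginning of the token.
    static List<String> edgeNgrams(String token, int min, int max) {
        List<String> grams = new ArrayList<>();
        for (int len = min; len <= max && len <= token.length(); len++) {
            grams.add(token.substring(0, len));
        }
        return grams;
    }

    public static void main(String[] args) {
        // 15 n-grams, starting with ful, full, fullt, ull, ullt, ullte
        System.out.println(ngrams("fulltext", 3, 5));
        // prints [ful, full, fullt]
        System.out.println(edgeNgrams("fulltext", 3, 5));
    }
}
```

The full n-gram variant stores five times as many terms as the edge variant for this single token, so the edge variant is worth considering when you only need prefix-style matching.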

Redeploy your plugin, restart the database, drop and recreate the index, and now:

CALL db.index.fulltext.queryNodes('syndemo', '"fullt*"')

╒══════════════════════════════════════════════════════════════════════╤═══════════════════╕
│"node"                                                                │"score"            │
╞══════════════════════════════════════════════════════════════════════╪═══════════════════╡
│{"text":"This is an article about Full Text Search and Neo4j, let's go│0.04872262850403786│
│ !"}                                                                  │                   │
└──────────────────────────────────────────────────────────────────────┴───────────────────┘

To finish, let's try another phrase query, this time with a slop of 2:

CALL db.index.fulltext.queryNodes('syndemo', '"article fullte*"~2')

╒══════════════════════════════════════════════════════════════════════╤══════════════════╕
│"node"                                                                │"score"           │
╞══════════════════════════════════════════════════════════════════════╪══════════════════╡
│{"text":"This is an article about Full Text Search and Neo4j, let's go│2.3429081439971924│
│ !"}                                                                  │                  │
└──────────────────────────────────────────────────────────────────────┴──────────────────┘

Conclusion

Synonyms are a valuable asset when building search engines, offering better recall and thus a better user experience.

You can find all the code from this blog post in this example GitHub repository.
