Mike learning: REPLACE, UPDATE, wordforms

anstalf

Mike

Posted on June 6, 2024

Mike learning: REPLACE, UPDATE, wordforms

About me

Image description
Hello, it's Mike again.
I recently started working at Manticore as a Developer Advocate. I am someone not completely distant from IT, but I'm catching up with modern technologies. In this blog, I'll share my experiences and what I learn about Manticore. I plan to document my journey in a diary format, explaining what Manticore is and how to use it. Let's discover how things work together, identify issues, and engage with developers in real time.
This is my second blog post. If you are interested in learning about Manticore with me, I will keep you updated in:

Step two: REPLACE, UPDATE, wordforms

While I was writing my first article, my friends from the pet store reached out to me. They want to use the database for their store, and since their salesperson is still getting confused with leashes and collars, I want to implement additional features of the database.

This article serves as a continuation of the previous one and is based on it.

Unlike most databases, Manticore uses an advanced model to recognize the texts it stores. The text processing system is based on NLP (Natural Language Processing) solutions.

A little description of the technology for a general understanding will not hurt here. NLP is designed to recognize the “natural” language in which we communicate. At first glance, it seems that there is nothing complicated in text recognition, technically it has become so with the advent of text processing solutions using machine learning algorithms. In our case, we will not go so deep and will use ready-made text processing solutions from Manticore Search, which are already built into the database. This system uses tokenization (separation of text) into small separate parts: by sentences and by individual words, which allows you to quickly find the desired words, phrases, and paragraphs as a whole in the database. (For more info on data tokenization, check out this link.)

Here are a few words about the technologies Manticore exploits:

  • Stemming is the process of reducing words to their root form. For example, "walking," "walks," and "walked" are all stems of the word "walk."
  • Lemmatization is the process of changing different forms of a word back to its basic form, called a lemma. For instance, the word "eat" can appear as "eating," "eats," and "ate." The lemma for all these variations is "eat," which is its basic dictionary form.
  • To improve the accuracy and quality of advanced search, several other solutions are involved in the work: word forms, exceptions, and stop words.

The lemmatizer and stemmer perform one common function of normalizing words to the same form, but in different ways, each of which has its advantages and disadvantages.

Also, stopword files with lists of common words for the chosen language, like articles, conjunctions, and exclamations, help speed up the process. Basically, all those little words that make our language sound nice but don't really mean much to a computer.

If the standard set of functions is not enough for the comfortable operation of the search engine, for example, when the database has professional jargon or local slang, and there are logical connections between words with additional semantic meaning, you can use an additional word form files. The database administrator can add links between words in the file that differ in terms of definition rules but are similar in context to this database. For example, a leash and a harnesses. In different contexts, these two words may have the same meaning or be completely different.

Using a word forms file in a table

New tables

The guys from the pet store mentioned that their seller is not very knowledgeable about leashes and harnesses. They suggest that when inquiring about a leash, you should also consider asking about harnesses, and vice versa.

Since leashes and harnesses fall under the same product category and there is no specific field in the database to indicate their group, the use of a word form file can be beneficial. This file can help by allowing the addition of related words when searching for "leash." For instance, by adding words like "harness" or "flexi" to the word form dictionary, searching for "leash" will also yield results for "harness" and "flexi."

Let's examine the products available at this pet store:

Title Description Price Availability
Canvas leash, green, up to 50 kg, 5m Sturdy canvas dog leash for 5 meters, suitable for large dog breeds 5.00€ yes
Elastic leash, pink, up to 10 kg, 3m A beautiful leash for sophisticated ladies and their four-legged companions 12.00€ no
Pink harness, up to 10 kg For room guards released into the street and mistakenly called a dog  8.00€ yes
The flexi retractable dog leash, 10 kg, 5m A flexi for dogs up to 10 kg. The length is 5 meters, taking into account the length of the owner's pulled out arm 7.50€ yes
Dog food, 1kg Dry food for your pet 4.30€ no
Cat food, 1kg If your cat is yelling loudly and demanding food!  2.80€ yes
Flea collar for cats Cats shouldn't be flea carriers. 23.20€ yes
Flea drops for dogs up to 10 kg Drops from uninvited passengers on the skin of your defender 14.30€ yes

We will make the Title field just a string, use the Description for full-text search, set the Price as a floating-point number, and Availability as a Boolean value.

To build a logical word connection between "leashes," "harness," and "flexi," we will place them in a file in /tmp/wordforms/ and ensure it is shared with all users on the system.

mkdir /tmp/wordforms
cd /tmp/wordforms
mcedit wf_pet_products.txt
Enter fullscreen mode Exit fullscreen mode

The built-in editor in MC is called mcedit. To execute it, type mcedit <file name>.

Let's add our wordforms there:

flexi > leash
harness > leash
Enter fullscreen mode Exit fullscreen mode

If you do everything right, you'll see something like this in your terminal:
Image description
F2 Save
Enter Confirm
F10 Exit

Now we have a file with words that we'll use later with the new table. Remember or note down where it's saved:

/tmp/wordforms/wf_pet_products.txt
Enter fullscreen mode Exit fullscreen mode

Now, to make more changes to the database, we have to connect to it and set up a table with our columns. Then, we need to link a stemmer and the wordforms file that we made earlier:

mysql -h0 -P9306
Enter fullscreen mode Exit fullscreen mode
CREATE TABLE products (name STRING, info TEXT, price FLOAT, avl BOOL) morphology = 'stem_en' wordforms='/tmp/wordforms/wf_pet_products.txt';
Enter fullscreen mode Exit fullscreen mode

Now let's check which fields have been created in our table:

DESC products;
Enter fullscreen mode Exit fullscreen mode

Image description
All the fields that we specified are present, but an additional field has emerged - id. This field is intended for Manticore to uniquely identify a document in the database, so it is automatically created when tables are initialized, irrespective of specification. The info field has the properties indexed stored, indicating its involvement in the full-text search process. It is important to note that the order of the fields differs from the sequence specified when creating the table. Therefore, when populating the table, this must be considered, especially when updating entire rows without specifying the field sequence in the command. For instance, in the REPLACE command, which we will discuss shortly.

Next, we should verify the table's general parameters, such as the word form file and the previously connected stemmer. When creating a table, if there is an error in the wordforms file name, the system will disregard it and not produce any errors or warnings.

SHOW TABLE products SETTINGS;
Enter fullscreen mode Exit fullscreen mode

Image description
Above, you may notice that although we specified the path to the file in /tmp/wordforms/, Manticore saved it in /var/lib/manticore/products/. This indicates that Manticore has attached a copy of the file to the created table. The file is now fixed for indexing this table. Replacing the word form file on the fly is not available by design, as it would require reindexing the entire table, which may not be desirable if you have a very large table. We will discuss how to safely and predictably replace such a file later.

This completes the table setup, and the next step is to enter the required data. Data entry is done similarly to any SQL database.

INSERT INTO products (name, price, avl) VALUES ('Canvas leash, green, up to 50 kg, 5m', 5.0, 1);
INSERT INTO products (name, price, avl, info) VALUES ('Elastic leash, pink, up to 10 kg, 3m', 12.00, 0, 'A beautiful leash for sophisticated ladies and their four-legged companions');
INSERT INTO products (name, price, avl, info) VALUES ('Pink harness, up to 10 kg', 8.00, 1, 'For room guards released into the street and mistakenly called a dog');
INSERT INTO products (name, price, avl, info) VALUES ('The flexi retractable dog leash, 10 kg, 5m', 7.50, 1, 'A flexi for dogs up to 10 kg. The length is 5 meters, taking into account the length of the owner\'s pulled out arm');
INSERT INTO products (name, price, avl, info) VALUES ('Dog food, 1kg', 4.30, 0, 'Dry food for your pet');
INSERT INTO products (name, price, avl, info) VALUES ('Cat food, 1kg', 2.80, 1, 'If your cat is yelling loudly and demanding food!');
INSERT INTO products (name, price, avl, info) VALUES ('Flea collar for cats , 1kg', 23.20, 1, 'Cats shouldn\'t be flea carriers.');
INSERT INTO products (name, price, avl, info) VALUES ('Flea drops for dogs up to 10 kg', 14.30, 1, 'Drops from uninvited passengers on the skin of your defender');
Enter fullscreen mode Exit fullscreen mode

Be careful with apostrophes; there are abbreviations in the text that use them, and it is important to isolate them from the rest of the text with \: 'Isn\'t it'. Manticore does not support double quotes for strings, which would be useful to avoid escaping single apostrophes.

In the first request above, the info field was intentionally omitted to demonstrate how to update a full-text field. It's important to note that updating text fields and attribute fields is handled differently. Full-text fields are updated with the REPLACE command, which triggers reindexing of the new values, while the UPDATE command is sufficient for other fields. This is because attribute fields are not involved in the full-text search indexing process.

Now, let's use the REPLACE command to add data to certain fields of existing records. To use this command, you will need the unique ID of the row where we want to make changes or add information. First, we'll get the necessary data by using SELECT * FROM products;

SELECT * FROM products WHERE name = 'Canvas leash, green, up to 50 kg, 5m';
Enter fullscreen mode Exit fullscreen mode

Image description
Make sure to remember where the fields are located. We'll need this information later. Also, we need all the current data in the fields because the replace command will update the entire file with this data. If we don't specify all the data, the fields not specified will be reset. Fill in the REPLACE command based on the field locations.

REPLACE INTO products VALUES (<id>, <info>, <avl>, <price>, <name>);
Enter fullscreen mode Exit fullscreen mode

In the development version, which may already be released by the time you read this, it is possible to replace specific fields. For more details, refer to the documentation.

REPLACE INTO products VALUES (8217224102746783745, 'Sturdy canvas dog leash for 5 meters, suitable for large dog breeds', 1, 5.0, 'Canvas leash, green, up to 50 kg, 5m');
Enter fullscreen mode Exit fullscreen mode

Be careful with the code here, if you simply copy commands from the article, the ID field value in your table will be different! If the Manticore does not find a record with the specified ID, a new record will be created.

To update the "attributes" fields, you can use the UPDATE command. Just as a side note, while entering the data, the price dropped slightly on the green leash:

update products set price = 4.6 where id = 8217224102746783745;
Enter fullscreen mode Exit fullscreen mode

Let's check the result:

SELECT * FROM products WHERE id = 8217224102746783745;
Enter fullscreen mode Exit fullscreen mode

Image description
We have achieved the desired result, now let's try the search, because we are creating a database of products for the store, to make it easier for the seller to find the product. The command used to search in the table is select * from <table> where match('<query>').

SELECT * FROM products WHERE match('harness');
SELECT * FROM products WHERE match ('leash');
Enter fullscreen mode Exit fullscreen mode

Image description
Great, now the database is providing answers to our queries, using the connection we created through the word form file. But it seems like something is missing in the output? Where is the entry for a pink leash for indoor guards?
In the info field, there are no words like leash or collar, they only appear in the name field, so that entry didn't make it into the output. Let's fix that:

SELECT * FROM products WHERE name = 'Pink harness, up to 10 kg';
Enter fullscreen mode Exit fullscreen mode

Image description

REPLACE INTO products VALUES (8217224102746783747,'Harness for room guards released into the street and mistakenly called a dog', 1, 8.0, 'Pink harness, up to 10 kg');
Enter fullscreen mode Exit fullscreen mode

Let's check what happened:

SELECT * FROM products WHERE match('harness');
Enter fullscreen mode Exit fullscreen mode

Image description
Now this record is included in the output. As you can see from the example, only the information related to indexed fields is involved in the search; the rest of the fields are attributes for indexed fields.

Extending the table

At the pet store they brought in new stuff, now they have aquarium equipment too. To make sure we can easily find words related to aquarium stuff, like pump or drive, we need to add entries to our word list.

Title Description Price Availability
The pump for the aquarium. A pump with a built-in filter for the aquarium. Capacity 150 l/h 32.00€ yes
Automatic filter for aquarium Disposable drive with filter, capacity 100 l/h. 28.00€ yes
Fish net Atraumatic aquarium fish net 3.00€ yes

Let's add them to the database:

INSERT INTO products (name, info, price, avl) VALUES ('The pump for the aquarium.', 'Pump with built-in aquarium filter. Capacity 150 l/h', 32, 1), ('Automatic filter for aquarium', 'Disposable drive with filter, capacity 100 l/h', 28.00, 1), ('Fish net', 'Atraumatic aquarium fish net', 3, 1);
Enter fullscreen mode Exit fullscreen mode

Here we did the filling with one command with a comma-separated enumeration of new rows, so you can add a large batch of documents with one command.

Let's check the search:

Let's check the search:

SELECT * FROM products WHERE match ('pump');
SELECT * FROM products WHERE match ('filter');
Enter fullscreen mode Exit fullscreen mode

Image description
Okay, that means something. However, the drive is not considered a pump. This should be added to the list of words we use…
Let's do it.
Oh, there's a little problem here. It's not that easy...
When we create a search system for text, we place all the words into a table and convert them into tokens to enhance the efficiency of the search. These tokens are not updated subsequently to expedite the search process. Nevertheless, there are cases when we need to update the wordforms file and consequently modify the tokens as well. Let's update the list of words in our wordforms file and also include the product name in the name field to the search index.

To update the wordforms file in the table, I followed these steps:

  1. Create a dump of this table using mysqldump.
  2. Update the wordforms file.
  3. Drop the old table.
  4. Create a new table with the updated word forms in the morphology section.
  5. Populate the new table from the dump file.
Creating dump file (backup):

In any unclear situation, make a backup of the table to prevent data loss. (This rule can be displayed on the wall as a reminder).
This also helps us reach our goal.
To do this job, we need to disconnect from the SQL connection and use the mysqldump tool.

exit;
Enter fullscreen mode Exit fullscreen mode
mkdir /tmp/manticore_dumps
cd /tmp/manticore_dumps
mysqldump -h0 -P9306 -tc --compact manticore products > products.sql
Enter fullscreen mode Exit fullscreen mode

The flags I used were:

  • -t - to exclude the create table statement from the dump.
  • -c - to specify column names in the INSERT commands.
  • --compact - to perform a database dump in a compact format, which omits procedures for checking the availability of tables, deleting them, creating new ones, prescribing configurations for created tables, and other organizational actions for deployment. In our case, these procedures were not necessary, as we needed to register a new dictionary file in the table.
  • manticore - is the database name for mysqldump, and must always be manticore when using mysqldump with Manticore.
  • products - is the name of our table for which the dump is being created.
  • products.sql - is the name of the file where the dump will be uploaded. The directory for placing the file defaults to the directory from which mysqldump was launched. I recommend specifying /tmp/ to ensure the examples work successfully further. Alternatively, you can type the full path: /tmp/manticore_dumps/products.sql.

Don't worry if you see: "-- Warning: column statistics not supported by the server." It's normal when you use this mysql tool with Manticore.

Great, we have our first dump!
Image description

Updating wordforms file:

Have you forgotten where the source file for the word forms is located?

cd /tmp/wordforms
Enter fullscreen mode Exit fullscreen mode

If you are using MC and for some reason do not see columns with files, try using the Ctrl + O combination. If nothing changes, perhaps MC is not running.
Image description
Next, use the arrow keys to navigate. When you need to select a file, press F4 to open the selected file.
Add new string: "drive > pump"
Image description
Save & close. F2, Enter, F10.
File updated.

Deleting the table and creating a new one:

Before deleting the table, please make sure that you have the dump and that it definitely contains our data.

cd /tmp/manticore_dumps/
Enter fullscreen mode Exit fullscreen mode

Choose the dump file and press the F3 button to view it.
You can do the same in the console using the command cat /tmp/manticore_dumps/products.sql.

You should see something like this:
Image description
OK, let's go ahead and drop our old table and create a new one:

mysql -h0 -P9306
Enter fullscreen mode Exit fullscreen mode
DROP TABLE products;
CREATE TABLE products (name TEXT, info TEXT, price FLOAT, avl BOOL) morphology = 'stem_en' wordforms='/tmp/wordforms/wf_pet_products.txt';
Enter fullscreen mode Exit fullscreen mode

So, we removed this table, and then made it again right away (using the same command as before but with a small change: now the name field is also for text), and now we have to add information to it.

exit;
Enter fullscreen mode Exit fullscreen mode
Repopulating from the dump:
mysql -h0 -P9306 < /tmp/manticore_dumps/products.sql
mysql -h0 -P9306
Enter fullscreen mode Exit fullscreen mode

We filled it out. Let's check what we have in the end:

SELECT * FROM products WHERE match ('pump');
Enter fullscreen mode Exit fullscreen mode

Great, all working!
Image description

When updating a table, especially a large one, there is a period when the old table has been deleted and the new one has not yet been created. During this interval, the system may respond to requests with errors. To ensure smooth operations on Manticore, there are several systems in place that help avoid losing user requests. I will learn how to implement that later.

In this article, I showed how I set up and used Manticore Search for a pet store's inventory. Using wordforms and morphology, Manticore helps improve search results by linking related product names and types. I covered how I added new items, updated existing ones, and ensured data consistency during major updates like changing wordforms files. This helps beginners understand and apply Manticore Search's features effectively, making data searches more efficient. Future posts will explore more features, so let's keep experimenting with Manticore together to enhance your projects.

That's all for today. Mike, signing off.
Image description

💖 💪 🙅 🚩
anstalf
Mike

Posted on June 6, 2024

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related