Mike learning: REPLACE, UPDATE, wordforms
Mike
Posted on June 6, 2024
About me
Hello, it's Mike again.
I recently started working at Manticore as a Developer Advocate. I am someone not completely distant from IT, but I'm catching up with modern technologies. In this blog, I'll share my experiences and what I learn about Manticore. I plan to document my journey in a diary format, explaining what Manticore is and how to use it. Let's discover how things work together, identify issues, and engage with developers in real time.
This is my second blog post. If you are interested in learning about Manticore with me, I will keep you updated in:
Step two: REPLACE, UPDATE, wordforms
While I was writing my first article, my friends from the pet store reached out to me. They want to use the database for their store, and since their salesperson is still getting confused with leashes and collars, I want to implement additional features of the database.
This article serves as a continuation of the previous one and is based on it.
Unlike most databases, Manticore uses an advanced model to recognize the texts it stores. The text processing system is based on NLP (Natural Language Processing) solutions.
A little description of the technology for a general understanding will not hurt here. NLP is designed to recognize the “natural” language in which we communicate. At first glance, it seems that there is nothing complicated in text recognition, technically it has become so with the advent of text processing solutions using machine learning algorithms. In our case, we will not go so deep and will use ready-made text processing solutions from Manticore Search, which are already built into the database. This system uses tokenization (separation of text) into small separate parts: by sentences and by individual words, which allows you to quickly find the desired words, phrases, and paragraphs as a whole in the database. (For more info on data tokenization, check out this link.)
Here are a few words about the technologies Manticore exploits:
- Stemming is the process of reducing words to their root form. For example, "walking," "walks," and "walked" are all stems of the word "walk."
- Lemmatization is the process of changing different forms of a word back to its basic form, called a lemma. For instance, the word "eat" can appear as "eating," "eats," and "ate." The lemma for all these variations is "eat," which is its basic dictionary form.
- To improve the accuracy and quality of advanced search, several other solutions are involved in the work: word forms, exceptions, and stop words.
The lemmatizer and stemmer perform one common function of normalizing words to the same form, but in different ways, each of which has its advantages and disadvantages.
Also, stopword files with lists of common words for the chosen language, like articles, conjunctions, and exclamations, help speed up the process. Basically, all those little words that make our language sound nice but don't really mean much to a computer.
If the standard set of functions is not enough for the comfortable operation of the search engine, for example, when the database has professional jargon or local slang, and there are logical connections between words with additional semantic meaning, you can use an additional word form files. The database administrator can add links between words in the file that differ in terms of definition rules but are similar in context to this database. For example, a leash and a harnesses. In different contexts, these two words may have the same meaning or be completely different.
Using a word forms file in a table
New tables
The guys from the pet store mentioned that their seller is not very knowledgeable about leashes and harnesses. They suggest that when inquiring about a leash, you should also consider asking about harnesses, and vice versa.
Since leashes and harnesses fall under the same product category and there is no specific field in the database to indicate their group, the use of a word form file can be beneficial. This file can help by allowing the addition of related words when searching for "leash." For instance, by adding words like "harness" or "flexi" to the word form dictionary, searching for "leash" will also yield results for "harness" and "flexi."
Let's examine the products available at this pet store:
Title | Description | Price | Availability |
---|---|---|---|
Canvas leash, green, up to 50 kg, 5m | Sturdy canvas dog leash for 5 meters, suitable for large dog breeds | 5.00€ | yes |
Elastic leash, pink, up to 10 kg, 3m | A beautiful leash for sophisticated ladies and their four-legged companions | 12.00€ | no |
Pink harness, up to 10 kg | For room guards released into the street and mistakenly called a dog | 8.00€ | yes |
The flexi retractable dog leash, 10 kg, 5m | A flexi for dogs up to 10 kg. The length is 5 meters, taking into account the length of the owner's pulled out arm | 7.50€ | yes |
Dog food, 1kg | Dry food for your pet | 4.30€ | no |
Cat food, 1kg | If your cat is yelling loudly and demanding food! | 2.80€ | yes |
Flea collar for cats | Cats shouldn't be flea carriers. | 23.20€ | yes |
Flea drops for dogs up to 10 kg | Drops from uninvited passengers on the skin of your defender | 14.30€ | yes |
We will make the Title
field just a string, use the Description
for full-text search, set the Price
as a floating-point number, and Availability
as a Boolean value.
To build a logical word connection between "leashes," "harness," and "flexi," we will place them in a file in /tmp/wordforms/
and ensure it is shared with all users on the system.
mkdir /tmp/wordforms
cd /tmp/wordforms
mcedit wf_pet_products.txt
The built-in editor in MC is called mcedit
. To execute it, type mcedit <file name>
.
Let's add our wordforms there:
flexi > leash
harness > leash
If you do everything right, you'll see something like this in your terminal:
F2 Save
Enter Confirm
F10 Exit
Now we have a file with words that we'll use later with the new table. Remember or note down where it's saved:
/tmp/wordforms/wf_pet_products.txt
Now, to make more changes to the database, we have to connect to it and set up a table with our columns. Then, we need to link a stemmer and the wordforms file that we made earlier:
mysql -h0 -P9306
CREATE TABLE products (name STRING, info TEXT, price FLOAT, avl BOOL) morphology = 'stem_en' wordforms='/tmp/wordforms/wf_pet_products.txt';
Now let's check which fields have been created in our table:
DESC products;
All the fields that we specified are present, but an additional field has emerged - id
. This field is intended for Manticore to uniquely identify a document in the database, so it is automatically created when tables are initialized, irrespective of specification. The info field has the properties indexed stored
, indicating its involvement in the full-text search process. It is important to note that the order of the fields differs from the sequence specified when creating the table. Therefore, when populating the table, this must be considered, especially when updating entire rows without specifying the field sequence in the command. For instance, in the REPLACE
command, which we will discuss shortly.
Next, we should verify the table's general parameters, such as the word form file and the previously connected stemmer. When creating a table, if there is an error in the wordforms file name, the system will disregard it and not produce any errors or warnings.
SHOW TABLE products SETTINGS;
Above, you may notice that although we specified the path to the file in /tmp/wordforms/
, Manticore saved it in /var/lib/manticore/products/
. This indicates that Manticore has attached a copy of the file to the created table. The file is now fixed for indexing this table. Replacing the word form file on the fly is not available by design, as it would require reindexing the entire table, which may not be desirable if you have a very large table. We will discuss how to safely and predictably replace such a file later.
This completes the table setup, and the next step is to enter the required data. Data entry is done similarly to any SQL database.
INSERT INTO products (name, price, avl) VALUES ('Canvas leash, green, up to 50 kg, 5m', 5.0, 1);
INSERT INTO products (name, price, avl, info) VALUES ('Elastic leash, pink, up to 10 kg, 3m', 12.00, 0, 'A beautiful leash for sophisticated ladies and their four-legged companions');
INSERT INTO products (name, price, avl, info) VALUES ('Pink harness, up to 10 kg', 8.00, 1, 'For room guards released into the street and mistakenly called a dog');
INSERT INTO products (name, price, avl, info) VALUES ('The flexi retractable dog leash, 10 kg, 5m', 7.50, 1, 'A flexi for dogs up to 10 kg. The length is 5 meters, taking into account the length of the owner\'s pulled out arm');
INSERT INTO products (name, price, avl, info) VALUES ('Dog food, 1kg', 4.30, 0, 'Dry food for your pet');
INSERT INTO products (name, price, avl, info) VALUES ('Cat food, 1kg', 2.80, 1, 'If your cat is yelling loudly and demanding food!');
INSERT INTO products (name, price, avl, info) VALUES ('Flea collar for cats , 1kg', 23.20, 1, 'Cats shouldn\'t be flea carriers.');
INSERT INTO products (name, price, avl, info) VALUES ('Flea drops for dogs up to 10 kg', 14.30, 1, 'Drops from uninvited passengers on the skin of your defender');
Be careful with apostrophes; there are abbreviations in the text that use them, and it is important to isolate them from the rest of the text with
\
:'Isn\'t it'
. Manticore does not support double quotes for strings, which would be useful to avoid escaping single apostrophes.
In the first request above, the info
field was intentionally omitted to demonstrate how to update a full-text field. It's important to note that updating text fields and attribute fields is handled differently. Full-text fields are updated with the REPLACE
command, which triggers reindexing of the new values, while the UPDATE
command is sufficient for other fields. This is because attribute fields are not involved in the full-text search indexing process.
Now, let's use the REPLACE
command to add data to certain fields of existing records. To use this command, you will need the unique ID of the row where we want to make changes or add information. First, we'll get the necessary data by using SELECT * FROM products
;
SELECT * FROM products WHERE name = 'Canvas leash, green, up to 50 kg, 5m';
Make sure to remember where the fields are located. We'll need this information later. Also, we need all the current data in the fields because the replace command will update the entire file with this data. If we don't specify all the data, the fields not specified will be reset. Fill in the REPLACE
command based on the field locations.
REPLACE INTO products VALUES (<id>, <info>, <avl>, <price>, <name>);
In the development version, which may already be released by the time you read this, it is possible to replace specific fields. For more details, refer to the documentation.
REPLACE INTO products VALUES (8217224102746783745, 'Sturdy canvas dog leash for 5 meters, suitable for large dog breeds', 1, 5.0, 'Canvas leash, green, up to 50 kg, 5m');
Be careful with the code here, if you simply copy commands from the article, the ID field value in your table will be different! If the Manticore does not find a record with the specified ID, a new record will be created.
To update the "attributes" fields, you can use the UPDATE
command. Just as a side note, while entering the data, the price dropped slightly on the green leash:
update products set price = 4.6 where id = 8217224102746783745;
Let's check the result:
SELECT * FROM products WHERE id = 8217224102746783745;
We have achieved the desired result, now let's try the search, because we are creating a database of products for the store, to make it easier for the seller to find the product. The command used to search in the table is select * from <table> where match('<query>')
.
SELECT * FROM products WHERE match('harness');
SELECT * FROM products WHERE match ('leash');
Great, now the database is providing answers to our queries, using the connection we created through the word form file. But it seems like something is missing in the output? Where is the entry for a pink leash for indoor guards?
In the info
field, there are no words like leash or collar, they only appear in the name
field, so that entry didn't make it into the output. Let's fix that:
SELECT * FROM products WHERE name = 'Pink harness, up to 10 kg';
REPLACE INTO products VALUES (8217224102746783747,'Harness for room guards released into the street and mistakenly called a dog', 1, 8.0, 'Pink harness, up to 10 kg');
Let's check what happened:
SELECT * FROM products WHERE match('harness');
Now this record is included in the output. As you can see from the example, only the information related to indexed fields is involved in the search; the rest of the fields are attributes for indexed fields.
Extending the table
At the pet store they brought in new stuff, now they have aquarium equipment too. To make sure we can easily find words related to aquarium stuff, like pump or drive, we need to add entries to our word list.
Title | Description | Price | Availability |
---|---|---|---|
The pump for the aquarium. | A pump with a built-in filter for the aquarium. Capacity 150 l/h | 32.00€ | yes |
Automatic filter for aquarium | Disposable drive with filter, capacity 100 l/h. | 28.00€ | yes |
Fish net | Atraumatic aquarium fish net | 3.00€ | yes |
Let's add them to the database:
INSERT INTO products (name, info, price, avl) VALUES ('The pump for the aquarium.', 'Pump with built-in aquarium filter. Capacity 150 l/h', 32, 1), ('Automatic filter for aquarium', 'Disposable drive with filter, capacity 100 l/h', 28.00, 1), ('Fish net', 'Atraumatic aquarium fish net', 3, 1);
Here we did the filling with one command with a comma-separated enumeration of new rows, so you can add a large batch of documents with one command.
Let's check the search:
Let's check the search:
SELECT * FROM products WHERE match ('pump');
SELECT * FROM products WHERE match ('filter');
Okay, that means something. However, the drive is not considered a pump. This should be added to the list of words we use…
Let's do it.
Oh, there's a little problem here. It's not that easy...
When we create a search system for text, we place all the words into a table and convert them into tokens to enhance the efficiency of the search. These tokens are not updated subsequently to expedite the search process. Nevertheless, there are cases when we need to update the wordforms file and consequently modify the tokens as well. Let's update the list of words in our wordforms file and also include the product name in the name
field to the search index.
To update the wordforms file in the table, I followed these steps:
- Create a dump of this table using
mysqldump
. - Update the wordforms file.
- Drop the old table.
- Create a new table with the updated word forms in the morphology section.
- Populate the new table from the dump file.
Creating dump file (backup):
In any unclear situation, make a backup of the table to prevent data loss. (This rule can be displayed on the wall as a reminder).
This also helps us reach our goal.
To do this job, we need to disconnect from the SQL connection and use the mysqldump
tool.
exit;
mkdir /tmp/manticore_dumps
cd /tmp/manticore_dumps
mysqldump -h0 -P9306 -tc --compact manticore products > products.sql
The flags I used were:
-
-t
- to exclude thecreate table
statement from the dump. -
-c
- to specify column names in theINSERT
commands. -
--compact
- to perform a database dump in a compact format, which omits procedures for checking the availability of tables, deleting them, creating new ones, prescribing configurations for created tables, and other organizational actions for deployment. In our case, these procedures were not necessary, as we needed to register a new dictionary file in the table. -
manticore
- is the database name formysqldump
, and must always bemanticore
when usingmysqldump
with Manticore. -
products
- is the name of our table for which the dump is being created. -
products.sql
- is the name of the file where the dump will be uploaded. The directory for placing the file defaults to the directory from whichmysqldump
was launched. I recommend specifying/tmp/
to ensure the examples work successfully further. Alternatively, you can type the full path:/tmp/manticore_dumps/products.sql
.
Don't worry if you see: "-- Warning: column statistics not supported by the server." It's normal when you use this mysql tool with Manticore.
Great, we have our first dump!
Updating wordforms file:
Have you forgotten where the source file for the word forms is located?
cd /tmp/wordforms
If you are using MC and for some reason do not see columns with files, try using the Ctrl + O combination. If nothing changes, perhaps MC is not running.
Next, use the arrow keys to navigate. When you need to select a file, press F4 to open the selected file.
Add new string: "drive > pump"
Save & close. F2, Enter, F10.
File updated.
Deleting the table and creating a new one:
Before deleting the table, please make sure that you have the dump and that it definitely contains our data.
cd /tmp/manticore_dumps/
Choose the dump file and press the F3 button to view it.
You can do the same in the console using the command cat /tmp/manticore_dumps/products.sql
.
You should see something like this:
OK, let's go ahead and drop our old table and create a new one:
mysql -h0 -P9306
DROP TABLE products;
CREATE TABLE products (name TEXT, info TEXT, price FLOAT, avl BOOL) morphology = 'stem_en' wordforms='/tmp/wordforms/wf_pet_products.txt';
So, we removed this table, and then made it again right away (using the same command as before but with a small change: now the name field is also for text), and now we have to add information to it.
exit;
Repopulating from the dump:
mysql -h0 -P9306 < /tmp/manticore_dumps/products.sql
mysql -h0 -P9306
We filled it out. Let's check what we have in the end:
SELECT * FROM products WHERE match ('pump');
When updating a table, especially a large one, there is a period when the old table has been deleted and the new one has not yet been created. During this interval, the system may respond to requests with errors. To ensure smooth operations on Manticore, there are several systems in place that help avoid losing user requests. I will learn how to implement that later.
In this article, I showed how I set up and used Manticore Search for a pet store's inventory. Using wordforms and morphology, Manticore helps improve search results by linking related product names and types. I covered how I added new items, updated existing ones, and ensured data consistency during major updates like changing wordforms files. This helps beginners understand and apply Manticore Search's features effectively, making data searches more efficient. Future posts will explore more features, so let's keep experimenting with Manticore together to enhance your projects.
Posted on June 6, 2024
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.
Related
November 29, 2024