Piotr
Posted on August 6, 2022
In this tutorial, I'll show you how to generate embeddings for Polish text using Sentence Transformers. I won't explain how they work; there are already many great articles on that.
All we need is the Sentence Transformers library (the models themselves are hosted on Hugging Face):
pip install sentence-transformers
The code is simple: we import the library, create a model, and ask it for embeddings.
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('Voicelab/sbert-base-cased-pl')
embeddings = model.encode(["Ten tekst zostanie zakodowany"])
print(embeddings)
And that's it. This is the output of the model:
[[ 7.74895132e-01 7.00104088e-02 -5.02209544e-01 -2.06187874e-01
-1.28363922e-01 1.18705399e-01 -1.88303709e-01 -9.09971595e-02
...
To use a different model, replace Voicelab/sbert-base-cased-pl with a model from this list; it's pre-filtered for the Polish language.
These embeddings are useful for tasks such as classification and similarity search.
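As a minimal sketch of the similarity search idea, the snippet below ranks a handful of vectors by their cosine similarity to a query vector. The function name top_k_similar and the toy 3-dimensional vectors are assumptions made for illustration; in practice both the query and the corpus vectors would come from model.encode.

```python
import numpy as np


def top_k_similar(query_vec, corpus_vecs, k=2):
    """Return (index, cosine similarity) pairs for the k closest corpus vectors."""
    query = np.asarray(query_vec, dtype=float)
    corpus = np.asarray(corpus_vecs, dtype=float)
    # Cosine similarity: dot product divided by the product of the norms.
    sims = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))
    # Sort by similarity, highest first, and keep the k best matches.
    best = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i])) for i in best]


# Toy vectors standing in for real model.encode(...) output (hypothetical data).
corpus_vecs = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0],
])
query_vec = np.array([1.0, 0.0, 0.0])

results = top_k_similar(query_vec, corpus_vecs, k=2)
print(results)
```

With real embeddings, query_vec would be the embedding of the text you are searching for and corpus_vecs the embeddings of the texts you are searching in.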
Example of usage
I have a list of sentences and I want to know which ones are the most similar. How could I do that? As you can guess, with embeddings. We'll calculate a pairwise similarity matrix over all the sentences and see which pairs are the most similar.
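sentences = [
    "Pożar w mieście. Zginęło 10 osób.",  # "Fire in the city. 10 people died."
    "Wypadek pod wiaduktem kolejowym.",  # "Accident under the railway viaduct."
    "W Poniedziałek odbędzie się konferencja naukowa",  # "A scientific conference will take place on Monday"
    "Magia potrafi wzniecać pożary",  # "Magic can start fires"
]
embeddings = model.encode(sentences)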
I'll use cosine similarity as the measure of similarity.
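import seaborn as sns
from sklearn.metrics import pairwise
# Pairwise cosine similarity between all sentence embeddings, drawn as a heatmap.
sns.heatmap(pairwise.cosine_similarity(embeddings, embeddings))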
From this heatmap we can see that the model works: it found similarity between the sentences containing pożar (fire) and wypadek (accident), which both refer to an accident.
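If you'd rather read the result as numbers than as colours, here is a small sketch that lists every pair from a similarity matrix, most similar first. The helper name most_similar_pairs is hypothetical, and the matrix values below are made up for illustration; they are not the real model output for the sentences above. With real data you would pass in the result of pairwise.cosine_similarity(embeddings, embeddings).

```python
import numpy as np


def most_similar_pairs(similarity, labels):
    """List every distinct pair of labels with its score, most similar first."""
    sim = np.asarray(similarity, dtype=float)
    # Indices above the diagonal: each pair appears once, no self-comparisons.
    rows, cols = np.triu_indices(sim.shape[0], k=1)
    order = np.argsort(-sim[rows, cols])
    return [
        (labels[rows[i]], labels[cols[i]], float(sim[rows[i], cols[i]]))
        for i in order
    ]


# Hypothetical similarity values, NOT real model output.
toy_similarity = np.array([
    [1.0, 0.6, 0.1, 0.5],
    [0.6, 1.0, 0.2, 0.3],
    [0.1, 0.2, 1.0, 0.0],
    [0.5, 0.3, 0.0, 1.0],
])
toy_labels = ["s0", "s1", "s2", "s3"]

pairs = most_similar_pairs(toy_similarity, toy_labels)
for first, second, score in pairs[:3]:
    print(f"{first} <-> {second}: {score:.2f}")
```

Only the upper triangle is used because the matrix is symmetric and the diagonal just compares each sentence with itself.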
This work was prepared as part of the project "Hackathon Open Gov Data oraz stworzenie innowacyjnych aplikacji, z wykorzystaniem technologii GPU" (Hackathon Open Gov Data and the creation of innovative applications using GPU technology), co-financed by the Minister of Education and Science from the state budget under the programme "Studenckie koła naukowe tworzą innowacje" (Student research groups create innovation).