d[IA]gnosis: Vectorizing Diagnostics with Embedded Python and LLM Models

In the previous article we presented the d[IA]gnosis application developed to support the coding of diagnoses in ICD-10. In this article we will see how InterSystems IRIS for Health provides us with the necessary tools for the generation of vectors from the ICD-10 code list using a pre-trained language model, its storage and the subsequent search for similarities on all these generated vectors.

Introduction

One of the main features that have emerged with the development of AI models is what we know as RAG (Retrieval-Augmented Generation) that allows us to improve the results of LLM models by incorporating a context into the model. Well, in our example the context is given by the set of ICD-10 diagnoses and to use them we must first vectorize them.

How to vectorize our list of diagnoses?

SentenceTransformers and Embedded Python

For the generation of vectors we have used the Python library SentenceTransformers which greatly facilitates the vectorization of free texts from pre-trained models. From their own website:

Sentence Transformers (a.k.a. SBERT) is the go-to Python module for accessing, using, and training state-of-the-art text and image embedding models. It can be used to compute embeddings using Sentence Transformer models (quickstart) or to calculate similarity scores using Cross-Encoder models (quickstart). This unlocks a wide range of applications, including semantic search, semantic textual similarity, and paraphrase mining.

Among all the models developed by the SentenceTransformers community we have found BioLORD-2023-M, a pre-trained model that will generate 786-dimensional vectors.

This model was trained using BioLORD, a new pre-training strategy for producing meaningful representations for clinical sentences and biomedical concepts.

State-of-the-art methodologies operate by maximizing the similarity in representation of names referring to the same concept, and preventing collapse through contrastive learning. However, because biomedical names are not always self-explanatory, it sometimes results in non-semantic representations.

BioLORD overcomes this issue by grounding its concept representations using definitions, as well as short descriptions derived from a multi-relational knowledge graph consisting of biomedical ontologies. Thanks to this grounding, our model produces more semantic concept representations that match more closely the hierarchical structure of ontologies. BioLORD-2023 establishes a new state of the art for text similarity on both clinical sentences (MedSTS) and biomedical concepts (EHR-Rel-B).

As you can see in its definition, this model is pre-trained with medical concepts that will be useful when vectorizing both our ICD-10 codes and free text.

For our project, we will download this model to speed up the vectors creation:

if not os.path.isdir('/shared/model/'):
    model = sentence_transformers.SentenceTransformer('FremyCompany/BioLORD-2023-M')            
    model.save('/shared/model/')

Once in our team, we can enter the texts to vectorize in lists to speed up the process, let's see how we vectorize the ICD-10 codes that we have previously recorded in our ENCODER.Object.Codes class.

st = iris.sql.prepare("SELECT TOP 50 CodeId, Description FROM ENCODER_Object.Codes WHERE VectorDescription is null ORDER BY ID ASC ")
resultSet = st.execute()
df = resultSet.dataframe()

if (df.size > 0):
    model = sentence_transformers.SentenceTransformer("/shared/model/")
    embeddings = model.encode(df['description'].tolist(), normalize_embeddings=True)

    df['vectordescription'] = embeddings.tolist()

    stmt = iris.sql.prepare("UPDATE ENCODER_Object.Codes SET VectorDescription = TO_VECTOR(?,DECIMAL) WHERE CodeId = ?")
    for index, row in df.iterrows():
        rs = stmt.execute(str(row['vectordescription']), row['codeid'])
else:
    flagLoop = False

As you can see, we first extract the codes stored in our ICD-10 code table that we have not yet vectorized but that we have recorded in a previous step after extracting it from the CSV file, then we extract the list of descriptions to vectorize and using the Python sentence_transformers library we will recover our model and generate the associated embeddings.

Finally, we will update the ICD-10 code with the vectorized description by executing the UPDATE. As you can see, the command to vectorize the result returned by the model is the SQL command TO_VECTOR in IRIS.

Using it in IRIS

Okay, we have our Python code, so we just need to wrap it in a class that extends Ens.BusinessProcess and include it in our production, then connect it to the Business Service in charge of retrieving the CSV file and that's it!

Let's take a look at what this code will look like in our production:

As you can see, we have our Business Service with the EnsLib.File.InboundAdapter adapter that will allow us to collect the code file and redirect it to our Business Process in which we will perform all the vectorization and storage operations, giving us a set of records like the following:

Now our application would be ready to start looking for possible matches with the texts we send it!

In the following article...

In the next article we will show how the application front-end developed in Angular 17 is integrated with our production in IRIS for Health and how IRIS receives the texts to be analyzed, vectorizes them and searches for similarities in the ICD-10 code table.

Don't miss it!

Blog

d[IA]gnosis: Vectorizing Diagnostics with Embedded Python and LLM Models

InterSystems Developer

Introduction

SentenceTransformers and Embedded Python

Using it in IRIS

In the following article...

Join Our Newsletter. No Spam, Only the good stuff.

Related